[jira] [Commented] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-19 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585438#comment-16585438
 ] 

Steve Loughran commented on SPARK-25155:


From SPARK-17159 I have a more cloud-optimized stream source here:
[CloudInputDStream|https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala].
See what you think and pick it up if you want to work on merging it in.

Optimising the scanning is good, but what the stores really need is listening 
to the event streams of new files arriving; AWS S3 and Azure both support that, 
though I don't know about the others. A stream source which worked with them would be 
*very* useful to people running on those infrastructures, as every GET/HEAD/LIST call 
runs up a bill and has pretty tangible latency too.
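
For illustration only, a minimal sketch of what consuming S3 event notifications
might look like, assuming an SQS queue already subscribed to the bucket's
ObjectCreated events and the AWS Java SDK (v1) on the classpath; the queue URL and
the omitted JSON parsing are placeholders, not part of any existing Spark source:

{code:scala}
import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import scala.collection.JavaConverters._

// Poll the queue that receives S3 ObjectCreated notifications, so new objects
// are discovered from events rather than from repeated LIST calls.
val sqs = AmazonSQSClientBuilder.defaultClient()
val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/new-objects" // placeholder
val bodies = sqs.receiveMessage(queueUrl).getMessages.asScala.map(_.getBody)
// Each body is the S3 event JSON; extracting bucket/key (omitted here) yields
// the s3a:// paths of newly created objects to feed into a stream source.
bodies.foreach(println)
{code}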

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to `org.apache.spark.streaming.dstream.FileInputDStream`,
> method `findNewFiles`.
> Streaming over a given path is supposed to pick up new files only (based on the
> previous run's timestamp). However, the code in Spark first obtains the
> directories, then finds the new files inside each directory. Here is the
> relevant code:
> {code:scala}
> val directoryFilter = new PathFilter {
>   override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
> }
> val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
> val newFiles = directories.flatMap(dir =>
>   fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> {code}
> This is not optimal, as it always requires two accesses. In addition, it
> appears to be buggy.
> I have an S3 bucket "mydata" with objects "a.csv" and "b.csv". I noticed that
> fs.globStatus("s3a://mydata/", directoryFilter).map(_.getPath) returned 0
> directories, so "a.csv" and "b.csv" were not picked up by Spark.
> I tried the path "s3a://mydata/*" and it didn't work either.
> I experienced the same problematic behavior with the local file system when
> streaming from "/Users/streaming/*".
> I suggest changing the code in Spark so that it performs the first listing
> without the directoryFilter, which does not seem needed at all. The code could be:
> {code:scala}
> val directoriesOrFiles = fs.globStatus(directoryPath).map(_.getPath)
> {code}
> The flow would then be, for each entry in directoriesOrFiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at
> the directory level
> This way it picks up both files at the root of the path and the contents of
> directories.
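
A minimal sketch of the proposed single-pass listing, assuming the enclosing
FileInputDStream scope (fs, directoryPath, newFileFilter) and the Hadoop
FileStatus/PathFilter APIs; this illustrates the idea above rather than the
actual patch:

{code:scala}
import org.apache.hadoop.fs.FileStatus

// Sketch only: glob once without a directory filter, then branch per entry.
val entries = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
val newFiles = entries.flatMap { status =>
  if (status.isDirectory) {
    // Existing behaviour: list the directory and apply the modification-time filter.
    fs.listStatus(status.getPath, newFileFilter).map(_.getPath.toString).toSeq
  } else if (newFileFilter.accept(status.getPath)) {
    // New behaviour: plain objects at the root of the glob are picked up directly.
    Seq(status.getPath.toString)
  } else {
    Seq.empty[String]
  }
}
{code}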



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-19 Thread Gil Vernik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585388#comment-16585388
 ] 

Gil Vernik commented on SPARK-25155:


[~ste...@apache.org] What is your input on the issue I reported? Do you think it's 
a bug? And what do you think of the proposed change? Thanks

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to `org.apache.spark.streaming.dstream.FileInputDStream`,
> method `findNewFiles`.
> Streaming over a given path is supposed to pick up new files only (based on the
> previous run's timestamp). However, the code in Spark first obtains the
> directories, then finds the new files inside each directory. Here is the
> relevant code:
> {code:scala}
> val directoryFilter = new PathFilter {
>   override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
> }
> val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
> val newFiles = directories.flatMap(dir =>
>   fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> {code}
> This is not optimal, as it always requires two accesses. In addition, it
> appears to be buggy.
> I have an S3 bucket "mydata" with objects "a.csv" and "b.csv". I noticed that
> fs.globStatus("s3a://mydata/", directoryFilter).map(_.getPath) returned 0
> directories, so "a.csv" and "b.csv" were not picked up by Spark.
> I tried the path "s3a://mydata/*" and it didn't work either.
> I experienced the same problematic behavior with the local file system when
> streaming from "/Users/streaming/*".
> I suggest changing the code in Spark so that it performs the first listing
> without the directoryFilter, which does not seem needed at all. The code could be:
> {code:scala}
> val directoriesOrFiles = fs.globStatus(directoryPath).map(_.getPath)
> {code}
> The flow would then be, for each entry in directoriesOrFiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at
> the directory level
> This way it picks up both files at the root of the path and the contents of
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-19 Thread Gil Vernik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585384#comment-16585384
 ] 

Gil Vernik commented on SPARK-25155:


[~dongjoon] Of course, committers decide versions, when to merge, etc. At this 
point I just need to get general input on this issue. If everyone agrees it's a bug 
and my proposed change is fine, then I can also submit the patch.

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to `org.apache.spark.streaming.dstream.FileInputDStream`,
> method `findNewFiles`.
> Streaming over a given path is supposed to pick up new files only (based on the
> previous run's timestamp). However, the code in Spark first obtains the
> directories, then finds the new files inside each directory. Here is the
> relevant code:
> {code:scala}
> val directoryFilter = new PathFilter {
>   override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
> }
> val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
> val newFiles = directories.flatMap(dir =>
>   fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> {code}
> This is not optimal, as it always requires two accesses. In addition, it
> appears to be buggy.
> I have an S3 bucket "mydata" with objects "a.csv" and "b.csv". I noticed that
> fs.globStatus("s3a://mydata/", directoryFilter).map(_.getPath) returned 0
> directories, so "a.csv" and "b.csv" were not picked up by Spark.
> I tried the path "s3a://mydata/*" and it didn't work either.
> I experienced the same problematic behavior with the local file system when
> streaming from "/Users/streaming/*".
> I suggest changing the code in Spark so that it performs the first listing
> without the directoryFilter, which does not seem needed at all. The code could be:
> {code:scala}
> val directoriesOrFiles = fs.globStatus(directoryPath).map(_.getPath)
> {code}
> The flow would then be, for each entry in directoriesOrFiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at
> the directory level
> This way it picks up both files at the root of the path and the contents of
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585381#comment-16585381
 ] 

Apache Spark commented on SPARK-25132:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22148

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema use different letter cases, regardless of whether spark.sql.caseSensitive 
> is set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  
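
The same mismatch can also be reproduced end to end from the Scala API; the
warehouse location below mirrors the one in the report and is an assumption
about the local environment:

{code:scala}
// Write lowercase `id` via a managed table, then map an uppercase `ID` schema
// onto the same Parquet files: the column comes back as NULL.
spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
spark.sql(
  """CREATE TABLE t2 (ID BIGINT)
    |USING parquet
    |LOCATION 'hdfs://localhost/user/hive/warehouse/t1'""".stripMargin)
spark.table("t2").show()  // NULL rows: `ID` does not match the Parquet field `id`
{code}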



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing

2018-08-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585374#comment-16585374
 ] 

Hyukjin Kwon commented on SPARK-25147:
--

It works on:

Python 2.7.14
Pandas 0.19.2
PyArrow 0.9.0
Numpy 1.11.3

Sounds specific to the PyArrow or Pandas version from a cursory look.

> GroupedData.apply pandas_udf crashing
> -
>
> Key: SPARK-25147
> URL: https://issues.apache.org/jira/browse/SPARK-25147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>Reporter: Mike Sukmanowsky
>Priority: Major
>
> Running the following example taken straight from the docs results in 
> {{org.apache.spark.SparkException: Python worker exited unexpectedly 
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = (
> SparkSession
> .builder
> .appName("pandas_udf")
> .getOrCreate()
> )
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v")
> )
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
> v = pdf.v
> return pdf.assign(v=(v - v.mean()) / v.std())
> (
> df
> .groupby("id")
> .apply(normalize)
> .show()
> )
> {code}
>  See output.log for 
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25144.
--
Resolution: Cannot Reproduce

{code}
scala> case class Foo(bar: Option[String])
defined class Foo

scala> val ds = List(Foo(Some("bar"))).toDS
ds: org.apache.spark.sql.Dataset[Foo] = [bar: string]

scala> val result = ds.flatMap(_.bar).distinct
result: org.apache.spark.sql.Dataset[String] = [value: string]

scala> result.rdd.isEmpty
res0: Boolean = false
{code}

I tried this on master and am unable to reproduce it, so I am resolving this. It 
would be nicer if we could identify the fix and see if we can backport it. Please 
link this to the duplicate if it's identified.

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.3.1
> Environment: spark 2.3.1
>Reporter: Ayoub Benali
>Priority: Major
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is not 
> used.

[jira] [Updated] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error

2018-08-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25134:
-
Target Version/s:   (was: 2.4.0)

> Csv column pruning with checking of headers throws incorrect error
> --
>
> Key: SPARK-25134
> URL: https://issues.apache.org/jira/browse/SPARK-25134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: spark master branch at 
> a791c29bd824adadfb2d85594bc8dad4424df936
>Reporter: koert kuipers
>Priority: Minor
>
> Hello!
> It seems to me there is some interaction between CSV column pruning and the 
> checking of CSV headers that is causing issues. For example, this fails:
> {code:scala}
> Seq(("a", "b")).toDF("columnA", "columnB").write
>   .format("csv")
>   .option("header", true)
>   .save(dir)
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("enforceSchema", false)
>   .load(dir)
>   .select("columnA")
>   .show
> {code}
> the error is:
> {code:bash}
> 291.0 (TID 319, localhost, executor driver): 
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
> [info]  Header length: 1, schema size: 2
> {code}
> If I remove the projection it works fine. If I disable column pruning it also 
> works fine.
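
As a hedged workaround sketch for the interaction described above, one can turn
CSV parser column pruning off for the session and retry the pruned read; the conf
key below is the master-branch SQL option and is an assumption here, and `dir` is
the same path used in the snippet above:

{code:scala}
// Disable CSV parser column pruning, then re-run the failing projection.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
spark.read
  .format("csv")
  .option("header", true)
  .option("enforceSchema", false)
  .load(dir)
  .select("columnA")
  .show()
{code}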



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25149) Personalized Page Rank raises an error if vertexIDs are > MaxInt

2018-08-19 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585360#comment-16585360
 ] 

Dongjoon Hyun commented on SPARK-25149:
---

Thank you for the contribution, [~bago.amirbekian]. `Fix Version` will be set 
later when this is fixed (see the 'JIRA' section of 
http://spark.apache.org/contributing.html).

> Personalized Page Rank raises an error if vertexIDs are > MaxInt
> 
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In the 
> current implementation, the sparse vectors used and returned by the method are 
> indexed by the _position_ of the vertexId in `sources`, not by the vertex ID 
> itself. We should remove this check and add a test to confirm the implementation 
> works with large vertex IDs.
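
A hedged sketch of the suggested regression test, assuming a spark-shell style
`sc` and GraphX's parallel personalized PageRank API; the tiny triangle graph and
the iteration count are illustrative choices only:

{code:scala}
import org.apache.spark.graphx.{Edge, Graph}

// Vertex IDs deliberately larger than Int.MaxValue.
val base = Int.MaxValue.toLong + 10L
val edges = sc.parallelize(Seq(
  Edge(base + 1L, base + 2L, 1.0),
  Edge(base + 2L, base + 3L, 1.0),
  Edge(base + 3L, base + 1L, 1.0)))
val graph = Graph.fromEdges(edges, defaultValue = 1.0)

// Should complete once the Int-range check is removed; today it fails that check.
val ranks = graph.staticParallelPersonalizedPageRank(Array(base + 1L), 10)
assert(ranks.vertices.count() == 3L)
{code}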



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25149) Personalized Page Rank raises an error if vertexIDs are > MaxInt

2018-08-19 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25149:
--
Fix Version/s: (was: 2.4.0)

> Personalized Page Rank raises an error if vertexIDs are > MaxInt
> 
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In the 
> current implementation, the sparse vectors used and returned by the method are 
> indexed by the _position_ of the vertexId in `sources`, not by the vertex ID 
> itself. We should remove this check and add a test to confirm the implementation 
> works with large vertex IDs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-19 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585355#comment-16585355
 ] 

Dongjoon Hyun commented on SPARK-25155:
---

Hi, [~gvernik]. Thank you for reporting, but `Fix Version` and `Target Version` 
will be handled by the Apache Spark committers later.
- http://spark.apache.org/contributing.html  (`JIRA` section).

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to `org.apache.spark.streaming.dstream.FileInputDStream`,
> method `findNewFiles`.
> Streaming over a given path is supposed to pick up new files only (based on the
> previous run's timestamp). However, the code in Spark first obtains the
> directories, then finds the new files inside each directory. Here is the
> relevant code:
> {code:scala}
> val directoryFilter = new PathFilter {
>   override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
> }
> val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
> val newFiles = directories.flatMap(dir =>
>   fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> {code}
> This is not optimal, as it always requires two accesses. In addition, it
> appears to be buggy.
> I have an S3 bucket "mydata" with objects "a.csv" and "b.csv". I noticed that
> fs.globStatus("s3a://mydata/", directoryFilter).map(_.getPath) returned 0
> directories, so "a.csv" and "b.csv" were not picked up by Spark.
> I tried the path "s3a://mydata/*" and it didn't work either.
> I experienced the same problematic behavior with the local file system when
> streaming from "/Users/streaming/*".
> I suggest changing the code in Spark so that it performs the first listing
> without the directoryFilter, which does not seem needed at all. The code could be:
> {code:scala}
> val directoriesOrFiles = fs.globStatus(directoryPath).map(_.getPath)
> {code}
> The flow would then be, for each entry in directoriesOrFiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at
> the directory level
> This way it picks up both files at the root of the path and the contents of
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-19 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25155:
--
Target Version/s:   (was: 2.3.1)

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to `org.apache.spark.streaming.dstream.FileInputDStream`,
> method `findNewFiles`.
> Streaming over a given path is supposed to pick up new files only (based on the
> previous run's timestamp). However, the code in Spark first obtains the
> directories, then finds the new files inside each directory. Here is the
> relevant code:
> {code:scala}
> val directoryFilter = new PathFilter {
>   override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
> }
> val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
> val newFiles = directories.flatMap(dir =>
>   fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> {code}
> This is not optimal, as it always requires two accesses. In addition, it
> appears to be buggy.
> I have an S3 bucket "mydata" with objects "a.csv" and "b.csv". I noticed that
> fs.globStatus("s3a://mydata/", directoryFilter).map(_.getPath) returned 0
> directories, so "a.csv" and "b.csv" were not picked up by Spark.
> I tried the path "s3a://mydata/*" and it didn't work either.
> I experienced the same problematic behavior with the local file system when
> streaming from "/Users/streaming/*".
> I suggest changing the code in Spark so that it performs the first listing
> without the directoryFilter, which does not seem needed at all. The code could be:
> {code:scala}
> val directoriesOrFiles = fs.globStatus(directoryPath).map(_.getPath)
> {code}
> The flow would then be, for each entry in directoriesOrFiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at
> the directory level
> This way it picks up both files at the root of the path and the contents of
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25155) Streaming from storage doesn't work when no directories exists

2018-08-19 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25155:
--
Fix Version/s: (was: 2.3.2)

> Streaming from storage doesn't work when no directories exists
> --
>
> Key: SPARK-25155
> URL: https://issues.apache.org/jira/browse/SPARK-25155
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Gil Vernik
>Priority: Minor
>
> I have an issue related to `org.apache.spark.streaming.dstream.FileInputDStream`,
> method `findNewFiles`.
> Streaming over a given path is supposed to pick up new files only (based on the
> previous run's timestamp). However, the code in Spark first obtains the
> directories, then finds the new files inside each directory. Here is the
> relevant code:
> {code:scala}
> val directoryFilter = new PathFilter {
>   override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
> }
> val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
> val newFiles = directories.flatMap(dir =>
>   fs.listStatus(dir, newFileFilter).map(_.getPath.toString))
> {code}
> This is not optimal, as it always requires two accesses. In addition, it
> appears to be buggy.
> I have an S3 bucket "mydata" with objects "a.csv" and "b.csv". I noticed that
> fs.globStatus("s3a://mydata/", directoryFilter).map(_.getPath) returned 0
> directories, so "a.csv" and "b.csv" were not picked up by Spark.
> I tried the path "s3a://mydata/*" and it didn't work either.
> I experienced the same problematic behavior with the local file system when
> streaming from "/Users/streaming/*".
> I suggest changing the code in Spark so that it performs the first listing
> without the directoryFilter, which does not seem needed at all. The code could be:
> {code:scala}
> val directoriesOrFiles = fs.globStatus(directoryPath).map(_.getPath)
> {code}
> The flow would then be, for each entry in directoriesOrFiles:
>  * If it is a data object: Spark applies newFileFilter to the returned objects
>  * If it is a directory: the existing code performs an additional listing at
> the directory level
> This way it picks up both files at the root of the path and the contents of
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-25156) Same query returns different result

2018-08-19 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-25156.
-

> Same query returns different result
> ---
>
> Key: SPARK-25156
> URL: https://issues.apache.org/jira/browse/SPARK-25156
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: * Spark Version: 2.1.1
>  * Java Version: Java 7
>  * Scala Version: 2.11.8
>Reporter: Yonghwan Lee
>Priority: Major
>  Labels: Question
>
> I performed two inner joins and two left outer joins on five tables.
> The same query returns different results when run multiple times.
> Table A
>   
> ||Column a||Column b||Column c||Column d||
> |Long(nullable: false)|Integer(nullable: false)|String(nullable: 
> true)|String(nullable: false)|
> Table B
> ||Column a||Column b||
> |Long(nullable: false)|String(nullable: false)|
> Table C
> ||Column a||Column b||
> |Integer(nullable: false)|String(nullable: false)|
> Table D
> ||Column a||Column b||Column c||
> |Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|
> Table E
> ||Column a||Column b||Column c||
> |Long(nullable: false)|Integer(nullable: false)|String|
> Query(Spark SQL)
> {code:java}
> select A.c, B.b, C.b, D.c, E.c
> from A
> inner join B on A.a = B.a
> inner join C on A.b = C.a
> left outer join D on A.d <=> cast(D.a as string)
> left outer join E on D.b = E.a and D.c = E.b{code}
>  
> I ran the above query 10 times: it returned the correct result (count: 
> 830001460) 7 times and an incorrect result (count: 830001299) 3 times.
>  
> I also executed
> {code:java}
> sql("set spark.sql.shuffle.partitions=801"){code}
> before executing the query.
> Tables A and B have a lot of rows but table C holds a small dataset, so in the 
> physical plan the A <-> B join is performed with SortMergeJoin and the 
> (A,B) <-> C join with BroadcastHashJoin.
>  
> Now that I have removed the set spark.sql.shuffle.partitions statement, it 
> works fine.
> Is this a Spark SQL bug?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25156) Same query returns different result

2018-08-19 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25156.
---
Resolution: Duplicate

For now, I'll close this as a duplicate of SPARK-23207. Please refer to SPARK-23207 
to see the progress.

> Same query returns different result
> ---
>
> Key: SPARK-25156
> URL: https://issues.apache.org/jira/browse/SPARK-25156
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: * Spark Version: 2.1.1
>  * Java Version: Java 7
>  * Scala Version: 2.11.8
>Reporter: Yonghwan Lee
>Priority: Major
>  Labels: Question
>
> I performed two inner joins and two left outer joins on five tables.
> The same query returns different results when run multiple times.
> Table A
>   
> ||Column a||Column b||Column c||Column d||
> |Long(nullable: false)|Integer(nullable: false)|String(nullable: 
> true)|String(nullable: false)|
> Table B
> ||Column a||Column b||
> |Long(nullable: false)|String(nullable: false)|
> Table C
> ||Column a||Column b||
> |Integer(nullable: false)|String(nullable: false)|
> Table D
> ||Column a||Column b||Column c||
> |Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|
> Table E
> ||Column a||Column b||Column c||
> |Long(nullable: false)|Integer(nullable: false)|String|
> Query(Spark SQL)
> {code:java}
> select A.c, B.b, C.b, D.c, E.c
> from A
> inner join B on A.a = B.a
> inner join C on A.b = C.a
> left outer join D on A.d <=> cast(D.a as string)
> left outer join E on D.b = E.a and D.c = E.b{code}
>  
> I ran the above query 10 times: it returned the correct result (count: 
> 830001460) 7 times and an incorrect result (count: 830001299) 3 times.
>  
> I also executed
> {code:java}
> sql("set spark.sql.shuffle.partitions=801"){code}
> before executing the query.
> Tables A and B have a lot of rows but table C holds a small dataset, so in the 
> physical plan the A <-> B join is performed with SortMergeJoin and the 
> (A,B) <-> C join with BroadcastHashJoin.
>  
> Now that I have removed the set spark.sql.shuffle.partitions statement, it 
> works fine.
> Is this a Spark SQL bug?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25156) Same query returns different result

2018-08-19 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585350#comment-16585350
 ] 

Dongjoon Hyun commented on SPARK-25156:
---

Hi, [~leeyh0216]. Yes. It looks like SPARK-23207 and SPARK-23243.

> Same query returns different result
> ---
>
> Key: SPARK-25156
> URL: https://issues.apache.org/jira/browse/SPARK-25156
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: * Spark Version: 2.1.1
>  * Java Version: Java 7
>  * Scala Version: 2.11.8
>Reporter: Yonghwan Lee
>Priority: Major
>  Labels: Question
>
> I performed two inner joins and two left outer joins on five tables.
> The same query returns different results when run multiple times.
> Table A
>   
> ||Column a||Column b||Column c||Column d||
> |Long(nullable: false)|Integer(nullable: false)|String(nullable: 
> true)|String(nullable: false)|
> Table B
> ||Column a||Column b||
> |Long(nullable: false)|String(nullable: false)|
> Table C
> ||Column a||Column b||
> |Integer(nullable: false)|String(nullable: false)|
> Table D
> ||Column a||Column b||Column c||
> |Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|
> Table E
> ||Column a||Column b||Column c||
> |Long(nullable: false)|Integer(nullable: false)|String|
> Query(Spark SQL)
> {code:java}
> select A.c, B.b, C.b, D.c, E.c
> from A
> inner join B on A.a = B.a
> inner join C on A.b = C.a
> left outer join D on A.d <=> cast(D.a as string)
> left outer join E on D.b = E.a and D.c = E.b{code}
>  
> I ran the above query 10 times: it returned the correct result (count: 
> 830001460) 7 times and an incorrect result (count: 830001299) 3 times.
>  
> I also executed
> {code:java}
> sql("set spark.sql.shuffle.partitions=801"){code}
> before executing the query.
> Tables A and B have a lot of rows but table C holds a small dataset, so in the 
> physical plan the A <-> B join is performed with SortMergeJoin and the 
> (A,B) <-> C join with BroadcastHashJoin.
>  
> Now that I have removed the set spark.sql.shuffle.partitions statement, it 
> works fine.
> Is this a Spark SQL bug?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-08-19 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25145.
---
Resolution: Cannot Reproduce

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
>  # 
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter console 5.1.0
> Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
> In [1]: %run -i create_bug.py
> Welcome to Spark (ASCII banner) version 2.3.3-SNAPSHOT
> Using Python version 3.6.3 (default, May 4 2018 04:22:28)
> SparkSession available as 'spark'.
> Created spark dataframe:
> +---+---+
> | a| b|
> +---+---+
> | 0|0.0|
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 4|2.0|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> +---+---+
> Read entire table with "filterPushdown"=True
> +---+---+
> | a| b|
> +---+---+
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> | 4|2.0|
> | 0|0.0|
> 

[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-08-19 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585347#comment-16585347
 ] 

Dongjoon Hyun commented on SPARK-25145:
---

So far, I cannot reproduce this in Apache Spark. In addition, `Buffer size 
too small` usually indicates a configuration issue. If you hit this issue again, 
you may increase `orc.compress.size` (and set `orc.buffer.size.enforce=true` if 
needed).
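
For reference, a minimal sketch of how those ORC settings might be applied from a
Spark session before retrying the failing read; the 4 MB value and the table name
are placeholders, not recommendations:

{code:scala}
// Raise the ORC compression buffer size (and enforce it) on the Hadoop conf
// used by the native ORC reader, then re-run the pushed-down query.
spark.sparkContext.hadoopConfiguration.set("orc.compress.size", (4 * 1024 * 1024).toString)
spark.sparkContext.hadoopConfiguration.set("orc.buffer.size.enforce", "true")
spark.sql("SELECT * FROM some_orc_table WHERE a > 5").show()  // placeholder table
{code}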

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
>  # 
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter console 5.1.0
> Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
> In [1]: %run -i create_bug.py
> Welcome to Spark (ASCII banner) version 2.3.3-SNAPSHOT
> Using Python version 3.6.3 (default, May 4 2018 04:22:28)
> SparkSession available as 'spark'.
> Created spark dataframe:
> +---+---+
> | a| b|
> +---+---+
> | 0|0.0|
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 4|2.0|
> | 5|2.5|
> | 

[jira] [Resolved] (SPARK-25131) Event logs missing applicationAttemptId for SparkListenerApplicationStart

2018-08-19 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S resolved SPARK-25131.
-
Resolution: Not A Problem

> Event logs missing applicationAttemptId for SparkListenerApplicationStart
> -
>
> Key: SPARK-25131
> URL: https://issues.apache.org/jira/browse/SPARK-25131
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.1
>Reporter: Ajith S
>Priority: Minor
>
> When *master=yarn* and *deploy-mode=client*, event logs do not contain the 
> applicationAttemptId for SparkListenerApplicationStart. This is caused at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend#start, where we 
> call bindToYarn(client.submitApplication(), None), which sets appAttemptId to 
> None. We can, however, get the appAttemptId after waitForApplication() and set 
> it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585326#comment-16585326
 ] 

Apache Spark commented on SPARK-24434:
--

User 'onursatici' has created a pull request for this issue:
https://github.com/apache/spark/pull/22146

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24434:


Assignee: Apache Spark

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Apache Spark
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24434:


Assignee: (was: Apache Spark)

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585321#comment-16585321
 ] 

Saisai Shao commented on SPARK-25114:
-

What's the ETA of this issue [~jiangxb1987]?

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.
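
A minimal illustration of the failure mode named in the title, assuming a
comparator that reduces the long difference modulo Integer.MAX_VALUE before
truncating to Int (the actual Spark code may differ):

{code:scala}
// Two different words whose difference is a multiple of Int.MaxValue collapse
// to 0 under a modulo-and-truncate comparison, i.e. they look "equal".
val left  = Int.MaxValue.toLong   // 2147483647
val right = 0L
val broken = ((left - right) % Int.MaxValue).toInt
println(broken)                               // 0 -> wrongly signals equality
println(java.lang.Long.compare(left, right))  // 1 -> correct ordering
{code}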



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-08-19 Thread Leif Walsh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585286#comment-16585286
 ] 

Leif Walsh edited comment on SPARK-21187 at 8/19/18 10:44 PM:
--

[~bryanc] is there anything I can help elaborate on, or do you just need to 
decide whether or not to do it? (regarding multi-indexing)


was (Author: leif):
[~bryanc] is there anything I can help elaborate on, or do you just need to 
decide whether or not to do it? 

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * -*Binary*-
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif]; should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-08-19 Thread Leif Walsh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585286#comment-16585286
 ] 

Leif Walsh commented on SPARK-21187:


[~bryanc] is there anything I can help elaborate on, or do you just need to 
decide whether or not to do it? 

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * -*Binary*-
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif]; should 
> we support multi-indexing?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25152) Enable Spark on Kubernetes R Integration Tests

2018-08-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585280#comment-16585280
 ] 

Apache Spark commented on SPARK-25152:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22145

> Enable Spark on Kubernetes R Integration Tests
> --
>
> Key: SPARK-25152
> URL: https://issues.apache.org/jira/browse/SPARK-25152
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, SparkR
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> We merged [https://github.com/apache/spark/pull/21584] for SPARK-24433 but we 
> had to turn off the integration tests due to issues with the Jenkins 
> environment. Re-enable the tests after the environment is fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25152) Enable Spark on Kubernetes R Integration Tests

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25152:


Assignee: (was: Apache Spark)

> Enable Spark on Kubernetes R Integration Tests
> --
>
> Key: SPARK-25152
> URL: https://issues.apache.org/jira/browse/SPARK-25152
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, SparkR
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> We merged [https://github.com/apache/spark/pull/21584] for SPARK-24433 but we 
> had to turn off the integration tests due to issues with the Jenkins 
> environment. Re-enable the tests after the environment is fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25152) Enable Spark on Kubernetes R Integration Tests

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25152:


Assignee: Apache Spark

> Enable Spark on Kubernetes R Integration Tests
> --
>
> Key: SPARK-25152
> URL: https://issues.apache.org/jira/browse/SPARK-25152
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, SparkR
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Major
>
> We merged [https://github.com/apache/spark/pull/21584] for SPARK-24433 but we 
> had to turn off the integration tests due to issues with the Jenkins 
> environment. Re-enable the tests after the environment is fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24935:


Assignee: Apache Spark

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Assignee: Apache Spark
>Priority: Major
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF 
> provides support for partial aggregation, and has removed the functionality 
> that supported complete mode aggregation(Refer 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of 
> expecting update method to be called, merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)]
>  which throws the exception as described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2018-08-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585259#comment-16585259
 ] 

Apache Spark commented on SPARK-24935:
--

User 'pgandhi999' has created a pull request for this issue:
https://github.com/apache/spark/pull/22144

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF 
> provides support for partial aggregation, and has removed the functionality 
> that supported complete mode aggregation(Refer 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of 
> expecting update method to be called, merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)]
>  which throws the exception as described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24935:


Assignee: (was: Apache Spark)

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF 
> provides support for partial aggregation, and has removed the functionality 
> that supported complete mode aggregation(Refer 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of 
> expecting update method to be called, merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)]
>  which throws the exception as described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24961) sort operation causes out of memory

2018-08-19 Thread Markus Breuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585258#comment-16585258
 ] 

Markus Breuer commented on SPARK-24961:
---

I retested the above code and had to correct the setting spark.driver.maxResultSize to 0 
(no limit), otherwise the task was cancelled. I am not sure why it happened this time - 
probably a different java version or my mistake - so this code only runs into oom when 
the result size limit is removed.

Then I used values of 10, 25, 50, 100 and 2000 for 
spark.sql.execution.rangeExchange.sampleSizePerPartition, but the task runs into oom at 
partition 70 of 200.

Finally I used partition counts of 20, 200 and 2000, but this also makes no difference. 
When using more partitions the counter increases a bit more, but the oom occurs - I 
think - at nearly the same position.

maxResultSize and out of memory fit together. Sorting these large objects seems to use 
much memory and causes the oom. From the behavior I think Spark tries to collect the 
whole result to the driver before writing it to disk. I removed df.show() from the code, 
so that write() is the only operation. There should be no difference between local and 
distributed mode here, write() should have the same behavior, shouldn't it?

I think maxResultSize should have no effect on any sort operation. What do you think?

> sort operation causes out of memory 
> 
>
> Key: SPARK-24961
> URL: https://issues.apache.org/jira/browse/SPARK-24961
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.1
> Environment: Java 1.8u144+
> Windows 10
> Spark 2.3.1 in local mode
> -Xms4g -Xmx4g
> optional: -XX:+UseParallelOldGC 
>Reporter: Markus Breuer
>Priority: Major
>
> A sort operation on a large RDD - which does not fit in memory - causes an out of 
> memory exception. I made the effect reproducible with a sample that creates large 
> objects of about 2 MB each. The OOM occurs when saving the result. I tried several 
> StorageLevels, but if memory is included (MEMORY_AND_DISK, MEMORY_AND_DISK_SER, none) 
> the application runs out of memory. Only DISK_ONLY seems to work.
> When replacing sort() with sortWithinPartitions() no StorageLevel is required 
> and the application succeeds.
> {code:java}
> package de.bytefusion.examples;
> import breeze.storage.Storage;
> import de.bytefusion.utils.Options;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.SequenceFileOutputFormat;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructType;
> import org.apache.spark.storage.StorageLevel;
> import scala.Tuple2;
> import static org.apache.spark.sql.functions.*;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.UUID;
> import java.util.stream.Collectors;
> import java.util.stream.IntStream;
> public class Example3 {
> public static void main(String... args) {
> // create spark session
> SparkSession spark = SparkSession.builder()
> .appName("example1")
> .master("local[4]")
> .config("spark.driver.maxResultSize","1g")
> .config("spark.driver.memory","512m")
> .config("spark.executor.memory","512m")
> .config("spark.local.dir","d:/temp/spark-tmp")
> .getOrCreate();
> JavaSparkContext sc = 
> JavaSparkContext.fromSparkContext(spark.sparkContext());
> // base to generate huge data
> List<Integer> list = new ArrayList<>();
> for (int val = 1; val < 1; val++) {
> int valueOf = Integer.valueOf(val);
> list.add(valueOf);
> }
> // create simple rdd of int
> JavaRDD<Integer> rdd = sc.parallelize(list, 200);
> // use map to create large object per row
> JavaRDD<Row> rowRDD =
> rdd
> .map(value -> 
> RowFactory.create(String.valueOf(value), 
> createLongText(UUID.randomUUID().toString(), 2 * 1024 * 1024)))
> // no persist => out of memory exception on write()
> // persist MEMORY_AND_DISK => out of memory exception 
> on write()
> // persist MEMORY_AND_DISK_SER => out of memory 
> exception on 

[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-08-19 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585244#comment-16585244
 ] 

Dongjoon Hyun commented on SPARK-25145:
---

[~bjensen], I tested this on both Spark 2.3.1 and today's `branch-2.3` with 
Python 3.7, and it seems to work. Could you confirm that you still see this on the 
latest branch? Or did I miss something from what you reported?
{code}
Using Python version 3.7.0 (default, Jun 29 2018 20:13:13)
SparkSession available as 'spark'.
>>> spark.version
'2.3.3-SNAPSHOT'
>>> import numpy as np
>>> import pandas as pd
>>> spark.conf.set('spark.sql.orc.impl', 'native')
>>> spark.conf.set('spark.sql.orc.filterPushdown', True)
>>> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
>>> sdf = spark.createDataFrame(df)
>>> sdf.write.saveAsTable(format='orc', mode='overwrite', name='t1', 
>>> compression='zlib')
>>> spark.sql('SELECT * FROM t1 WHERE a > 5').show()
+---+---+
|  a|  b|
+---+---+
|  8|4.0|
|  9|4.5|
|  7|3.5|
|  6|3.0|
+---+---+
{code}

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
>  # 
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter 

[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True

2018-08-19 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585242#comment-16585242
 ] 

Dongjoon Hyun commented on SPARK-25145:
---

Thank you for pinging me, [~mgaido]. I'll take a look at this.

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
> Environment:  
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath 
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>  
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 
> needed = 2205991
>  # 
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', 
> name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', 
> compression='zlib')
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Fetch entire table (works)
> print('Read entire table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown').show()
> # Ensure filterPushdown is disabled
> spark.conf.set('spark.sql.orc.filterPushdown', False)
> # Query without filterPushdown (works)
> print('Read a selection from table with "filterPushdown"=False')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> # Ensure filterPushdown is enabled
> spark.conf.set('spark.sql.orc.filterPushdown', True)
> # Query with filterPushDown (fails)
> print('Read a selection from table with "filterPushdown"=True')
> spark.sql('SELECT * FROM 
> bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show()
> {code}
> {noformat}
> ~/bug_report $ pyspark
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port 
> 4040. Attempting port 4041.
> Jupyter console 5.1.0
> Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
> In [1]: %run -i create_bug.py
> Welcome to
>  __
> / __/__ ___ _/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /__ / .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT
> /_/
> Using Python version 3.6.3 (default, May 4 2018 04:22:28)
> SparkSession available as 'spark'.
> Created spark dataframe:
> +---+---+
> | a| b|
> +---+---+
> | 0|0.0|
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 4|2.0|
> | 5|2.5|
> | 6|3.0|
> | 7|3.5|
> | 8|4.0|
> | 9|4.5|
> +---+---+
> Read entire table with "filterPushdown"=True
> +---+---+
> | a| b|
> +---+---+
> | 1|0.5|
> | 2|1.0|
> | 3|1.5|
> | 5|2.5|
> | 

[jira] [Assigned] (SPARK-24647) Sink Should Return Writen Offsets For ProgressReporting

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24647:


Assignee: Apache Spark

> Sink Should Return Writen Offsets For ProgressReporting
> ---
>
> Key: SPARK-24647
> URL: https://issues.apache.org/jira/browse/SPARK-24647
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Assignee: Apache Spark
>Priority: Major
>
> To be able to track data lineage for Structured Streaming (I intend to 
> implement this in the open source project Spline), the monitoring needs to be 
> able not only to track where the data was read from but also where the results 
> were written to. To my knowledge this is best implemented by monitoring 
> {{StreamingQueryProgress}}. However, the written data offsets are currently not 
> available on the {{Sink}} or {{StreamWriter}} interface. Implementing this as 
> proposed would also bring symmetry to the {{StreamingQueryProgress}} fields 
> sources and sink.
>  
> *Similar Proposals*
> Made in following jiras. These would not be sufficient for lineage tracking.
>  * https://issues.apache.org/jira/browse/SPARK-18258
>  * https://issues.apache.org/jira/browse/SPARK-21313
>  
> *Current State*
>  * Method {{Sink#addBatch}} returns {{Unit}}.
>  * Object {{WriterCommitMessage}} does not carry any progress information 
> about committed rows.
>  * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using 
> {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method.
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f"
>   }
> {code}
>  
>  
> *Proposed State*
>  * Implement support only for v2 sinks, as those are the ones to be used in the future.
>  * {{WriterCommitMessage}} to hold optional min and max offset information of 
> committed rows, e.g. Kafka does this by returning a {{RecordMetadata}} object from the 
> {{send}} method.
>  * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion 
> as {{sourceProgress}}.
>  
>  
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f",
>    "startOffset" : null,
>     "endOffset" { "sinkTopic": { "0": 333 }}
>   }
> {code}
>  
> *Implementation*
> * PR submitters: Me and [~wajda] as soon as prerequisite jira is merged.
>  * {{Sinks}}: Modify all v2 sinks to conform to a new interface or return dummy 
> values.
>  * {{ProgressReporter}}: Merge offsets from different batches properly, 
> similarly to how it is done for sources.
>  
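
As a concrete illustration of the current state described above, here is a hedged PySpark sketch (not part of the proposal) showing that lastProgress already reports start/end offsets per source while the sink entry carries only a description string; the rate source, sleep duration, and checkpoint path are illustrative assumptions.

{code:python}
# Hedged sketch: inspect source vs. sink entries in StreamingQueryProgress today.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sink-progress-demo").getOrCreate()

query = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/sink-progress-chk")
         .start())

time.sleep(10)                       # let a few micro-batches complete
progress = query.lastProgress        # plain dict in PySpark; None before the first batch
if progress:
    print(progress["sources"][0]["endOffset"])   # offsets are reported per source
    print(progress["sink"])                      # today only a 'description' field
query.stop()
{code}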



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24647) Sink Should Return Writen Offsets For ProgressReporting

2018-08-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585230#comment-16585230
 ] 

Apache Spark commented on SPARK-24647:
--

User 'vackosar' has created a pull request for this issue:
https://github.com/apache/spark/pull/22143

> Sink Should Return Writen Offsets For ProgressReporting
> ---
>
> Key: SPARK-24647
> URL: https://issues.apache.org/jira/browse/SPARK-24647
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Priority: Major
>
> To be able to track data lineage for Structured Streaming (I intend to 
> implement this in the open source project Spline), the monitoring needs to be 
> able not only to track where the data was read from but also where the results 
> were written to. To my knowledge this is best implemented by monitoring 
> {{StreamingQueryProgress}}. However, the written data offsets are currently not 
> available on the {{Sink}} or {{StreamWriter}} interface. Implementing this as 
> proposed would also bring symmetry to the {{StreamingQueryProgress}} fields 
> sources and sink.
>  
> *Similar Proposals*
> Made in following jiras. These would not be sufficient for lineage tracking.
>  * https://issues.apache.org/jira/browse/SPARK-18258
>  * https://issues.apache.org/jira/browse/SPARK-21313
>  
> *Current State*
>  * Method {{Sink#addBatch}} returns {{Unit}}.
>  * Object {{WriterCommitMessage}} does not carry any progress information 
> about committed rows.
>  * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using 
> {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method.
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f"
>   }
> {code}
>  
>  
> *Proposed State*
>  * Implement support only for v2 sinks, as those are the ones to be used in the future.
>  * {{WriterCommitMessage}} to hold optional min and max offset information of 
> committed rows, e.g. Kafka does this by returning a {{RecordMetadata}} object from the 
> {{send}} method.
>  * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion 
> as {{sourceProgress}}.
>  
>  
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f",
>    "startOffset" : null,
>     "endOffset" { "sinkTopic": { "0": 333 }}
>   }
> {code}
>  
> *Implementation*
> * PR submitters: Me and [~wajda] as soon as prerequisite jira is merged.
>  * {{Sinks}}: Modify all v2 sinks to conform to a new interface or return dummy 
> values.
>  * {{ProgressReporter}}: Merge offsets from different batches properly, 
> similarly to how it is done for sources.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24647) Sink Should Return Writen Offsets For ProgressReporting

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24647:


Assignee: (was: Apache Spark)

> Sink Should Return Writen Offsets For ProgressReporting
> ---
>
> Key: SPARK-24647
> URL: https://issues.apache.org/jira/browse/SPARK-24647
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vaclav Kosar
>Priority: Major
>
> To be able to track data lineage for Structured Streaming (I intend to 
> implement this in the open source project Spline), the monitoring needs to be 
> able not only to track where the data was read from but also where the results 
> were written to. To my knowledge this is best implemented by monitoring 
> {{StreamingQueryProgress}}. However, the written data offsets are currently not 
> available on the {{Sink}} or {{StreamWriter}} interface. Implementing this as 
> proposed would also bring symmetry to the {{StreamingQueryProgress}} fields 
> sources and sink.
>  
> *Similar Proposals*
> Made in following jiras. These would not be sufficient for lineage tracking.
>  * https://issues.apache.org/jira/browse/SPARK-18258
>  * https://issues.apache.org/jira/browse/SPARK-21313
>  
> *Current State*
>  * Method {{Sink#addBatch}} returns {{Unit}}.
>  * Object {{WriterCommitMessage}} does not carry any progress information 
> about committed rows.
>  * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using 
> {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method.
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f"
>   }
> {code}
>  
>  
> *Proposed State*
>  * Implement support only for v2 sinks, as those are the ones to be used in the future.
>  * {{WriterCommitMessage}} to hold optional min and max offset information of 
> committed rows, e.g. Kafka does this by returning a {{RecordMetadata}} object from the 
> {{send}} method.
>  * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion 
> as {{sourceProgress}}.
>  
>  
> {code:java}
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[test-topic]]",
>     "startOffset" : null,
>     "endOffset" : { "test-topic" : { "0" : 5000 }},
>     "numInputRows" : 5000,
>     "processedRowsPerSecond" : 645.3278265358803
>   } ],
>   "sink" : {
>     "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f",
>    "startOffset" : null,
>     "endOffset" { "sinkTopic": { "0": 333 }}
>   }
> {code}
>  
> *Implementation*
> * PR submitters: Me and [~wajda] as soon as prerequisite jira is merged.
>  * {{Sinks}}: Modify all v2 sinks to conform to a new interface or return dummy 
> values.
>  * {{ProgressReporter}}: Merge offsets from different batches properly, 
> similarly to how it is done for sources.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25132:


Assignee: Apache Spark

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  
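
A hedged PySpark sketch of the same field-resolution mismatch without the metastore, assuming the reader resolves Parquet fields case-sensitively as reported; the path and column names are illustrative.

{code:python}
# Hedged sketch: Parquet files carry a lower-case `id`, the user-supplied
# schema uses upper-case `ID`, and the column is reported to come back as NULLs.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.appName("case-mismatch-demo").getOrCreate()

spark.range(5).write.mode("overwrite").parquet("/tmp/case_demo")   # column `id`

upper_schema = StructType([StructField("ID", LongType())])
spark.read.schema(upper_schema).parquet("/tmp/case_demo").show()
{code}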



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25132:


Assignee: (was: Apache Spark)

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585198#comment-16585198
 ] 

Apache Spark commented on SPARK-25132:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22142

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25156) Same query returns different result

2018-08-19 Thread Yonghwan Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonghwan Lee updated SPARK-25156:
-
Description: 
I performed two inner joins and two left outer joins on five tables.

There are several different results when you run the same query multiple times.

Table A
  
||Column a||Column b||Column c||Column d||
|Long(nullable: false)|Integer(nullable: false)|String(nullable: 
true)|String(nullable: false)|

Table B
||Column a||Column b||
|Long(nullable: false)|String(nullable: false)|

Table C
||Column a||Column b||
|Integer(nullable: false)|String(nullable: false)|

Table D
||Column a||Column b||Column c||
|Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|

Table E
||Column a||Column b||Column c||
|Long(nullable: false)|Integer(nullable: false)|String|

Query(Spark SQL)
{code:java}
select A.c, B.b, C.b, D.c, E.c
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b{code}
 

I performed the above query 10 times; it returned a correct result (count: 
830001460) 7 times and an incorrect result (count: 830001299) 3 times.

 

Additionally, I executed 
{code:java}
sql("set spark.sql.shuffle.partitions=801"){code}
before executing the query.

Tables A and B have a lot of rows but table C is a small dataset, so in the 
physical plan the A <-> B join was performed with a SortMergeJoin and the 
(A,B) <-> C join with a broadcast hash join.

 

And now that I have removed the "set spark.sql.shuffle.partitions" statement, it 
works fine.

Is this a Spark SQL bug?

  was:
I performed two joins and two left outer join on five tables.

There are several different results when you run the same query multiple times.

Table A
  
||Column a||Column b||Column c||Column d||
|Long(nullable: false)|Integer(nullable: false)|String(nullable: 
true)|String(nullable: false)|

Table B
||Column a||Column b||
|Long(nullable: false)|String(nullable: false)|

Table C
||Column a||Column b||
|Integer(nullable: false)|String(nullable: false)|

Table D
||Column a||Column b||Column c||
|Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|

Table E
||Column a||Column b||Column c||
|Long(nullable: false)|Integer(nullable: false)|String|

Query(Spark SQL)
{code:java}
select A.c, B.b, C.b, D.c, E.c
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b{code}
 

I performed above query 10 times, it returns 7 times correct result(count: 
830001460) and 3 times incorrect result(count: 830001299)

 

+ I execute 
{code:java}
sql("set spark.sql.shuffle.partitions=801"){code}
before execute query.

A, B Table has lot of rows but C Table has small dataset, so when i saw 
physical plan, A<-> B join performed with SortMergeJoin and (A,B) <-> C join 
performed with Broadcast hash join.

Is this spark sql's bug?


> Same query returns different result
> ---
>
> Key: SPARK-25156
> URL: https://issues.apache.org/jira/browse/SPARK-25156
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: * Spark Version: 2.1.1
>  * Java Version: Java 7
>  * Scala Version: 2.11.8
>Reporter: Yonghwan Lee
>Priority: Major
>  Labels: Question
>
> I performed two inner joins and two left outer joins on five tables.
> There are several different results when you run the same query multiple 
> times.
> Table A
>   
> ||Column a||Column b||Column c||Column d||
> |Long(nullable: false)|Integer(nullable: false)|String(nullable: 
> true)|String(nullable: false)|
> Table B
> ||Column a||Column b||
> |Long(nullable: false)|String(nullable: false)|
> Table C
> ||Column a||Column b||
> |Integer(nullable: false)|String(nullable: false)|
> Table D
> ||Column a||Column b||Column c||
> |Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|
> Table E
> ||Column a||Column b||Column c||
> |Long(nullable: false)|Integer(nullable: false)|String|
> Query(Spark SQL)
> {code:java}
> select A.c, B.b, C.b, D.c, E.c
> inner join B on A.a = B.a
> inner join C on A.b = C.a
> left outer join D on A.d <=> cast(D.a as string)
> left outer join E on D.b = E.a and D.c = E.b{code}
>  
> I performed the above query 10 times; it returned a correct result (count: 
> 830001460) 7 times and an incorrect result (count: 830001299) 3 times.
>  
> Additionally, I executed 
> {code:java}
> sql("set spark.sql.shuffle.partitions=801"){code}
> before executing the query.
> Tables A and B have a lot of rows but table C is a small dataset, so in the 
> physical plan the A <-> B join was performed with a SortMergeJoin and the 
> (A,B) <-> C join with a broadcast hash join.
>  
> And now that I have removed the "set spark.sql.shuffle.partitions" statement, it works fine.
> Is this a Spark SQL bug?
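
A hedged diagnostic sketch (not from the report) of how the two suspected factors can be toggled while inspecting the plan; it assumes tables named A, B, and C are already registered, and the column names simply mirror the report.

{code:python}
# Hedged sketch: vary shuffle partitions and the broadcast threshold, then
# compare the chosen join strategies and the resulting counts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-plan-check").getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "801")
# Setting the threshold to -1 disables broadcast joins, which helps isolate
# whether the BroadcastHashJoin/SortMergeJoin mix is related to the varying counts.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

result = spark.sql("""
  SELECT A.c, B.b, C.b
  FROM A
  JOIN B ON A.a = B.a
  JOIN C ON A.b = C.a
""")
result.explain()          # inspect which join strategies are actually chosen
print(result.count())     # repeat under both settings and compare counts
{code}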



--
This message was sent by 

[jira] [Updated] (SPARK-25156) Same query returns different result

2018-08-19 Thread Yonghwan Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonghwan Lee updated SPARK-25156:
-
Description: 
I performed two inner joins and two left outer joins on five tables.

There are several different results when you run the same query multiple times.

Table A
  
||Column a||Column b||Column c||Column d||
|Long(nullable: false)|Integer(nullable: false)|String(nullable: 
true)|String(nullable: false)|

Table B
||Column a||Column b||
|Long(nullable: false)|String(nullable: false)|

Table C
||Column a||Column b||
|Integer(nullable: false)|String(nullable: false)|

Table D
||Column a||Column b||Column c||
|Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|

Table E
||Column a||Column b||Column c||
|Long(nullable: false)|Integer(nullable: false)|String|

Query(Spark SQL)
{code:java}
select A.c, B.b, C.b, D.c, E.c
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b{code}
 

I performed the above query 10 times; it returned a correct result (count: 
830001460) 7 times and an incorrect result (count: 830001299) 3 times.

 

Additionally, I executed 
{code:java}
sql("set spark.sql.shuffle.partitions=801"){code}
before executing the query.

Tables A and B have a lot of rows but table C is a small dataset, so in the 
physical plan the A <-> B join was performed with a SortMergeJoin and the 
(A,B) <-> C join with a broadcast hash join.

Is this a Spark SQL bug?

  was:
I performed two joins and two left outer join on five tables.

There are several different results when you run the same query multiple times.

Table A
  
||Column a||Column b||Column c||Column d||
|Long(nullable: false)|Integer(nullable: false)|String(nullable: 
true)|String(nullable: false)|

Table B
||Column a||Column b||
|Long(nullable: false)|String(nullable: false)|

Table C
||Column a||Column b||
|Integer(nullable: false)|String(nullable: false)|

Table D
||Column a||Column b||Column c||
|Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|

Table E
||Column a||Column b||Column c||
|Long(nullable: false)|Integer(nullable: false)|String|

Query(Spark SQL)
{code:java}
select A.c, B.b, C.b, D.c, E.c
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b{code}
 

I performed above query 10 times, it returns 7 times correct result(count: 
830001460) and 3 times incorrect result(count: 830001299)

 

+ I execute 
{code:java}
sql("set spark.sql.shuffle.partitions=801"){code}
Is this spark sql's bug?


> Same query returns different result
> ---
>
> Key: SPARK-25156
> URL: https://issues.apache.org/jira/browse/SPARK-25156
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: * Spark Version: 2.1.1
>  * Java Version: Java 7
>  * Scala Version: 2.11.8
>Reporter: Yonghwan Lee
>Priority: Major
>  Labels: Question
>
> I performed two inner joins and two left outer joins on five tables.
> There are several different results when you run the same query multiple 
> times.
> Table A
>   
> ||Column a||Column b||Column c||Column d||
> |Long(nullable: false)|Integer(nullable: false)|String(nullable: 
> true)|String(nullable: false)|
> Table B
> ||Column a||Column b||
> |Long(nullable: false)|String(nullable: false)|
> Table C
> ||Column a||Column b||
> |Integer(nullable: false)|String(nullable: false)|
> Table D
> ||Column a||Column b||Column c||
> |Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|
> Table E
> ||Column a||Column b||Column c||
> |Long(nullable: false)|Integer(nullable: false)|String|
> Query(Spark SQL)
> {code:java}
> select A.c, B.b, C.b, D.c, E.c
> inner join B on A.a = B.a
> inner join C on A.b = C.a
> left outer join D on A.d <=> cast(D.a as string)
> left outer join E on D.b = E.a and D.c = E.b{code}
>  
> I performed the above query 10 times; it returned a correct result (count: 
> 830001460) 7 times and an incorrect result (count: 830001299) 3 times.
>  
> Additionally, I executed 
> {code:java}
> sql("set spark.sql.shuffle.partitions=801"){code}
> before executing the query.
> Tables A and B have a lot of rows but table C is a small dataset, so in the 
> physical plan the A <-> B join was performed with a SortMergeJoin and the 
> (A,B) <-> C join with a broadcast hash join.
> Is this a Spark SQL bug?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25157) Streaming of image files from directory

2018-08-19 Thread Amit Baghel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-25157:

Description: We are doing video analytics for video streams using Spark. At 
present there is no direct way to stream video frames or image files to Spark 
and process them using Structured Streaming and Dataset. We are using Kafka to 
stream images and then doing processing at spark. We need a method in Spark to 
stream images from directory. Currently *{{DataStreamReader}}* doesn't support 
Image files. With the introduction of *org.apache.spark.ml.image.ImageSchema* 
class, we think streaming capabilities can be added for image files. It is fine 
if it won't support some of the structured streaming features as it is a binary 
file. This method could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
  (was: We are doing video analytics for video streams using Spark. At present 
there is no direct way to stream video frames or image files to Spark and 
process them using Structured Streaming and Dataset. We are using Kafka to 
stream images and then doing processing at spark. We need a method in Spark to 
stream images from directory. Currently *{{DataStreamReader}}* doesn't support 
Image files. With the introduction of *org.apache.spark.ml.image.ImageSchema* 
class, we think streaming capabilities can be added for image files. It is fine 
if it won't support some of the structured streaming features as it is a binary 
file. Schema used in ImageSchema class for image can be used for Dataset. This 
method could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py])

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Major
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> them using Structured Streaming and Dataset. We are using Kafka to stream 
> images and then doing processing at spark. We need a method in Spark to 
> stream images from directory. Currently *{{DataStreamReader}}* doesn't 
> support Image files. With the introduction of 
> *org.apache.spark.ml.image.ImageSchema* class, we think streaming 
> capabilities can be added for image files. It is fine if it won't support 
> some of the structured streaming features as it is a binary file. This method 
> could be similar to *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
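
For comparison, a hedged PySpark sketch of what exists today: a batch image reader via ImageSchema (assuming Spark 2.3+); the directory path and option values are illustrative, and the streaming counterpart is exactly what this ticket asks for.

{code:python}
# Hedged sketch: today's batch image reader; there is no streaming equivalent yet.
from pyspark.sql import SparkSession
from pyspark.ml.image import ImageSchema

spark = SparkSession.builder.appName("image-batch-read").getOrCreate()

# Returns a DataFrame with a single "image" struct column
# (origin, height, width, nChannels, mode, data).
images = ImageSchema.readImages("/data/frames", recursive=True,
                                dropImageFailures=True)
images.printSchema()

# A spark.readStream source reusing this same schema is the requested feature.
{code}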



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25156) Same query returns different result

2018-08-19 Thread Yonghwan Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonghwan Lee updated SPARK-25156:
-
Description: 
I performed two inner joins and two left outer joins on five tables.

There are several different results when you run the same query multiple times.

Table A
  
||Column a||Column b||Column c||Column d||
|Long(nullable: false)|Integer(nullable: false)|String(nullable: 
true)|String(nullable: false)|

Table B
||Column a||Column b||
|Long(nullable: false)|String(nullable: false)|

Table C
||Column a||Column b||
|Integer(nullable: false)|String(nullable: false)|

Table D
||Column a||Column b||Column c||
|Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|

Table E
||Column a||Column b||Column c||
|Long(nullable: false)|Integer(nullable: false)|String|

Query(Spark SQL)
{code:java}
select A.c, B.b, C.b, D.c, E.c
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b{code}
 

I performed the above query 10 times; it returned a correct result (count: 
830001460) 7 times and an incorrect result (count: 830001299) 3 times.

 

Additionally, I executed 
{code:java}
sql("set spark.sql.shuffle.partitions=801"){code}
before executing the query.

Is this a Spark SQL bug?

  was:
I performed two joins and two left outer join on five tables.

There are several different results when you run the same query multiple times.

Table A
 
||Column a||Column b||Column c||Column d||
|Long(nullable: false)|Integer(nullable: false)|String(nullable: 
true)|String(nullable: false)|

Table B
||Column a||Column b||
|Long(nullable: false)|String(nullable: false)|

Table C
||Column a||Column b||
|Integer(nullable: false)|String(nullable: false)|

Table D
||Column a||Column b||Column c||
|Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|

Table E
||Column a||Column b||Column c||
|Long(nullable: false)|Integer(nullable: false)|String|

Query(Spark SQL)
{code:java}
select A.c, B.b, C.b, D.c, E.c
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b{code}
 

I performed above query 10 times, it returns 7 times correct result(count: 
830001460) and 3 times incorrect result(count: 830001299)

 

Is this spark sql's bug?


> Same query returns different result
> ---
>
> Key: SPARK-25156
> URL: https://issues.apache.org/jira/browse/SPARK-25156
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: * Spark Version: 2.1.1
>  * Java Version: Java 7
>  * Scala Version: 2.11.8
>Reporter: Yonghwan Lee
>Priority: Major
>  Labels: Question
>
> I performed two inner joins and two left outer joins on five tables.
> There are several different results when you run the same query multiple 
> times.
> Table A
>   
> ||Column a||Column b||Column c||Column d||
> |Long(nullable: false)|Integer(nullable: false)|String(nullable: 
> true)|String(nullable: false)|
> Table B
> ||Column a||Column b||
> |Long(nullable: false)|String(nullable: false)|
> Table C
> ||Column a||Column b||
> |Integer(nullable: false)|String(nullable: false)|
> Table D
> ||Column a||Column b||Column c||
> |Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|
> Table E
> ||Column a||Column b||Column c||
> |Long(nullable: false)|Integer(nullable: false)|String|
> Query(Spark SQL)
> {code:java}
> select A.c, B.b, C.b, D.c, E.c
> inner join B on A.a = B.a
> inner join C on A.b = C.a
> left outer join D on A.d <=> cast(D.a as string)
> left outer join E on D.b = E.a and D.c = E.b{code}
>  
> I performed the above query 10 times; it returned a correct result (count: 
> 830001460) 7 times and an incorrect result (count: 830001299) 3 times.
>  
> Additionally, I executed 
> {code:java}
> sql("set spark.sql.shuffle.partitions=801"){code}
> before executing the query.
> Is this a Spark SQL bug?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-19 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25132:
-
Summary: Case-insensitive field resolution when reading from Parquet/ORC  
(was: Spark returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases)

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25157) Streaming of image files from directory

2018-08-19 Thread Amit Baghel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-25157:

Description: We are doing video analytics for video streams using Spark. At 
present there is no direct way to stream video frames or image files to Spark 
and process them using Structured Streaming and Dataset. We are using Kafka to 
stream images and then doing processing at spark. We need a method in Spark to 
stream images from directory. Currently *{{DataStreamReader}}* doesn't support 
Image files. With the introduction of *org.apache.spark.ml.image.ImageSchema* 
class, we think streaming capabilities can be added for image files. It is fine 
if it won't support some of the structured streaming features as it is a binary 
file. Schema used in ImageSchema class for image can be used for Dataset. This 
method could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
  (was: We are doing video analytics for video streams using Spark. At present 
there is no direct way to stream video frames or image files to Spark and 
process them using Structured Streaming and Dataset. We are using Kafka to 
stream images and then doing processing at spark. We need a method in Spark to 
stream images from directory. Currently *{{DataStreamReader}}* doesn't support 
Images. With the introduction of *org.apache.spark.ml.image.ImageSchema* class, 
we think streaming capabilities can be added for image files. It is fine if it 
won't support some of the structured streaming features as it is a binary file. 
Schema used in ImageSchema class for image can be used for Dataset. This method 
could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py])

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Major
>
> We are doing video analytics on video streams using Spark. At present there 
> is no direct way to stream video frames or image files into Spark and process 
> them using Structured Streaming and Datasets. We currently use Kafka to 
> stream images and then do the processing in Spark. We need a method in Spark 
> to stream images from a directory. Currently *{{DataStreamReader}}* doesn't 
> support image files. With the introduction of the 
> *org.apache.spark.ml.image.ImageSchema* class, we think streaming 
> capabilities can be added for image files. It is fine if some Structured 
> Streaming features are not supported, since images are binary files. The 
> schema used for images in the ImageSchema class can be used for the Dataset. 
> This method could be similar to the *mmlspark* *streamImages* method: 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
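For reference, batch reading already works via ImageSchema in Spark 2.3; a minimal sketch contrasting it with the kind of streaming source being requested (the directory path and the format("image") call are placeholders, not an existing API):

{code:java}
import org.apache.spark.ml.image.ImageSchema

// Works today (batch): load a directory of images into a DataFrame with an
// "image" struct column (origin, height, width, nChannels, mode, data).
val images = ImageSchema.readImages("/path/to/images")
images.printSchema()

// What this issue asks for (hypothetical sketch only, no such format exists yet):
// spark.readStream.format("image").schema(ImageSchema.imageSchema).load("/path/to/images"){code}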



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25157) Streaming of image files from directory

2018-08-19 Thread Amit Baghel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-25157:

Description: We are doing video analytics for video streams using Spark. At 
present there is no direct way to stream video frames or image files to Spark 
and process them using Structured Streaming and Dataset. We are using Kafka to 
stream images and then doing processing at spark. We need a method in Spark to 
stream images from directory. Currently *{{DataStreamReader}}* doesn't support 
Images. With the introduction of *org.apache.spark.ml.image.ImageSchema* class, 
we think streaming capabilities can be added for image files. It is fine if it 
won't support some of the structured streaming features as it is a binary file. 
Schema used in ImageSchema class for image can be used for Dataset. This method 
could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
  (was: We are doing video analytics for video streams using Spark. At present 
there is no direct way to stream video frames or image files to Spark and 
process using Structured Streaming and Dataset. We are using Kafka to stream 
images and then doing processing at spark. We need a method in Spark to stream 
images from directory. Currently *{{DataStreamReader}}* doesn't support Images. 
With the introduction of *org.apache.spark.ml.image.ImageSchema* class, we 
think streaming capabilities can be added for images. It is fine if it won't 
support some of the structured streaming features as it is a binary file. 
Schema used in ImageSchema class for image can be used in Dataset. This method 
could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py])

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Major
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> them using Structured Streaming and Dataset. We are using Kafka to stream 
> images and then doing processing at spark. We need a method in Spark to 
> stream images from directory. Currently *{{DataStreamReader}}* doesn't 
> support Images. With the introduction of 
> *org.apache.spark.ml.image.ImageSchema* class, we think streaming 
> capabilities can be added for image files. It is fine if it won't support 
> some of the structured streaming features as it is a binary file. Schema used 
> in ImageSchema class for image can be used for Dataset. This method could be 
> similar to *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25157) Streaming of image files from directory

2018-08-19 Thread Amit Baghel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-25157:

Priority: Major  (was: Minor)

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Major
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> using Structured Streaming and Dataset. We are using Kafka to stream images 
> and then doing processing at spark. We need a method in Spark to stream 
> images from directory. Currently *{{DataStreamReader}}* doesn't support 
> Images. With the introduction of *org.apache.spark.ml.image.ImageSchema* 
> class, we think streaming capabilities can be added for images. It is fine if 
> it won't support some of the structured streaming features as it is a binary 
> file. Schema used in ImageSchema class for image can be used in Dataset. This 
> method could be similar to *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25157) Streaming of image files from directory

2018-08-19 Thread Amit Baghel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-25157:

Description: We are doing video analytics for video streams using Spark. At 
present there is no direct way to stream video frames or image files to Spark 
and process using Structured Streaming and Dataset. We are using Kafka to 
stream images and then doing processing at spark. We need a method in Spark to 
stream images from directory. Currently *{{DataStreamReader}}* doesn't support 
Images. With the introduction of *org.apache.spark.ml.image.ImageSchema* class, 
we think streaming capabilities can be added for images. It is fine if it won't 
support some of the structured streaming features as it is a binary file. 
Schema used in ImageSchema class for image can be used in Dataset. This feature 
could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
  (was: We are doing video analytics for video streams using Spark. At present 
there is no direct way to stream video frames or image files to Spark and 
process using Structured Streaming and Dataset. We are using Kafka to stream 
images and then doing processing at spark. We need a method in Spark to stream 
images from directory. Currently *{{DataStreamReader}}* doesn't support Images. 
With the introduction of *org.apache.spark.ml.image.ImageSchema* class, we 
think streaming capabilities can be added for images. It is fine if it won't 
support some of the structured streaming features as it is a binary file. 
Schema used in ImageSchema class for image can be used in Dataset. This feature 
could be similar to *mmlspark* *streamImages* method. 
([https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
 ))

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Minor
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> using Structured Streaming and Dataset. We are using Kafka to stream images 
> and then doing processing at spark. We need a method in Spark to stream 
> images from directory. Currently *{{DataStreamReader}}* doesn't support 
> Images. With the introduction of *org.apache.spark.ml.image.ImageSchema* 
> class, we think streaming capabilities can be added for images. It is fine if 
> it won't support some of the structured streaming features as it is a binary 
> file. Schema used in ImageSchema class for image can be used in Dataset. This 
> feature could be similar to *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25157) Streaming of image files from directory

2018-08-19 Thread Amit Baghel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-25157:

Description: We are doing video analytics for video streams using Spark. At 
present there is no direct way to stream video frames or image files to Spark 
and process using Structured Streaming and Dataset. We are using Kafka to 
stream images and then doing processing at spark. We need a method in Spark to 
stream images from directory. Currently *{{DataStreamReader}}* doesn't support 
Images. With the introduction of *org.apache.spark.ml.image.ImageSchema* class, 
we think streaming capabilities can be added for images. It is fine if it won't 
support some of the structured streaming features as it is a binary file. 
Schema used in ImageSchema class for image can be used in Dataset. This method 
could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
  (was: We are doing video analytics for video streams using Spark. At present 
there is no direct way to stream video frames or image files to Spark and 
process using Structured Streaming and Dataset. We are using Kafka to stream 
images and then doing processing at spark. We need a method in Spark to stream 
images from directory. Currently *{{DataStreamReader}}* doesn't support Images. 
With the introduction of *org.apache.spark.ml.image.ImageSchema* class, we 
think streaming capabilities can be added for images. It is fine if it won't 
support some of the structured streaming features as it is a binary file. 
Schema used in ImageSchema class for image can be used in Dataset. This feature 
could be similar to *mmlspark* *streamImages* method. 
[https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py])

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Minor
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> using Structured Streaming and Dataset. We are using Kafka to stream images 
> and then doing processing at spark. We need a method in Spark to stream 
> images from directory. Currently *{{DataStreamReader}}* doesn't support 
> Images. With the introduction of *org.apache.spark.ml.image.ImageSchema* 
> class, we think streaming capabilities can be added for images. It is fine if 
> it won't support some of the structured streaming features as it is a binary 
> file. Schema used in ImageSchema class for image can be used in Dataset. This 
> method could be similar to *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25157) Streaming of image files from directory

2018-08-19 Thread Amit Baghel (JIRA)
Amit Baghel created SPARK-25157:
---

 Summary: Streaming of image files from directory
 Key: SPARK-25157
 URL: https://issues.apache.org/jira/browse/SPARK-25157
 Project: Spark
  Issue Type: New Feature
  Components: ML, Structured Streaming
Affects Versions: 2.3.1
Reporter: Amit Baghel


We are doing video analytics on video streams using Spark. At present there is 
no direct way to stream video frames or image files into Spark and process them 
using Structured Streaming and Datasets. We currently use Kafka to stream 
images and then do the processing in Spark. We need a method in Spark to stream 
images from a directory. Currently *{{DataStreamReader}}* doesn't support 
images. With the introduction of the *org.apache.spark.ml.image.ImageSchema* 
class, we think streaming capabilities can be added for images. It is fine if 
some Structured Streaming features are not supported, since images are binary 
files. The schema used for images in the ImageSchema class can be used for the 
Dataset. This feature could be similar to the *mmlspark* *streamImages* method 
([https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20924) Unable to call the function registered in the not-current database

2018-08-19 Thread Rajesh Chandramohan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585070#comment-16585070
 ] 

Rajesh Chandramohan edited comment on SPARK-20924 at 8/19/18 8:45 AM:
--

[~cloud_fan]  / [~smilegator]

Looks like we can't backport this change to 2.1.0, as it depends on a lot of 
improvement changes in Spark 2.2.


was (Author: rajeshhadoop):
[~cloud_fan] 

Looks like we can't backport this change to 2.1.0, as it depends on a lot of 
improvement changes in Spark 2.2.

> Unable to call the function registered in the not-current database
> --
>
> Key: SPARK-20924
> URL: https://issues.apache.org/jira/browse/SPARK-20924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> We are unable to call the function registered in the not-current database. 
> {noformat}
> sql("CREATE DATABASE dAtABaSe1")
> sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
> '${classOf[GenericUDAFAverage].getName}'")
> sql("SELECT dAtABaSe1.test_avg(1)")
> {noformat}
> The above code returns an error:
> {noformat}
> Undefined function: 'dAtABaSe1.test_avg'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {noformat}
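A workaround sketch for the affected versions, assuming the function was created as in the snippet above: make the database current and call the function unqualified.

{code:java}
// Switching the current database lets the permanent function resolve.
sql("USE dAtABaSe1")
sql("SELECT test_avg(1)").show(){code}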



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20924) Unable to call the function registered in the not-current database

2018-08-19 Thread Rajesh Chandramohan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585070#comment-16585070
 ] 

Rajesh Chandramohan commented on SPARK-20924:
-

[~cloud_fan] 

Looks like we can't backport this change to 2.1.0, as it depends on a lot of 
improvement changes in Spark 2.2.

> Unable to call the function registered in the not-current database
> --
>
> Key: SPARK-20924
> URL: https://issues.apache.org/jira/browse/SPARK-20924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> We are unable to call the function registered in the not-current database. 
> {noformat}
> sql("CREATE DATABASE dAtABaSe1")
> sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS 
> '${classOf[GenericUDAFAverage].getName}'")
> sql("SELECT dAtABaSe1.test_avg(1)")
> {noformat}
> The above code returns an error:
> {noformat}
> Undefined function: 'dAtABaSe1.test_avg'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25156) Same query returns different result

2018-08-19 Thread Yonghwan Lee (JIRA)
Yonghwan Lee created SPARK-25156:


 Summary: Same query returns different result
 Key: SPARK-25156
 URL: https://issues.apache.org/jira/browse/SPARK-25156
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.1.1
 Environment: * Spark Version: 2.1.1
 * Java Version: Java 7
 * Scala Version: 2.11.8
Reporter: Yonghwan Lee


I performed two inner joins and two left outer joins on five tables.

The same query returns different results when run multiple times.

Table A
 
||Column a||Column b||Column c||Column d||
|Long(nullable: false)|Integer(nullable: false)|String(nullable: 
true)|String(nullable: false)|

Table B
||Column a||Column b||
|Long(nullable: false)|String(nullable: false)|

Table C
||Column a||Column b||
|Integer(nullable: false)|String(nullable: false)|

Table D
||Column a||Column b||Column c||
|Long(nullable: true)|Long(nullable: false)|Integer(nullable: false)|

Table E
||Column a||Column b||Column c||
|Long(nullable: false)|Integer(nullable: false)|String|

Query(Spark SQL)
{code:java}
select A.c, B.b, C.b, D.c, E.c
from A
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b{code}
 

I ran the above query 10 times; it returned the correct result (count: 
830001460) 7 times and an incorrect result (count: 830001299) 3 times.

Is this a Spark SQL bug?
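A minimal Spark-shell sketch (assuming the five tables are registered as temp views A through E) that checks the nondeterminism by collecting the distinct counts seen over repeated runs:

{code:java}
// A deterministic query should yield exactly one distinct count.
val query =
  """select A.c, B.b, C.b, D.c, E.c
    |from A
    |inner join B on A.a = B.a
    |inner join C on A.b = C.a
    |left outer join D on A.d <=> cast(D.a as string)
    |left outer join E on D.b = E.a and D.c = E.b""".stripMargin

val counts = (1 to 10).map(_ => spark.sql(query).count()).distinct
println(counts){code}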



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org