[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741818#comment-16741818
 ] 

Hyukjin Kwon commented on SPARK-26591:
--

Can you add the Pandas and NumPy versions as well?

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
> crashed{code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}
> The environment is:
> Python 3.6.7
> PySpark 2.4.0
> PyArrow: 0.11.1
> Pandas:
> NumPy:
> OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux






[jira] [Updated] (SPARK-26591) illegal hardware instruction

2019-01-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26591:
-
Target Version/s:   (was: 2.4.0)

> illegal hardware instruction
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
> crashed{code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}






[jira] [Updated] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26591:
-
Description: 
When I try to use pandas_udf from examples in 
[documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType

from pyspark.sql.types import IntegerType, StringType

slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
crashed{code}
I get the error:
{code:java}
[1]    17969 illegal hardware instruction (core dumped)  python3{code}

The environment is:

Python 3.6.7
PySpark 2.4.0
PyArrow: 0.11.1
Pandas:
NumPy:
OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
x86_64 x86_64 GNU/Linux


  was:
When I try to use pandas_udf from examples in 
[documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType

from pyspark.sql.types import IntegerType, StringType

slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
crashed{code}
I get the error:
{code:java}
[1]    17969 illegal hardware instruction (core dumped)  python3{code}

The environment is:

Python 3.6.7
PySpark 2.4.0
PyArrow: 0.11.1
Pandas:
NumPy:



> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
> crashed{code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}
> The environment is:
> Python 3.6.7
> PySpark 2.4.0
> PyArrow: 0.11.1
> Pandas:
> NumPy:
> OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux






[jira] [Updated] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26591:
-
Description: 
When I try to use pandas_udf from examples in 
[documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType

from pyspark.sql.types import IntegerType, StringType

slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
crashed{code}
I get the error:
{code:java}
[1]    17969 illegal hardware instruction (core dumped)  python3{code}

The environment is:

Python 3.6.7
PySpark 2.4.0
PyArrow: 0.11.1
Pandas:
NumPy:


  was:
When I try to use pandas_udf from examples in 
[documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType

from pyspark.sql.types import IntegerType, StringType

slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
crashed{code}
I get the error:
{code:java}
[1]    17969 illegal hardware instruction (core dumped)  python3{code}


> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
> crashed{code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}
> The environment is:
> Python 3.6.7
> PySpark 2.4.0
> PyArrow: 0.11.1
> Pandas:
> NumPy:






[jira] [Updated] (SPARK-26591) illegal hardware instruction

2019-01-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26591:
-
Priority: Major  (was: Critical)

> illegal hardware instruction
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
> crashed{code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}






[jira] [Updated] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26591:
-
Summary: Scalar Pandas UDF fails with 'illegal hardware instruction' in a 
certain environment  (was: illegal hardware instruction)

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
> crashed{code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}






[jira] [Commented] (SPARK-26602) Creating and querying a UDF with an incorrect path, followed by querying tables or functions registered with a correct path, gives a runtime exception within the same session

2019-01-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741813#comment-16741813
 ] 

Hyukjin Kwon commented on SPARK-26602:
--

Mind including reproducible steps as well, please?
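
For illustration, a minimal sketch of what such reproduction steps could look like in a Hive-enabled spark-shell; the jar paths, function names and class names below are hypothetical placeholders, not the reporter's actual resources:

{code:scala}
// Hypothetical reproduction sketch (placeholder names and paths).
spark.sql("CREATE FUNCTION myFunc1 AS 'com.example.GoodUDF' USING JAR 'hdfs:///tmp/udfs/good_udf.jar'")
spark.sql("SELECT myFunc1('a')").show()   // works

spark.sql("CREATE FUNCTION myFunc2 AS 'com.example.BadUDF' USING JAR 'hdfs:///tmp/does_not_exist/bad_udf.jar'")
spark.sql("SELECT myFunc2('a')").show()   // fails: the jar cannot be downloaded

// Per the report, later statements in the same session then fail with the same
// "Failed to read external resource" error, even when they do not use myFunc2:
spark.sql("SELECT myFunc1('a')").show()
spark.sql("SELECT * FROM some_table").show()
{code}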

> Creating and querying a UDF with an incorrect path, followed by querying 
> tables or functions registered with a correct path, gives a runtime exception 
> within the same session
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Query the existing UDF again in the same session - it will throw an exception 
> stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will fail with the same error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)






[jira] [Resolved] (SPARK-26609) Kinesis-Spark Stream unable to process records

2019-01-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26609.
--
Resolution: Invalid

This sounds closer to a question. Let's ask on the dev mailing list first, and then 
file a JIRA with some analysis of the symptoms; you are likely to get a better 
answer on the mailing list. I'm resolving this for now.

> Kinesis-Spark Stream unable to process records
> --
>
> Key: SPARK-26609
> URL: https://issues.apache.org/jira/browse/SPARK-26609
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
> Environment:  
> {code:xml}
> <spark.version>2.2.0</spark.version>
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_${scala.binary.version}</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql_${scala.binary.version}</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-hive_${scala.binary.version}</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib_${scala.binary.version}</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_${scala.binary.version}</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> <dependency>
>   <groupId>com.databricks</groupId>
>   <artifactId>spark-redshift_2.11</artifactId>
>   <version>3.0.0-preview1</version>
> </dependency>
> <dependency>
>   <groupId>com.amazon.redshift</groupId>
>   <artifactId>redshift-jdbc42</artifactId>
>   <version>1.2.18.1036</version>
> </dependency>
> {code}
>  
>  
> spark.driver.cores=6
> spark.driver.memory=12g
> spark.yarn.driver.memoryOverhead=1g
> spark.driver.maxResultSize=4g
> spark.executor.memory=8g
> spark.executor.cores=4
> spark.yarn.executor.memoryOverhead=1g
> spark.executor.instances=4
> spark.shuffle.service.enabled=true
> spark.shuffle.registration.timeout=600
> spark.sql.shuffle.partitions=8
> spark.scheduler.mode=FIFO
> maximizeResourceAllocation=true
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.executorIdleTimeout=60s
>  
>Reporter: Aman Mundra
>Priority: Major
> Attachments: 1.PNG, 2.PNG
>
>
> I'm trying to consume a Kinesis stream via Spark Streaming and the Amazon KCL 
> library. The streaming job gets stuck at processing as soon as it gets the 
> first batch of non-zero records.
> I'm getting JSON data in my Kinesis stream, and here's what I'm trying to 
> achieve:
> get a DStream[Array[Byte]] > convert to DStream[String] > RDD > load as JSON to 
> create a DataFrame and perform transformations.
>  
> Similar error links:
> [https://stackoverflow.com/questions/40225135/spark-streaming-kafka-job-stuck-in-processing-stage]
> I'm running the job on emr-5.8.0 with enough cores and executors, but the job 
> still gets stuck in the processing stage and builds a huge pile of queued 
> batches over time.
> It is not able to process even a single record.
>  
> Here's the code I'm using:
>  
>  
> {code:java}
> val numStreams=2
> val sparkStreamingBatchInterval=10
> val kinesisCheckpointInterval=5
>  
> val kinesisStreams = (0 until kinesisConfig("numStreams").toInt).map { i =>
>  KinesisInputDStream.builder
>  .streamingContext(ssc)
>  .endpointUrl(kinesisConfig("endpointUrl"))
>  .regionName(kinesisConfig("regionName"))
>  .streamName(kinesisConfig("streamName"))
>  .initialPositionInStream(InitialPositionInStream.LATEST)
>  .checkpointAppName(kinesisConfig("appName"))
>  
> .checkpointInterval(Seconds(kinesisConfig("kinesisCheckpointInterval").toInt))
>  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
>  .kinesisCredentials(awsCredentials.build())
>  .build()
> }
> val unionStreams = ssc.union(kinesisStreams)
> val lines = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))
> lines.print(2)
> lines.foreachRDD(rdd => {
>   if (!rdd.partitions.isEmpty) {
>     println("New records found\nmetrics count in the batch: %s".format(rdd.count())) // works
>     println("performing transformations")
>     rdd.saveAsTextFile("path") // works
>     import sparkSession.implicits._
>     println(rdd.toString()) // not working
>     val records = rdd.toDF("records") // not working
>     println(records.take(2)) // not working
>     println(records.count()) // not working
>   } else {
>     println("No new record found")
>   }
> })
>  
> {code}
>  
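
As a side note, one way the JSON records could be turned into a DataFrame inside {{foreachRDD}} is to go through a Dataset[String] and the JSON reader instead of calling {{toDF}} on the raw strings. This is only a sketch under the assumption that {{sparkSession}} from the snippet above is in scope and that each record is a single JSON document; it is not necessarily a fix for the hang described here:

{code:scala}
lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import sparkSession.implicits._
    // Treat each record as one JSON document and let the JSON reader infer the schema.
    val records = sparkSession.read.json(rdd.toDS())
    records.printSchema()
    records.show(2)
  }
}
{code}
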
> Attaching Thread dump:
> h3. Thread dump for executor 2
> Updated at 2019/01/12 10:22:52
>  
> ||Thread ID||Thread Name||Thread State||Thread Locks||
> |65|Executor task launch worker for task 
> 70|WAITING|Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1560902703})|
> |sun.misc.Unsafe.park(Native Method) 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchro

[jira] [Assigned] (SPARK-26610) Fix inconsistency between toJSON Method in Python and Scala

2019-01-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26610:


Assignee: (was: Apache Spark)

> Fix inconsistency between toJSON Method in Python and Scala
> ---
>
> Key: SPARK-26610
> URL: https://issues.apache.org/jira/browse/SPARK-26610
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> {{DataFrame.toJSON()}} in PySpark should return a {{DataFrame}} of JSON strings 
> instead of an {{RDD}}. The method in Scala/Java was changed to return a 
> {{DataFrame}} earlier, but the one in PySpark was not changed at that time.






[jira] [Assigned] (SPARK-26610) Fix inconsistency between toJSON Method in Python and Scala

2019-01-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26610:


Assignee: Apache Spark

> Fix inconsistency between toJSON Method in Python and Scala
> ---
>
> Key: SPARK-26610
> URL: https://issues.apache.org/jira/browse/SPARK-26610
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> {{DataFrame.toJSON()}} in PySpark should return a {{DataFrame}} of JSON strings 
> instead of an {{RDD}}. The method in Scala/Java was changed to return a 
> {{DataFrame}} earlier, but the one in PySpark was not changed at that time.






[jira] [Created] (SPARK-26610) Fix inconsistency between toJSON Method in Python and Scala

2019-01-13 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26610:
-

 Summary: Fix inconsistency between toJSON Method in Python and 
Scala
 Key: SPARK-26610
 URL: https://issues.apache.org/jira/browse/SPARK-26610
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Takuya Ueshin


{{DataFrame.toJSON()}} in PySpark should return a {{DataFrame}} of JSON strings 
instead of an {{RDD}}. The method in Scala/Java was changed to return a 
{{DataFrame}} earlier, but the one in PySpark was not changed at that time.
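
For comparison, a minimal Scala sketch of the behaviour on the Scala/Java side, where {{toJSON}} already yields a Dataset of JSON strings rather than an RDD (illustrative only):

{code:scala}
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("toJSON demo").getOrCreate()
val df = spark.range(3).toDF("id")

// On the Scala/Java side this is a Dataset[String], not an RDD.
val json: Dataset[String] = df.toJSON
json.show(truncate = false)
{code}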






[jira] [Assigned] (SPARK-26466) Use ConfigEntry for hardcoded configs for submit categories.

2019-01-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26466:


Assignee: Apache Spark

> Use ConfigEntry for hardcoded configs for submit categories.
> 
>
> Key: SPARK-26466
> URL: https://issues.apache.org/jira/browse/SPARK-26466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Make the following hardcoded configs use {{ConfigEntry}}.
> {code}
> spark.kryo
> spark.kryoserializer
> spark.jars
> spark.submit
> spark.serializer
> spark.deploy
> spark.worker
> {code}
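
For context, a rough sketch of what a {{ConfigEntry}}-based definition typically looks like in Spark's internal config package; the key, doc string and default below are illustrative only, and the actual entries and defaults are defined by this sub-task:

{code:scala}
package org.apache.spark.internal.config

// Illustrative only: migrating a hardcoded key such as
// "spark.kryo.referenceTracking" to a typed ConfigEntry.
private[spark] object KryoConfigSketch {
  val KRYO_REFERENCE_TRACKING: ConfigEntry[Boolean] =
    ConfigBuilder("spark.kryo.referenceTracking")
      .doc("Whether to track references to the same object when serializing with Kryo.")
      .booleanConf
      .createWithDefault(true)
}
{code}

Call sites can then read the value with {{conf.get(KRYO_REFERENCE_TRACKING)}} instead of repeating the raw string and its default in several places.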






[jira] [Assigned] (SPARK-26466) Use ConfigEntry for hardcoded configs for submit categories.

2019-01-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26466:


Assignee: (was: Apache Spark)

> Use ConfigEntry for hardcoded configs for submit categories.
> 
>
> Key: SPARK-26466
> URL: https://issues.apache.org/jira/browse/SPARK-26466
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Make the following hardcoded configs use {{ConfigEntry}}.
> {code}
> spark.kryo
> spark.kryoserializer
> spark.jars
> spark.submit
> spark.serializer
> spark.deploy
> spark.worker
> {code}






[jira] [Resolved] (SPARK-26450) Map of schema is built too frequently in some wide queries

2019-01-13 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26450.
---
   Resolution: Fixed
 Assignee: Bruce Robbins
Fix Version/s: 3.0.0

> Map of schema is built too frequently in some wide queries
> --
>
> Key: SPARK-26450
> URL: https://issues.apache.org/jira/browse/SPARK-26450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 3.0.0
>
>
> When executing queries with wide projections and wide schemas, Spark rebuilds 
> an attribute map for the same schema many times.
> For example:
> {noformat}
> select * from orctbl where id1 = 1
> {noformat}
> Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above 
> query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq 
> instantiation builds a map of the entire list of 6000 attributes (but not 
> until lazy val exprIdToOrdinal is referenced).
> Whenever OrcFileFormat reads a new file, it generates a new unsafe 
> projection. That results in this 
> [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319]
>  getting called:
> {code:java}
> protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] =
> in.map(BindReferences.bindReference(_, inputSchema))
> {code}
> For each column in the projection, this line calls bindReference. Each call 
> passes inputSchema, a Sequence of Attributes, to a parameter position 
> expecting an AttributeSeq. The compiler implicitly calls the constructor for 
> AttributeSeq, which (lazily) builds a map for every attribute in the schema. 
> Therefore, this function builds a map of the entire schema once for each 
> column in the projection, and it does this for each input file. For the above 
> example query, this accounts for 204K instantiations of AttributeSeq.
> Readers for CSV and JSON tables do something similar.
> In addition, ProjectExec also creates an unsafe projection for each task. As 
> a result, this 
> [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91]
>  gets called, which has the same issue:
> {code:java}
>   def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] = {
> exprs.map(BindReferences.bindReference(_, inputSchema))
>   }
> {code}
> The above affects all wide queries that have a projection node, regardless of 
> the file reader. For the example query, ProjectExec accounts for the 
> additional 66K instantiations of the AttributeSeq.
> Spark can save time by pre-building the AttributeSeq right before the map 
> operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size 
> of schema, size of projection, number of input files (for Orc), number of 
> file splits (for CSV, and JSON tables), and number of tasks.
> For a 6000 column CSV table with 500K records and 34 input files, the time 
> savings is only 6%[1] because Spark doesn't create as many unsafe projections 
> as compared to Orc tables.
> On the other hand, for a 6000 column Orc table with 500K records and 34 input 
> files, the time savings is about 16%[1].
> [1] based on queries run in local mode with 8 executor threads on my laptop.
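
A rough sketch of the kind of change described above, building the {{AttributeSeq}} once per schema and reusing it for every {{bindReference}} call (simplified; the actual signatures and call sites in Catalyst may differ slightly):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSeq, BindReferences, Expression}

// Build the attribute-to-ordinal map once per schema instead of once per bound column.
def bindAll(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
  val schemaAsAttributeSeq = new AttributeSeq(inputSchema) // the lazy map is built at most once
  in.map(BindReferences.bindReference(_, schemaAsAttributeSeq))
}
{code}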






[jira] [Assigned] (SPARK-23182) Allow enabling of TCP keep alive for RPC connections

2019-01-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23182:
-

Assignee: Petar Petrov

> Allow enabling of TCP keep alive for RPC connections
> 
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.4.0
>Reporter: Petar Petrov
>Assignee: Petar Petrov
>Priority: Minor
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines 
> disappear without closing their TCP connections to the master, which increases 
> the number of established connections until new workers cannot connect because 
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep-alive for the RPC connections 
> to the master, but it is not currently possible to do so via configuration.
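
As an illustration of where such an option plugs in on the Netty side, SO_KEEPALIVE is a per-connection socket option that would be set while the client bootstrap is configured. This is a sketch only; the actual Spark configuration key and wiring are defined by the change itself:

{code:scala}
import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelOption

// Sketch: enable TCP keep-alive on outgoing RPC connections when requested.
def applyKeepAlive(bootstrap: Bootstrap, enabled: Boolean): Bootstrap =
  bootstrap.option[java.lang.Boolean](ChannelOption.SO_KEEPALIVE, enabled)
{code}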






[jira] [Resolved] (SPARK-23182) Allow enabling of TCP keep alive for RPC connections

2019-01-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23182.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20512
[https://github.com/apache/spark/pull/20512]

> Allow enabling of TCP keep alive for RPC connections
> 
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.4.0
>Reporter: Petar Petrov
>Assignee: Petar Petrov
>Priority: Minor
> Fix For: 3.0.0
>
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines 
> disappear without closing their TCP connections to the master, which increases 
> the number of established connections until new workers cannot connect because 
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep-alive for the RPC connections 
> to the master, but it is not currently possible to do so via configuration.






[jira] [Assigned] (SPARK-24497) Support recursive SQL query

2019-01-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24497:


Assignee: Apache Spark

> Support recursive SQL query
> ---
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Assigned] (SPARK-24497) Support recursive SQL query

2019-01-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24497:


Assignee: (was: Apache Spark)

> Support recursive SQL query
> ---
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Updated] (SPARK-25853) Parts of Spark components (DAG Visualization and executors page) not available in Internet Explorer

2019-01-13 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-25853:
-
Fix Version/s: (was: 2.3.3)

> Parts of Spark components (DAG Visualization and executors page) not available 
> in Internet Explorer
> --
>
> Key: SPARK-25853
> URL: https://issues.apache.org/jira/browse/SPARK-25853
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0, 2.3.2
>Reporter: aastha
>Priority: Major
> Attachments: dag_error_ie.png, dag_not_rendered_ie.png, 
> dag_on_chrome.png, execuotrs_not_rendered_ie.png, executors_error_ie.png, 
> executors_on_chrome.png
>
>
> The Spark UI has some limitations when working with Internet Explorer. The DAG 
> component and the Executors page do not render; they work on Firefox and 
> Chrome. I have tested on recent Internet Explorer 11.483.15063.0. Since it 
> works on Chrome and Firefox, their versions should not matter.
> For the Executors page, the root cause is that the document.baseURI property is 
> undefined in Internet Explorer. When I debug by providing the property 
> myself, the page shows up fine.
> For the DAG component, the developer tools haven't helped.
> Attaching screenshots of the Chrome and IE UIs and the debug console messages.






[jira] [Commented] (SPARK-25853) Parts of Spark components (DAG Visualization and executors page) not available in Internet Explorer

2019-01-13 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741511#comment-16741511
 ] 

Takeshi Yamamuro commented on SPARK-25853:
--

Yea, thanks for pinging me. Dropped.

> Parts of Spark components (DAG Visualization and executors page) not available 
> in Internet Explorer
> --
>
> Key: SPARK-25853
> URL: https://issues.apache.org/jira/browse/SPARK-25853
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0, 2.3.2
>Reporter: aastha
>Priority: Major
> Attachments: dag_error_ie.png, dag_not_rendered_ie.png, 
> dag_on_chrome.png, execuotrs_not_rendered_ie.png, executors_error_ie.png, 
> executors_on_chrome.png
>
>
> The Spark UI has some limitations when working with Internet Explorer. The DAG 
> component and the Executors page do not render; they work on Firefox and 
> Chrome. I have tested on recent Internet Explorer 11.483.15063.0. Since it 
> works on Chrome and Firefox, their versions should not matter.
> For the Executors page, the root cause is that the document.baseURI property is 
> undefined in Internet Explorer. When I debug by providing the property 
> myself, the page shows up fine.
> For the DAG component, the developer tools haven't helped.
> Attaching screenshots of the Chrome and IE UIs and the debug console messages.






[jira] [Commented] (SPARK-25853) Parts of Spark components (DAG Visualization and executors page) not available in Internet Explorer

2019-01-13 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741508#comment-16741508
 ] 

Dongjoon Hyun commented on SPARK-25853:
---

cc [~maropu].

It seems that you need to remove the fix version of this issue.

> Parts of Spark components (DAG Visualization and executors page) not available 
> in Internet Explorer
> --
>
> Key: SPARK-25853
> URL: https://issues.apache.org/jira/browse/SPARK-25853
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0, 2.3.2
>Reporter: aastha
>Priority: Major
> Fix For: 2.3.3
>
> Attachments: dag_error_ie.png, dag_not_rendered_ie.png, 
> dag_on_chrome.png, execuotrs_not_rendered_ie.png, executors_error_ie.png, 
> executors_on_chrome.png
>
>
> The Spark UI has some limitations when working with Internet Explorer. The DAG 
> component and the Executors page do not render; they work on Firefox and 
> Chrome. I have tested on recent Internet Explorer 11.483.15063.0. Since it 
> works on Chrome and Firefox, their versions should not matter.
> For the Executors page, the root cause is that the document.baseURI property is 
> undefined in Internet Explorer. When I debug by providing the property 
> myself, the page shows up fine.
> For the DAG component, the developer tools haven't helped.
> Attaching screenshots of the Chrome and IE UIs and the debug console messages.


