[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741818#comment-16741818 ] Hyukjin Kwon commented on SPARK-26591: -- Can you add Pandas and NumPy versions as well > Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain > environment > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} > The environment is: > Python 3.6.7 > PySpark 2.4.0 > PyArrow: 0.11.1 > Pandas: > NumPy: > OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
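One way to gather the Pandas and NumPy versions requested above is a small helper like the following. This is only a sketch: the function name `collect_versions` and the default module list are assumptions for illustration, not part of PySpark. If importing one of the libraries itself aborts the interpreter with the illegal-instruction error, that already narrows down which native library is involved.

```python
import importlib
import platform


def collect_versions(modules=("pyspark", "pyarrow", "pandas", "numpy")):
    """Return a dict of version strings for the interpreter and each module.

    `collect_versions` is a hypothetical helper, not a PySpark API.
    """
    versions = {
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The module is missing in this environment.
            versions[name] = "not installed"
            continue
        # Most scientific packages expose __version__; fall back if absent.
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print("{}: {}".format(name, version))
```

The printed lines can be pasted directly into the "The environment is:" section of the issue.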
[jira] [Updated] (SPARK-26591) illegal hardware instruction
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26591: - Target Version/s: (was: 2.4.0) > illegal hardware instruction > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26591: - Description: When I try to use pandas_udf from examples in [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import IntegerType, StringType slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is crashed{code} I get the error: {code:java} [1] 17969 illegal hardware instruction (core dumped) python3{code} The environment is: Python 3.6.7 PySpark 2.4.0 PyArrow: 0.11.1 Pandas: NumPy: OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux was: When I try to use pandas_udf from examples in [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import IntegerType, StringType slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is crashed{code} I get the error: {code:java} [1] 17969 illegal hardware instruction (core dumped) python3{code} The environment is: Python 3.6.7 PySpark 2.4.0 PyArrow: 0.11.1 Pandas: NumPy: > Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain > environment > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I 
try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} > The environment is: > Python 3.6.7 > PySpark 2.4.0 > PyArrow: 0.11.1 > Pandas: > NumPy: > OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26591: - Description: When I try to use pandas_udf from examples in [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import IntegerType, StringType slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is crashed{code} I get the error: {code:java} [1] 17969 illegal hardware instruction (core dumped) python3{code} The environment is: Python 3.6.7 PySpark 2.4.0 PyArrow: 0.11.1 Pandas: NumPy: was: When I try to use pandas_udf from examples in [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import IntegerType, StringType slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is crashed{code} I get the error: {code:java} [1] 17969 illegal hardware instruction (core dumped) python3{code} > Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain > environment > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from 
pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} > The environment is: > Python 3.6.7 > PySpark 2.4.0 > PyArrow: 0.11.1 > Pandas: > NumPy: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26591) illegal hardware instruction
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26591: - Priority: Major (was: Critical) > illegal hardware instruction > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26591: - Summary: Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment (was: illegal hardware instruction) > Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain > environment > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26602) Once creating and quering udf with incorrect path,followed by querying tables or functions registered with correct path gives the runtime exception within the same ses
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741813#comment-16741813 ]

Hyukjin Kwon commented on SPARK-26602:
--------------------------------------

Mind including reproducible steps as well, please?

> Once creating and quering udf with incorrect path,followed by querying tables or functions registered with correct path gives the runtime exception within the same session
> ---------------------------------------------------------------------------
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
>
> In SQL:
> 1. Query the existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Query the existing UDF again in the same session. This throws an exception stating that the resource at myFunc2's path could not be read.
> 4. Even basic operations like insert and select then fail with the same error.
> Result:
> java.lang.RuntimeException: Failed to read external resource hdfs:///tmp/hari_notexists1/two_udfs.jar
> at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
> at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
> at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
[jira] [Resolved] (SPARK-26609) Kinesis-Spark Stream unable to process records
[ https://issues.apache.org/jira/browse/SPARK-26609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26609.
----------------------------------
    Resolution: Invalid

This sounds closer to a question. Let's ask on the dev mailing list first, and then file a JIRA with some analysis of the symptoms. You may get a better answer on the mailing list. I'm resolving this for now.

> Kinesis-Spark Stream unable to process records
> ----------------------------------------------
>
> Key: SPARK-26609
> URL: https://issues.apache.org/jira/browse/SPARK-26609
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.2.0
> Environment:
> {code:java}
> 2.2.0
>
> org.apache.spark
> spark-core_${scala.binary.version}
> ${spark.version}
>
> org.apache.spark
> spark-sql_${scala.binary.version}
> ${spark.version}
>
> org.apache.spark
> spark-hive_${scala.binary.version}
> ${spark.version}
>
> org.apache.spark
> spark-mllib_${scala.binary.version}
> ${spark.version}
>
> org.apache.spark
> spark-streaming_${scala.binary.version}
> ${spark.version}
>
> org.apache.spark
> spark-streaming-kinesis-asl_2.11
> ${spark.version}
>
> com.databricks
> spark-redshift_2.11
> 3.0.0-preview1
>
> com.amazon.redshift
> redshift-jdbc42
> 1.2.18.1036
> {code}
>
> spark.driver.cores=6
> spark.driver.memory=12g
> spark.yarn.driver.memoryOverhead=1g
> spark.driver.maxResultSize=4g
> spark.executor.memory=8g
> spark.executor.cores=4
> spark.yarn.executor.memoryOverhead=1g
> spark.executor.instances=4
> spark.shuffle.service.enabled=true
> spark.shuffle.registration.timeout=600
> spark.sql.shuffle.partitions=8
> spark.scheduler.mode=FIFO
> maximizeResourceAllocation=true
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.executorIdleTimeout=60s
> Reporter: Aman Mundra
> Priority: Major
> Attachments: 1.PNG, 2.PNG
>
> I'm trying to consume a Kinesis stream via Spark Streaming and the Amazon KCL lib.
> The streaming job gets stuck at processing as soon as it gets the first batch of non-zero records.
> I'm getting JSON data in my Kinesis stream and here's what I'm trying to achieve:
> Get Dstream[ArrayByte] > convert to Dstream[String] > RDD > load as json to create dataframe and perform transformations.
>
> Similar error links:
> [https://stackoverflow.com/questions/40225135/spark-streaming-kafka-job-stuck-in-processing-stage]
> I'm running the job in emr-5.8.0 with a sufficient number of cores and executors, but the job still gets stuck in the processing stage and builds a huge pile of queued batches over time.
> It is not able to process even a single record.
>
> Here's the code I'm using:
>
> {code:java}
> val numStreams=2
> val sparkStreamingBatchInterval=10
> val kinesisCheckpointInterval=5
>
> val kinesisStreams = (0 until kinesisConfig("numStreams").toInt).map { i =>
>   KinesisInputDStream.builder
>     .streamingContext(ssc)
>     .endpointUrl(kinesisConfig("endpointUrl"))
>     .regionName(kinesisConfig("regionName"))
>     .streamName(kinesisConfig("streamName"))
>     .initialPositionInStream(InitialPositionInStream.LATEST)
>     .checkpointAppName(kinesisConfig("appName"))
>     .checkpointInterval(Seconds(kinesisConfig("kinesisCheckpointInterval").toInt))
>     .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
>     .kinesisCredentials(awsCredentials.build())
>     .build()
> }
> val unionStreams = ssc.union(kinesisStreams)
> val lines = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))
> lines.print(2)
> lines.foreachRDD(rdd => {
>   if (!rdd.partitions.isEmpty){
>     println("New records found\nmetrics count in the batch: %s".format(rdd.count())) //works
>     println("performing transformations")
>     rdd.saveAsTextFile("path")//works
>     import sparkSession.implicits._
>     println(rdd.toString()) //not working
>     val records = rdd.toDF("records") //not working
>     println(records.take(2)) //not working
>     println(records.count()) //not working
>   }
>   else
>     println("No new record found")
> })
> > {code} > > Attaching Thread dump: > h3. Thread dump for executor 2 > Updated at 2019/01/12 10:22:52 > > Collapse All > Search: > > ||Thread ID||Thread Name||Thread State||Thread Locks|| > |65|Executor task launch worker for task > 70|WAITING|Lock(java.util.concurrent.ThreadPoolExecutor$Worker@1560902703})| > |sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchro
[jira] [Assigned] (SPARK-26610) Fix inconsistency between toJSON Method in Python and Scala
[ https://issues.apache.org/jira/browse/SPARK-26610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26610: Assignee: (was: Apache Spark) > Fix inconsistency between toJSON Method in Python and Scala > --- > > Key: SPARK-26610 > URL: https://issues.apache.org/jira/browse/SPARK-26610 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > {{DataFrame.toJSON()}} in PySpark should return {{DataFrame}} of JSON string > instead of {{RDD}}. The method in Scala/Java was changed to return > {{DataFrame}} before, but the one in PySpark was not changed at that time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26610) Fix inconsistency between toJSON Method in Python and Scala
[ https://issues.apache.org/jira/browse/SPARK-26610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26610: Assignee: Apache Spark > Fix inconsistency between toJSON Method in Python and Scala > --- > > Key: SPARK-26610 > URL: https://issues.apache.org/jira/browse/SPARK-26610 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > {{DataFrame.toJSON()}} in PySpark should return {{DataFrame}} of JSON string > instead of {{RDD}}. The method in Scala/Java was changed to return > {{DataFrame}} before, but the one in PySpark was not changed at that time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26610) Fix inconsistency between toJSON Method in Python and Scala
Takuya Ueshin created SPARK-26610: - Summary: Fix inconsistency between toJSON Method in Python and Scala Key: SPARK-26610 URL: https://issues.apache.org/jira/browse/SPARK-26610 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Reporter: Takuya Ueshin {{DataFrame.toJSON()}} in PySpark should return {{DataFrame}} of JSON string instead of {{RDD}}. The method in Scala/Java was changed to return {{DataFrame}} before, but the one in PySpark was not changed at that time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26466) Use ConfigEntry for hardcoded configs for submit categories.
[ https://issues.apache.org/jira/browse/SPARK-26466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26466: Assignee: Apache Spark > Use ConfigEntry for hardcoded configs for submit categories. > > > Key: SPARK-26466 > URL: https://issues.apache.org/jira/browse/SPARK-26466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > Make the following hardcoded configs to use {{ConfigEntry}}. > {code} > spark.kryo > spark.kryoserializer > spark.jars > spark.submit > spark.serializer > spark.deploy > spark.worker > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26466) Use ConfigEntry for hardcoded configs for submit categories.
[ https://issues.apache.org/jira/browse/SPARK-26466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26466: Assignee: (was: Apache Spark) > Use ConfigEntry for hardcoded configs for submit categories. > > > Key: SPARK-26466 > URL: https://issues.apache.org/jira/browse/SPARK-26466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Make the following hardcoded configs to use {{ConfigEntry}}. > {code} > spark.kryo > spark.kryoserializer > spark.jars > spark.submit > spark.serializer > spark.deploy > spark.worker > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26450) Map of schema is built too frequently in some wide queries
[ https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell resolved SPARK-26450.
---------------------------------------
       Resolution: Fixed
         Assignee: Bruce Robbins
    Fix Version/s: 3.0.0

> Map of schema is built too frequently in some wide queries
> ----------------------------------------------------------
>
> Key: SPARK-26450
> URL: https://issues.apache.org/jira/browse/SPARK-26450
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Minor
> Fix For: 3.0.0
>
> When executing queries with wide projections and wide schemas, Spark rebuilds an attribute map for the same schema many times.
> For example:
> {noformat}
> select * from orctbl where id1 = 1
> {noformat}
> Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq instantiation builds a map of the entire list of 6000 attributes (but not until lazy val exprIdToOrdinal is referenced).
> Whenever OrcFileFormat reads a new file, it generates a new unsafe projection. That results in this [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319] getting called:
> {code:java}
> protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] =
>   in.map(BindReferences.bindReference(_, inputSchema))
> {code}
> For each column in the projection, this line calls bindReference. Each call passes inputSchema, a Sequence of Attributes, to a parameter position expecting an AttributeSeq. The compiler implicitly calls the constructor for AttributeSeq, which (lazily) builds a map for every attribute in the schema. Therefore, this function builds a map of the entire schema once for each column in the projection, and it does this for each input file. For the above example query, this accounts for 204K instantiations of AttributeSeq.
> Readers for CSV and JSON tables do something similar.
> In addition, ProjectExec also creates an unsafe projection for each task. As a result, this [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91] gets called, which has the same issue:
> {code:java}
> def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
>   exprs.map(BindReferences.bindReference(_, inputSchema))
> }
> {code}
> The above affects all wide queries that have a projection node, regardless of the file reader. For the example query, ProjectExec accounts for the additional 66K instantiations of the AttributeSeq.
> Spark can save time by pre-building the AttributeSeq right before the map operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size of schema, size of projection, number of input files (for Orc), number of file splits (for CSV, and JSON tables), and number of tasks.
> For a 6000 column CSV table with 500K records and 34 input files, the time savings is only 6%[1] because Spark doesn't create as many unsafe projections as compared to Orc tables.
> On the other hand, for a 6000 column Orc table with 500K records and 34 input files, the time savings is about 16%[1].
> [1] based on queries run in local mode with 8 executor threads on my laptop.
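The cost pattern described in SPARK-26450 can be modeled outside Spark with a toy example. The sketch below is not Spark code: `AttributeIndex`, `bind_naive`, and `bind_prebuilt` are hypothetical names standing in for AttributeSeq and the two binding strategies, and the map is built eagerly here rather than lazily. It only illustrates why wrapping the schema once per projected column multiplies the work, and why building the wrapper once before the map operation avoids it. With 6000 columns and 34 files, one wrapper per column per file gives 6000 x 34 = 204,000 instantiations, which agrees with the 204K figure quoted in the issue.

```python
class AttributeIndex:
    """Toy stand-in for AttributeSeq: wraps a schema and builds a lookup map."""

    builds = 0  # number of full name-to-ordinal maps constructed so far

    def __init__(self, attrs):
        self.attrs = list(attrs)
        AttributeIndex.builds += 1
        # One pass over the whole schema, however many lookups follow.
        self.ordinal = {name: i for i, name in enumerate(self.attrs)}


def bind_naive(projection, schema):
    # Reported pattern: a fresh wrapper (and map) for every projected column.
    return [AttributeIndex(schema).ordinal[col] for col in projection]


def bind_prebuilt(projection, schema):
    # Proposed pattern: build the wrapper once, reuse it for every column.
    index = AttributeIndex(schema)
    return [index.ordinal[col] for col in projection]


if __name__ == "__main__":
    schema = ["c{}".format(i) for i in range(6000)]

    AttributeIndex.builds = 0
    bind_naive(schema, schema)
    print("naive builds:", AttributeIndex.builds)

    AttributeIndex.builds = 0
    bind_prebuilt(schema, schema)
    print("prebuilt builds:", AttributeIndex.builds)
```

Both functions return the same ordinals; only the number of map constructions differs, which is the overhead the issue measures.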
[jira] [Assigned] (SPARK-23182) Allow enabling of TCP keep alive for RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23182: - Assignee: Petar Petrov > Allow enabling of TCP keep alive for RPC connections > > > Key: SPARK-23182 > URL: https://issues.apache.org/jira/browse/SPARK-23182 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2, 2.4.0 >Reporter: Petar Petrov >Assignee: Petar Petrov >Priority: Minor > > We rely heavily on preemptible worker machines in GCP/GCE. These machines > disappear without closing the TCP connections to the master which increases > the number of established connections and new workers can not connect because > of "Too many open files" on the master. > To solve the problem we need to enable TCP keep alive for the RPC connections > to the master but it's not possible to do so via configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23182) Allow enabling of TCP keep alive for RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23182. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 20512 [https://github.com/apache/spark/pull/20512] > Allow enabling of TCP keep alive for RPC connections > > > Key: SPARK-23182 > URL: https://issues.apache.org/jira/browse/SPARK-23182 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2, 2.4.0 >Reporter: Petar Petrov >Assignee: Petar Petrov >Priority: Minor > Fix For: 3.0.0 > > > We rely heavily on preemptible worker machines in GCP/GCE. These machines > disappear without closing the TCP connections to the master which increases > the number of established connections and new workers can not connect because > of "Too many open files" on the master. > To solve the problem we need to enable TCP keep alive for the RPC connections > to the master but it's not possible to do so via configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
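For readers unfamiliar with the mechanism behind SPARK-23182, TCP keep-alive is a per-socket operating system option. The sketch below shows that option with Python's standard `socket` module. It is not Spark's implementation and says nothing about the configuration key added by pull request 20512; `make_keepalive_socket` is a hypothetical helper name used only for illustration.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with the OS-level keep-alive option enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With SO_KEEPALIVE set, the kernel periodically probes an idle peer and
    # eventually reports the connection as dead if the peer has vanished.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


if __name__ == "__main__":
    s = make_keepalive_socket()
    enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    print("keep-alive enabled:", enabled)
    s.close()
```

Without this option, a connection whose peer disappears without closing (as with the preempted workers described above) can stay in the established state and keep holding a file descriptor on the master.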
[jira] [Assigned] (SPARK-24497) Support recursive SQL query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24497:
------------------------------------

    Assignee: Apache Spark

> Support recursive SQL query
> ---------------------------
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Yuming Wang
> Assignee: Apache Spark
> Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
>     id INTEGER PRIMARY KEY,  -- department ID
>     parent_department INTEGER REFERENCES department,  -- upper department ID
>     name TEXT  -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>     (0, NULL, 'ROOT'),
>     (1, 0, 'A'),
>     (2, 1, 'B'),
>     (3, 2, 'C'),
>     (4, 2, 'D'),
>     (5, 0, 'E'),
>     (6, 4, 'F'),
>     (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>
> To extract all departments under A, you can use the following recursive query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
>     -- non-recursive term
>     SELECT * FROM department WHERE name = 'A'
>     UNION ALL
>     -- recursive term
>     SELECT d.*
>     FROM
>         department AS d
>     JOIN
>         subdepartment AS sd
>         ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
[jira] [Assigned] (SPARK-24497) Support recursive SQL query
     [ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24497:
------------------------------------
    Assignee:     (was: Apache Spark)

> Support recursive SQL query
> ---------------------------
>
>                 Key: SPARK-24497
>                 URL: https://issues.apache.org/jira/browse/SPARK-24497
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Yuming Wang
>            Priority: Major
[jira] [Updated] (SPARK-25853) Parts of spark components (DAG Visualizationand executors page) not available in Internet Explorer
     [ https://issues.apache.org/jira/browse/SPARK-25853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-25853:
-------------------------------------
    Fix Version/s:     (was: 2.3.3)

> Parts of spark components (DAG Visualizationand executors page) not available
> in Internet Explorer
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25853
>                 URL: https://issues.apache.org/jira/browse/SPARK-25853
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.2.0, 2.3.2
>            Reporter: aastha
>            Priority: Major
>         Attachments: dag_error_ie.png, dag_not_rendered_ie.png,
> dag_on_chrome.png, execuotrs_not_rendered_ie.png, executors_error_ie.png,
> executors_on_chrome.png
>
>
> Spark UI has some limitations when working with Internet Explorer. The DAG
> component and the Executors page do not render, although both work on Firefox
> and Chrome. I have tested on a recent Internet Explorer, 11.483.15063.0. Since
> it works on Chrome and Firefox, their versions should not matter.
> For the Executors page, the root cause is that the document.baseURI property
> is undefined in Internet Explorer. When I debug by providing the property
> myself, the page shows up fine.
> For the DAG component, developer tools haven't helped.
> Attaching screenshots for Chrome and IE UI and debug console messages.
[jira] [Commented] (SPARK-25853) Parts of spark components (DAG Visualizationand executors page) not available in Internet Explorer
    [ https://issues.apache.org/jira/browse/SPARK-25853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741511#comment-16741511 ]

Takeshi Yamamuro commented on SPARK-25853:
------------------------------------------

Yea, thanks for pinging me. Dropped.

> Parts of spark components (DAG Visualizationand executors page) not available
> in Internet Explorer
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25853
>                 URL: https://issues.apache.org/jira/browse/SPARK-25853
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.2.0, 2.3.2
>            Reporter: aastha
>            Priority: Major
>         Attachments: dag_error_ie.png, dag_not_rendered_ie.png,
> dag_on_chrome.png, execuotrs_not_rendered_ie.png, executors_error_ie.png,
> executors_on_chrome.png
[jira] [Commented] (SPARK-25853) Parts of spark components (DAG Visualizationand executors page) not available in Internet Explorer
    [ https://issues.apache.org/jira/browse/SPARK-25853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741508#comment-16741508 ]

Dongjoon Hyun commented on SPARK-25853:
---------------------------------------

cc [~maropu]. It seems that you need to remove the fix version of this issue.

> Parts of spark components (DAG Visualizationand executors page) not available
> in Internet Explorer
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25853
>                 URL: https://issues.apache.org/jira/browse/SPARK-25853
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.2.0, 2.3.2
>            Reporter: aastha
>            Priority: Major
>             Fix For: 2.3.3
>
>         Attachments: dag_error_ie.png, dag_not_rendered_ie.png,
> dag_on_chrome.png, execuotrs_not_rendered_ie.png, executors_error_ie.png,
> executors_on_chrome.png