Re: Oracle JDBC - Spark SQL - Key Not Found: Scale
Hi,

Thanks for your report! Sean had already explained why this happens here:
https://issues.apache.org/jira/browse/SPARK-19392

// maropu

On Mon, Feb 27, 2017 at 1:19 PM, ayan guha wrote:

> Hi
>
> I am using the CDH 5.8 build of Spark 1.6, so some patches may have been
> applied.
>
> Environment: Oracle Big Data Appliance, which comes with CDH 5.8.
> I am using the following command to launch:
>
> pyspark \
>   --driver-class-path $BDA_ORACLE_AUXJAR_PATH/kvclient.jar:$BDA_ORACLE_AUXJAR_PATH/ojdbc7.jar \
>   --conf spark.jars=$BDA_ORACLE_AUXJAR_PATH/kvclient.jar,$BDA_ORACLE_AUXJAR_PATH/ojdbc7.jar
>
> >>> df = sqlContext.read.jdbc(
> ...     url='jdbc:oracle:thin:@hostname:1521/DEVAIM',
> ...     table="Table",
> ...     properties={"user": "user", "password": "password",
> ...                 "driver": "oracle.jdbc.OracleDriver"})
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/pyspark/sql/readwriter.py", line 289, in jdbc
>     return self._df(self._jreader.jdbc(url, table, jprop))
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
>     return f(*a, **kw)
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o40.jdbc.
> : java.util.NoSuchElementException: key not found: scale
>         at scala.collection.MapLike$class.default(MapLike.scala:228)
>         at scala.collection.AbstractMap.default(Map.scala:58)
>         at scala.collection.MapLike$class.apply(MapLike.scala:141)
>         at scala.collection.AbstractMap.apply(Map.scala:58)
>         at org.apache.spark.sql.types.Metadata.get(Metadata.scala:108)
>         at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51)
>         at org.apache.spark.sql.jdbc.OracleDialect$.getCatalystType(OracleDialect.scala:33)
>         at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:140)
>         at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
>         at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
>         at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:209)
>         at java.lang.Thread.run(Thread.java:745)
>
> Table structure:
>
> SURROGATE_KEY_ID           NUMBER(19,0)         No
> SOURCE_KEY_PART_1          VARCHAR2(255 BYTE)   No
> SOURCE_KEY_PART_2          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_3          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_4          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_5          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_6          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_7          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_8          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_9          VARCHAR2(255 BYTE)   Yes
> SOURCE_KEY_PART_10         VARCHAR2(255 BYTE)   Yes
> SOURCE_SYSTEM_NAME         VARCHAR2(50 BYTE)    Yes
> SOURCE_DOMAIN_NAME         VARCHAR2(50 BYTE)    Yes
> EFFECTIVE_FROM_TIMESTAMP   DATE                 No
> EFFECTIVE_TO_TIMESTAMP     DATE                 No
> SESS_NO                    NUMBER(19,0)         No
>
> HTH, but please feel free to let me know if I can help in any other way...
>
> Best
> Ayan
>
> On Fri, Feb 3, 2017 at 3:18 PM, Takeshi Yamamuro wrote:
>
>> -user +dev
>> cc: xiao
>>
>> Hi, ayan,
>>
>> I made a PR to fix the issue you reported, but none of the releases I
>> checked (e.g., v1.6, v2.0, v2.1) hit it. Could you describe your
>> environment and conditions in more detail?
>>
>> You first reported that you used v1.6, but I checked and found that the
>> exception does not occur there. Am I missing anything?
>>
>> // maropu
>>
>> On Fri, Jan 27, 2017 at 11:10 AM, ayan guha wrote:
>>
>>> Hi
>>>
>>> I will do a little more testing and will let you know. It did not work
>>> with INT and NUMBER types, for sure.
>>>
>>> Writing, on the other hand, works fine :)
>>>
>>> On Fri, Jan 27, 2017 at 1:04 PM, Takeshi Yamamuro wrote:
>>>
>>>> How about this?
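For reference, a minimal Scala sketch of a possible user-side workaround for this class of failure, assuming the public JdbcDialect developer API available in Spark 1.6/2.x: register a custom dialect that is consulted before the built-in OracleDialect, so getCatalystType never reads the missing "scale" metadata entry. The object name and the DecimalType(38, 10) fallback are illustrative choices, not part of the SPARK-19392 fix:

    import java.sql.Types

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

    // Hypothetical workaround dialect: map Oracle NUMBER columns to a fixed
    // DecimalType instead of consulting the (possibly absent) "scale" metadata.
    object SafeOracleDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.startsWith("jdbc:oracle")

      override def getCatalystType(
          sqlType: Int,
          typeName: String,
          size: Int,
          md: MetadataBuilder): Option[DataType] =
        if (sqlType == Types.NUMERIC) {
          Some(DecimalType(38, 10)) // fallback precision/scale; tune to your schema
        } else {
          None // defer to the default type mapping
        }
    }

    // User-registered dialects are consulted before the built-in ones.
    JdbcDialects.registerDialect(SafeOracleDialect)

With the dialect registered, reading the table via read.jdbc(...) should resolve NUMBER columns without ever touching the "scale" key.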
Are we still dependent on Guava jar in Spark 2.1.0 as well?
Are we still dependent on the Guava jar in Spark 2.1.0 as well, given the known Guava jar incompatibility issues?
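If it helps to check empirically, here is a small Scala sketch (the variable name is just illustrative) that asks the JVM which jar a core Guava class was actually loaded from, which is usually the quickest way to spot a classpath conflict between your application's Guava and the one on Spark's classpath:

    // Print the location of the jar that provides a core Guava class.
    // Run this in spark-shell on the cluster to see which Guava "wins".
    val guavaSource = classOf[com.google.common.base.Preconditions]
      .getProtectionDomain.getCodeSource.getLocation
    println(s"Guava loaded from: $guavaSource")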
Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
Hi Stan,

Looks like it is the same issue we are working to solve. Related PRs are:

https://github.com/apache/spark/pull/16998
https://github.com/apache/spark/pull/16785

You can take a look at those PRs and help review them too.

Thanks.

StanZhai wrote
> Thanks for Cheng's help.
>
> It must be something wrong with InferFiltersFromConstraints. I just
> removed InferFiltersFromConstraints from
> org/apache/spark/sql/catalyst/optimizer/Optimizer.scala to avoid this
> issue. I will analyze it with the method you provided.
>
> ------------------ Original ------------------
> From: "Cheng Lian [via Apache Spark Developers List]"
> Send time: Friday, Feb 24, 2017 2:28 AM
> To: "Stan Zhai"
> Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
>
> This one seems to be relevant, but it's already fixed in 2.1.0.
>
> One way to debug is to turn on the trace log and check how the
> analyzer/optimizer behaves.
>
> On 2/22/17 11:11 PM, StanZhai wrote:
>> Could this be related to
>> https://issues.apache.org/jira/browse/SPARK-17733 ?
>>
>> ------------------ Original ------------------
>> From: "Cheng Lian-3 [via Apache Spark Developers List]"
>> Send time: Thursday, Feb 23, 2017 9:43 AM
>> To: "Stan Zhai"
>> Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
>>
>> Just from the thread dump you provided, it seems that this particular
>> query plan jams our optimizer. However, it's also possible that the
>> driver just happened to be running optimizer rules at that particular
>> time point.
>>
>> Since query planning doesn't touch any actual data, could you please
>> try to minimize this query by replacing the actual relations with
>> temporary views derived from Scala local collections? In this way, it
>> would be much easier for others to reproduce the issue.
>>
>> Cheng
>>
>> On 2/22/17 5:16 PM, Stan Zhai wrote:
>>> Thanks for Lian's reply.
>>>
>>> Here is the query plan generated by Spark 1.6.2 (I can't get it in
>>> Spark 2.1.0):
>>> ...
>>>
>>> ------------------ Original ------------------
>>> Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
>>>
>>> What is the query plan? We once observed query plans that grow
>>> exponentially in iterative ML workloads, where the query planner hangs
>>> forever. For example, each iteration combines 4 plan trees of the last
>>> iteration and forms a larger plan tree. The size of the plan tree can
>>> easily reach billions of nodes after 15 iterations.
>>>
>>> On 2/22/17 9:29 AM, Stan Zhai wrote:
>>>> Hi all,
>>>>
>>>> The driver hangs at DataFrame.rdd in Spark 2.1.0 when the
>>>> DataFrame (SQL) is complex. Following is a thread dump of my driver:
>>>> ...
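To make Cheng's suggestion concrete, here is a hedged Scala sketch of what such a minimized reproduction could look like; the view names, schemas, and the query below are placeholders for the real relations and the real complex SQL:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("plan-repro").getOrCreate()
    import spark.implicits._

    // Replace the actual relations with temp views over local collections,
    // so no data source is involved and planning alone can be profiled.
    Seq((1L, "a"), (2L, "b")).toDF("id", "name").createOrReplaceTempView("t1")
    Seq((1L, 10.0), (2L, 20.0)).toDF("id", "score").createOrReplaceTempView("t2")

    // Substitute the original complex SQL here; DataFrame.rdd forces the full
    // analysis/optimization pipeline without reading any data.
    val df = spark.sql("SELECT t1.id, t1.name, t2.score FROM t1 JOIN t2 ON t1.id = t2.id")
    df.rdd

If the driver still hangs in the optimizer with local views like these, the plan shape alone reproduces the bug, and the snippet can be attached to the JIRA as a self-contained test case.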
Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
Hi Stan,

Looks like it is the same issue we are working to solve. Related PRs are:

https://github.com/apache/spark/pull/16998
https://github.com/apache/spark/pull/16785

You can take a look at those PRs and help review them too.

Thanks.

StanZhai wrote
> Hi all,
>
> The driver hangs at DataFrame.rdd in Spark 2.1.0 when the DataFrame (SQL)
> is complex. Following is a thread dump of my driver:
>
> org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:230)
> org.apache.spark.sql.catalyst.expressions.IsNotNull.equals(nullExpressions.scala:312)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151)
> scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
> scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139)
> scala.collection.mutable.HashSet.addElem(HashSet.scala:40)
> scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59)
> scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46)
> scala.collection.mutable.HashSet.clone(HashSet.scala:83)
> scala.collection.mutable.HashSet.clone(HashSet.scala:40)
> org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
> org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
> scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
> scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
> scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
> scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
> scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
> scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
> scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
> scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:151)
> scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104)
> scala.collection.SetLike$class.$plus$plus(SetLike.scala:141)
> org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:50)
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:300)
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:297)
> scala.collection.immutable.List.foreach(List.scala:381)
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:297)
> org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:58)
> org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
>   => holding Monitor(org.apache.spark.sql.catalyst.plans.logical.Join@1365611745)
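For readers hitting the same wall: the dump above shows constraints$lzycompute walking an enormous set of Or/IsNotNull expressions. A hedged Scala sketch of the kind of iterative workload that produces such plans (the collection, loop body, and iteration count are illustrative, not Stan's actual job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Each round rebuilds df from two copies of itself, so the logical plan
    // (and the constraint set the optimizer maintains over it) roughly doubles
    // per iteration; constraint propagation soon dominates planning time.
    var df = Seq(1, 2, 3).toDF("id")
    for (_ <- 1 to 12) {
      df = df.union(df).filter($"id" > 0)
    }
    df.rdd // planning happens here; watch for time spent computing constraints

One mitigation, if applicable to the job, is to cut the lineage each iteration, e.g. with df.checkpoint() in 2.1+ (after setting a checkpoint directory via spark.sparkContext.setCheckpointDir), so the optimizer only ever sees a small plan.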
Re: A DataFrame cache bug
Hi Gen,

I submitted a PR to fix the issue of refreshByPath:
https://github.com/apache/spark/pull/17064

Thanks.

tgbaggio wrote
> Hi,
>
> The example that I provided is not very clear, so I added a clearer
> example in the JIRA.
>
> Thanks
>
> Cheers
> Gen
>
> On Wed, Feb 22, 2017 at 3:47 PM, gen tang wrote:
>
>> Hi Kazuaki Ishizaki
>>
>> Thanks a lot for your help. It works. However, an even stranger bug
>> appears, as follows:
>>
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.sql.SparkSession
>>
>> def f(path: String, spark: SparkSession): DataFrame = {
>>   val data = spark.read.option("mergeSchema", "true").parquet(path)
>>   println(data.count)
>>   val df = data.filter("id>10")
>>   df.cache
>>   println(df.count)
>>   val df1 = df.filter("id>11")
>>   df1.cache
>>   println(df1.count)
>>   df1
>> }
>>
>> val dir = "/tmp/test"
>> spark.range(100).write.mode("overwrite").parquet(dir)
>> spark.catalog.refreshByPath(dir)
>> f(dir, spark).count  // output 88, which is correct
>>
>> spark.range(1000).write.mode("overwrite").parquet(dir)
>> spark.catalog.refreshByPath(dir)
>> f(dir, spark).count  // output 88, which is incorrect
>>
>> If we move refreshByPath into f(), just before spark.read, the whole
>> code works fine.
>>
>> Any idea? Thanks a lot
>>
>> Cheers
>> Gen
>>
>> On Wed, Feb 22, 2017 at 2:22 PM, Kazuaki Ishizaki wrote:
>>
>>> Hi,
>>> Thank you for pointing out the JIRA.
>>> I think that this JIRA suggests you insert
>>> "spark.catalog.refreshByPath(dir)".
>>>
>>> val dir = "/tmp/test"
>>> spark.range(100).write.mode("overwrite").parquet(dir)
>>> val df = spark.read.parquet(dir)
>>> df.count    // output 100, which is correct
>>> f(df).count // output 89, which is correct
>>>
>>> spark.range(1000).write.mode("overwrite").parquet(dir)
>>> spark.catalog.refreshByPath(dir) // insert a NEW statement
>>> val df1 = spark.read.parquet(dir)
>>> df1.count    // output 1000, which is correct; in fact, operations other
>>>              // than df1.filter("id>10") return correct results
>>> f(df1).count // output 89, which is incorrect
>>>
>>> Regards,
>>> Kazuaki Ishizaki
>>>
>>> From: gen tang
>>> Date: 2017/02/22 15:02
>>> Subject: Re: A DataFrame cache bug
>>>
>>> Hi All,
>>>
>>> I might have found a related issue on JIRA:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-15678
>>>
>>> This issue is closed; maybe we should reopen it.
>>>
>>> Thanks
>>>
>>> Cheers
>>> Gen
>>>
>>> On Wed, Feb 22, 2017 at 1:57 PM, gen tang wrote:
>>>
>>> Hi All,
>>>
>>> I found a strange bug related to reading data from an updated path
>>> and the cache operation. Please consider the following code:
>>>
>>> import org.apache.spark.sql.DataFrame
>>>
>>> def f(data: DataFrame): DataFrame = {
>>>   val df = data.filter("id>10")
>>>   df.cache
>>>   df.count
>>>   df
>>> }
>>>
>>> f(spark.range(100).asInstanceOf[DataFrame]).count  // output 89, which is correct
>>> f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989, which is correct
>>>
>>> val dir = "/tmp/test"
>>> spark.range(100).write.mode("overwrite").parquet(dir)
>>> val df = spark.read.parquet(dir)
>>> df.count    // output 100, which is correct
>>> f(df).count // output 89, which is correct
>>>
>>> spark.range(1000).write.mode("overwrite").parquet(dir)
>>> val df1 = spark.read.parquet(dir)
>>> df1.count    // output 1000, which is correct; in fact, operations other
>>>              // than df1.filter("id>10") return correct results
>>> f(df1).count // output 89, which is incorrect
>>>
>>> In fact, when we use df1.filter("id>10"), Spark nevertheless uses the
>>> old cached DataFrame.
>>>
>>> Any idea? Thanks a lot
>>>
>>> Cheers
>>> Gen

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
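Since Gen notes above that moving refreshByPath into f(), just before spark.read, makes the whole code work, here is a minimal sketch of that variant for anyone following along (the function name is mine, not from the thread):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Refresh the path immediately before reading, inside f(), so each call
    // sees the files currently at `path` rather than a stale cached plan.
    def fRefreshed(path: String, spark: SparkSession): DataFrame = {
      spark.catalog.refreshByPath(path) // invalidate caches tied to this path
      val data = spark.read.option("mergeSchema", "true").parquet(path)
      val df = data.filter("id > 10")
      df.cache()
      df.count() // materialize the cache against the freshly listed files
      df
    }

This sidesteps the staleness until the fix in the PR above lands.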