Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-02-26 Thread Takeshi Yamamuro
Hi,

Thanks for your report!
Sean had already told us why this happened here:
https://issues.apache.org/jira/browse/SPARK-19392

// maropu
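
For anyone stuck on a build that still hits this, one possible workaround (not taken from the thread, just a sketch) is to register a custom JdbcDialect that maps Oracle NUMERIC columns to a DecimalType itself, instead of reading the missing "scale" entry from the column metadata. The precision clamp and the scale of 0 below are assumptions; they match the NUMBER(19,0) columns in Ayan's table but are only a guess in general:

  import java.sql.Types
  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
  import org.apache.spark.sql.types._

  // Hedged sketch: a dialect that never reads the (possibly missing) "scale"
  // metadata key and instead derives a DecimalType from the reported precision.
  object SafeOracleDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

    override def getCatalystType(
        sqlType: Int, typeName: String, size: Int,
        md: MetadataBuilder): Option[DataType] = sqlType match {
      case Types.NUMERIC | Types.DECIMAL =>
        // `size` is the precision reported by the JDBC driver; clamp it to the
        // maximum Spark supports. A scale of 0 is assumed here, which fits the
        // NUMBER(19,0) columns above but may not fit other schemas.
        val precision = if (size <= 0 || size > 38) 38 else size
        Some(DecimalType(precision, 0))
      case _ => None // fall back to the default mappings for everything else
    }
  }

  JdbcDialects.registerDialect(SafeOracleDialect)
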

On Mon, Feb 27, 2017 at 1:19 PM, ayan guha  wrote:

> Hi
>
> I am using the CDH 5.8 build of Spark 1.6, so some patches may have been applied.
>
> Environment: Oracle Big Data Appliance, which comes with CDH 5.8.
> I am using the following command to launch:
>
> pyspark --driver-class-path $BDA_ORACLE_AUXJAR_PATH/kvclient.jar:$BDA_ORACLE_AUXJAR_PATH/ojdbc7.jar --conf spark.jars=$BDA_ORACLE_AUXJAR_PATH/kvclient.jar,$BDA_ORACLE_AUXJAR_PATH/ojdbc7.jar
>
> >>> df = sqlContext.read.jdbc(url='jdbc:oracle:thin:@hostname:1521/DEVAIM', table="Table", properties={"user": "user", "password": "password", "driver": "oracle.jdbc.OracleDriver"})
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/pyspark/sql/readwriter.py", line 289, in jdbc
>     return self._df(self._jreader.jdbc(url, table, jprop))
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
>     return f(*a, **kw)
>   File "/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p2001.2081/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o40.jdbc.
> : java.util.NoSuchElementException: key not found: scale
>     at scala.collection.MapLike$class.default(MapLike.scala:228)
>     at scala.collection.AbstractMap.default(Map.scala:58)
>     at scala.collection.MapLike$class.apply(MapLike.scala:141)
>     at scala.collection.AbstractMap.apply(Map.scala:58)
>     at org.apache.spark.sql.types.Metadata.get(Metadata.scala:108)
>     at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51)
>     at org.apache.spark.sql.jdbc.OracleDialect$.getCatalystType(OracleDialect.scala:33)
>     at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:140)
>     at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
>     at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
>     at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:209)
>     at java.lang.Thread.run(Thread.java:745)
>
>
> Table Structure:
> SURROGATE_KEY_ID NUMBER(19,0) No
> SOURCE_KEY_PART_1 VARCHAR2(255 BYTE) No
> SOURCE_KEY_PART_2 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_3 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_4 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_5 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_6 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_7 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_8 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_9 VARCHAR2(255 BYTE) Yes
> SOURCE_KEY_PART_10 VARCHAR2(255 BYTE) Yes
> SOURCE_SYSTEM_NAME VARCHAR2(50 BYTE) Yes
> SOURCE_DOMAIN_NAME VARCHAR2(50 BYTE) Yes
> EFFECTIVE_FROM_TIMESTAMP DATE No
> EFFECTIVE_TO_TIMESTAMP DATE No
> SESS_NO NUMBER(19,0) No
>
> HTH, but please feel free to let me know if I can help in any other way...
>
> Best
> Ayan
>
>
> On Fri, Feb 3, 2017 at 3:18 PM, Takeshi Yamamuro 
> wrote:
>
>> -user +dev
>> cc: xiao
>>
>> Hi, ayan,
>>
>> I made a PR to fix the issue you reported; however, none of the releases
>> I checked (e.g., v1.6, v2.0, v2.1) hit the issue. Could you describe your
>> environment and conditions in more detail?
>>
>> You first reported that you used v1.6, but I checked and found that the
>> exception does not occur there. Am I missing anything?
>>
>> // maropu
>>
>>
>>
>> On Fri, Jan 27, 2017 at 11:10 AM, ayan guha  wrote:
>>
>>> Hi
>>>
>>> I will do a little more testing and will let you know. It did not work
>>> with INT and Number types, for sure.
>>>
>>> While writing, everything is fine :)
>>>
>>> On Fri, Jan 27, 2017 at 1:04 PM, Takeshi Yamamuro wrote:
>>>
 How about this?
 

Are we still dependent on Guava jar in Spark 2.1.0 as well?

2017-02-26 Thread kant kodali
Are we still dependent on the Guava jar in Spark 2.1.0 as well (given the
Guava jar incompatibility issues)?
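
If the concern is the classic Guava version clash, one commonly used mitigation, independent of what Spark 2.1.0 itself bundles, is to shade Guava inside the application jar. A minimal sketch, assuming the sbt-assembly plugin is enabled (the shaded package name is arbitrary):

  // build.sbt fragment: relocate the application's Guava classes so they
  // cannot clash with whatever Guava version Spark puts on the classpath.
  assemblyShadeRules in assembly := Seq(
    ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
  )
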


Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-26 Thread Liang-Chi Hsieh
Hi Stan,

Looks like it is the same issue we are working to solve. Related PRs are:

https://github.com/apache/spark/pull/16998
https://github.com/apache/spark/pull/16785

You can take a look at those PRs and help review them too. Thanks.
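
For the trace-log debugging Cheng suggests further down in the quoted thread, a minimal sketch (assuming the log4j 1.x setup that Spark 2.1 ships with) is to raise the log level on the Catalyst packages before running the query:

  import org.apache.log4j.{Level, Logger}

  // Trace every analyzer/optimizer rule application; the output is verbose
  // but shows which rule the driver is stuck in.
  Logger.getLogger("org.apache.spark.sql.catalyst.rules").setLevel(Level.TRACE)
  Logger.getLogger("org.apache.spark.sql.catalyst.analysis").setLevel(Level.TRACE)
  Logger.getLogger("org.apache.spark.sql.catalyst.optimizer").setLevel(Level.TRACE)
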


StanZhai wrote
> Thanks for Cheng's help.
> 
> 
> Something must be wrong with InferFiltersFromConstraints. I just removed
> InferFiltersFromConstraints from
> org/apache/spark/sql/catalyst/optimizer/Optimizer.scala to avoid this
> issue. I will analyze this issue with the method you provided.
> 
> 
> 
> 
> -- Original --
> From: "Cheng Lian [via Apache Spark Developers List]" <ml-node+s1001551n21069...@n3.nabble.com>
> Send time: Friday, Feb 24, 2017 2:28 AM
> To: "Stan Zhai" <m...@zhaishidan.cn>
> Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
> 
> 
> 
>  
> This one seems to be relevant, but it's already fixed in 2.1.0.
>
> One way to debug is to turn on trace log and check how the
> analyzer/optimizer behaves.
>
> On 2/22/17 11:11 PM, StanZhai wrote:
>  
> Could this be related to https://issues.apache.org/jira/browse/SPARK-17733 ?
>
> -- Original --
> From: "Cheng Lian-3 [via Apache Spark Developers List]" <[hidden email]>
> Send time: Thursday, Feb 23, 2017 9:43 AM
> To: "Stan Zhai" <[hidden email]>
> Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
>
> Just from the thread dump you provided, it seems that this particular
> query plan jams our optimizer. However, it's also possible that the
> driver just happened to be running optimizer rules at that particular
> time point.
>
> Since query planning doesn't touch any actual data, could you please try
> to minimize this query by replacing the actual relations with temporary
> views derived from Scala local collections? In this way, it would be much
> easier for others to reproduce the issue.
>
> Cheng
>
> On 2/22/17 5:16 PM, Stan Zhai wrote:
>  
> Thanks for Lian's reply.
>
> Here is the QueryPlan generated by Spark 1.6.2 (I can't get it in Spark 2.1.0):
> ...
> 
> 
>  
> -- Original --
> Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0
>
> What is the query plan? We had once observed query plans that grow
> exponentially in iterative ML workloads, and the query planner hangs
> forever. For example, each iteration combines 4 plan trees of the last
> iteration and forms a larger plan tree. The size of the plan tree can
> easily reach billions of nodes after 15 iterations.
>
> On 2/22/17 9:29 AM, Stan Zhai wrote:
>
> Hi all,
>
> The driver hangs at DataFrame.rdd in Spark 2.1.0 when the DataFrame (SQL)
> is complex. Following is the thread dump of my driver:
> ...
>
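
Cheng's suggestion above, replacing the real relations with temporary views built from local Scala collections, would look roughly like the sketch below. The table and column names (t1, t2, id, name, amount) are invented for illustration; the point is that the original complex SQL can be planned against tiny in-memory views with no real data source involved:

  import org.apache.spark.sql.SparkSession

  // Minimal reproduction scaffold: local-collection views stand in for the
  // real tables so the (complex) SQL can be planned without any data.
  val spark = SparkSession.builder().master("local[*]").appName("plan-repro").getOrCreate()
  import spark.implicits._

  Seq((1L, "a"), (2L, "b")).toDF("id", "name").createOrReplaceTempView("t1")
  Seq((1L, 10.0), (2L, 20.0)).toDF("id", "amount").createOrReplaceTempView("t2")

  // Substitute the original complex SQL here; explain() only plans the query,
  // so a hang at this point isolates the analyzer/optimizer.
  spark.sql("SELECT t1.id, t1.name, t2.amount FROM t1 JOIN t2 ON t1.id = t2.id").explain(true)
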

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-26 Thread Liang-Chi Hsieh
Hi Stan,

Looks like it is the same issue we are working to solve. Related PRs are:

https://github.com/apache/spark/pull/16998
https://github.com/apache/spark/pull/16785

You can take a look at those PRs and help review them too. Thanks.



StanZhai wrote
> Hi all,
> 
> 
> The driver hangs at DataFrame.rdd in Spark 2.1.0 when the DataFrame (SQL)
> is complex. Following is the thread dump of my driver:
> 
> 
> org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:230)
> org.apache.spark.sql.catalyst.expressions.IsNotNull.equals(nullExpressions.scala:312)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315)
> scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151)
> scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
> scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139)
> scala.collection.mutable.HashSet.addElem(HashSet.scala:40)
> scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59)
> scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46)
> scala.collection.mutable.HashSet.clone(HashSet.scala:83)
> scala.collection.mutable.HashSet.clone(HashSet.scala:40)
> org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
> org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
> scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
> scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
> scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
> scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
> scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
> scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
> scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
> scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:151)
> scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104)
> scala.collection.SetLike$class.$plus$plus(SetLike.scala:141)
> org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:50)
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:300)
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:297)
> scala.collection.immutable.List.foreach(List.scala:381)
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:297)
> org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:58)
> org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
> => holding Monitor(org.apache.spark.sql.catalyst.plans.logical.Join@1365611745})
> 

Re: A DataFrame cache bug

2017-02-26 Thread Liang-Chi Hsieh


Hi Gen,

I submitted a PR to fix the issue of refreshByPath:
https://github.com/apache/spark/pull/17064

Thanks.
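
In the meantime, a version of Gen's own workaround (mentioned below: calling refreshByPath inside f(), just before spark.read) would look roughly like this sketch:

  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Refresh Spark's cached metadata for `path` inside f(), right before
  // reading it, so the subsequent filter + cache see the newly written data.
  def f(path: String, spark: SparkSession): DataFrame = {
    spark.catalog.refreshByPath(path)
    val data = spark.read.option("mergeSchema", "true").parquet(path)
    println(data.count)
    val df = data.filter("id > 10")
    df.cache
    println(df.count)
    df
  }
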



tgbaggio wrote
> Hi, the example that I provided is not very clear, so I added a clearer
> example in the JIRA.
> 
> Thanks
> 
> Cheers
> Gen
> 
> On Wed, Feb 22, 2017 at 3:47 PM, gen tang <gen.tang86@...> wrote:
> 
>> Hi Kazuaki Ishizaki
>>
>> Thanks a lot for your help. It works. However, an even stranger bug
>> appears as follows:
>>
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.sql.SparkSession
>>
>> def f(path: String, spark: SparkSession): DataFrame = {
>>   val data = spark.read.option("mergeSchema", "true").parquet(path)
>>   println(data.count)
>>   val df = data.filter("id>10")
>>   df.cache
>>   println(df.count)
>>   val df1 = df.filter("id>11")
>>   df1.cache
>>   println(df1.count)
>>   df1
>> }
>>
>> val dir = "/tmp/test"
>> spark.range(100).write.mode("overwrite").parquet(dir)
>> spark.catalog.refreshByPath(dir)
>> f(dir, spark).count // output 88 which is correct
>>
>> spark.range(1000).write.mode("overwrite").parquet(dir)
>> spark.catalog.refreshByPath(dir)
>> f(dir, spark).count // output 88 which is incorrect
>>
>> If we move refreshByPath into f(), just before spark.read, the whole code
>> works fine.
>>
>> Any idea? Thanks a lot
>>
>> Cheers
>> Gen
>>
>>
>> On Wed, Feb 22, 2017 at 2:22 PM, Kazuaki Ishizaki <ISHIZAKI@.ibm> wrote:
>>
>>> Hi,
>>> Thank you for pointing out the JIRA.
>>> I think that this JIRA suggests you to insert
>>> "spark.catalog.refreshByPath(dir)".
>>>
>>> val dir = "/tmp/test"
>>> spark.range(100).write.mode("overwrite").parquet(dir)
>>> val df = spark.read.parquet(dir)
>>> df.count // output 100 which is correct
>>> f(df).count // output 89 which is correct
>>>
>>> spark.range(1000).write.mode("overwrite").parquet(dir)
>>> spark.catalog.refreshByPath(dir)  // insert a NEW statement
>>> val df1 = spark.read.parquet(dir)
>>> df1.count // output 1000 which is correct; in fact, other operations
>>> except df1.filter("id>10") return correct results.
>>> f(df1).count // output 89 which is incorrect
>>>
>>> Regards,
>>> Kazuaki Ishizaki
>>>
>>>
>>>
>>> From: gen tang <gen.tang86@...>
>>> To: dev@.apache
>>> Date: 2017/02/22 15:02
>>> Subject: Re: A DataFrame cache bug
>>> --
>>>
>>>
>>>
>>> Hi All,
>>>
>>> I might find a related issue on jira:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-15678
>>>
>>> This issue is closed; maybe we should reopen it.
>>>
>>> Thanks
>>>
>>> Cheers
>>> Gen
>>>
>>>
>>> On Wed, Feb 22, 2017 at 1:57 PM, gen tang <gen.tang86@...> wrote:
>>> Hi All,
>>>
>>> I found a strange bug which is related to reading data from an updated
>>> path combined with a cache operation.
>>> Please consider the following code:
>>>
>>> import org.apache.spark.sql.DataFrame
>>>
>>> def f(data: DataFrame): DataFrame = {
>>>   val df = data.filter("id>10")
>>>   df.cache
>>>   df.count
>>>   df
>>> }
>>>
>>> f(spark.range(100).asInstanceOf[DataFrame]).count // output 89 which is
>>> correct
>>> f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989 which
>>> is correct
>>>
>>> val dir = "/tmp/test"
>>> spark.range(100).write.mode("overwrite").parquet(dir)
>>> val df = spark.read.parquet(dir)
>>> df.count // output 100 which is correct
>>> f(df).count // output 89 which is correct
>>>
>>> spark.range(1000).write.mode("overwrite").parquet(dir)
>>> val df1 = spark.read.parquet(dir)
>>> df1.count // output 1000 which is correct; in fact, other operations
>>> except df1.filter("id>10") return correct results.
>>> f(df1).count // output 89 which is incorrect
>>>
>>> In fact, when we use df1.filter("id>10"), Spark uses the old cached
>>> DataFrame.
>>>
>>> Any idea? Thanks a lot
>>>
>>> Cheers
>>> Gen
>>>
>>>
>>>
>>





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 