[jira] [Commented] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error
[ https://issues.apache.org/jira/browse/SPARK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198280#comment-15198280 ] Xiu (Joe) Guo commented on SPARK-12981: --- Yes [~fabboe], my PR will fix your scenario too. > Dataframe distinct() followed by a filter(udf) in pyspark throws a casting > error > > > Key: SPARK-12981 > URL: https://issues.apache.org/jira/browse/SPARK-12981 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.6.0 > Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8) >Reporter: Tom Arnfeld >Priority: Critical > > We noticed a regression when testing out an upgrade of Spark 1.6 for our > systems, where pyspark throws a casting exception when using `filter(udf)` > after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5. > Here's a little notebook that demonstrates the exception clearly... > https://gist.github.com/tarnfeld/ab9b298ae67f697894cd > Though for the sake of here... the following code will throw an exception... > {code} > data.select(col("a")).distinct().filter(my_filter(col("a"))).count() > {code} > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to > org.apache.spark.sql.catalyst.plans.logical.Aggregate > {code} > Whereas not using a UDF does not throw any errors... > {code} > data.select(col("a")).distinct().filter("a = 1").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiu (Joe) Guo updated SPARK-13366: -- Description: Saw a comment from [~marmbrus] regarding Cartesian join for Datasets: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." was: Saw a comment from [~marmbrus] about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] regarding Cartesian join for Datasets: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13366) Support Cartesian join for Datasets
[ https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiu (Joe) Guo updated SPARK-13366: -- Description: Saw a comment from [~marmbrus] about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." was: Saw a comment from Michael about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." > Support Cartesian join for Datasets > --- > > Key: SPARK-13366 > URL: https://issues.apache.org/jira/browse/SPARK-13366 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiu (Joe) Guo >Priority: Minor > > Saw a comment from [~marmbrus] about this: > "You will get a cartesian if you do a join/joinWith using lit(true) as the > condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13366) Support Cartesian join for Datasets
Xiu (Joe) Guo created SPARK-13366: - Summary: Support Cartesian join for Datasets Key: SPARK-13366 URL: https://issues.apache.org/jira/browse/SPARK-13366 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiu (Joe) Guo Priority: Minor Saw a comment from Michael about this: "You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
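A short Scala sketch of the lit(true) workaround mentioned above (ds1 and ds2 are hypothetical Datasets made up for illustration; this shows the existing workaround, not the more concise API being proposed):
{code}
import org.apache.spark.sql.functions.lit
import sqlContext.implicits._   // in spark-shell, sqlContext is already defined

// Two small hypothetical Datasets.
val ds1 = Seq(1, 2, 3).toDS()
val ds2 = Seq("a", "b").toDS()

// joinWith with a literal-true condition yields the Cartesian product as a
// Dataset of pairs: (1,a), (1,b), (2,a), (2,b), (3,a), (3,b).
val cartesian = ds1.joinWith(ds2, lit(true))
cartesian.show()
{code}
The proposal is essentially to wrap this pattern in a dedicated method so the intent is explicit, for example something like ds1.cartesian(ds2) (name purely illustrative).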
[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149387#comment-15149387 ] Xiu (Joe) Guo commented on SPARK-13283: --- Yes, it is a different problem from [SPARK-13297|https://issues.apache.org/jira/browse/SPARK-13297]. We should escape the column name based on JdbcDialect. > Spark doesn't escape column names when creating table on JDBC > - > > Key: SPARK-13283 > URL: https://issues.apache.org/jira/browse/SPARK-13283 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > Hi, > I have following problem. > I have DF where one of the columns has 'from' name. > {code} > root > |-- from: decimal(20,0) (nullable = true) > {code} > When I'm saving it to MySQL database I'm getting error: > {code} > Py4JJavaError: An error occurred while calling o183.jdbc. > : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an > error in your SQL syntax; check the manual that corresponds to your MySQL > server version for the right syntax to use near 'from DECIMAL(20,0) , ' at > line 1 > {code} > I think the problem is that Spark doesn't escape column names with ` sign on > creating table. > {code} > `from` > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
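As a rough sketch of the dialect-based quoting idea (this is not the actual patch; the helper below and its quoting rules are simplified assumptions), the point is that every column name gets wrapped in the identifier-quote character of the target database when the CREATE TABLE statement is generated:
{code}
// Simplified, assumption-laden illustration of per-dialect identifier quoting,
// so reserved words such as "from" survive DDL generation.
def quoteIdentifier(url: String, name: String): String =
  if (url.startsWith("jdbc:mysql")) s"`$name`" // MySQL quotes with backticks
  else "\"" + name + "\""                      // ANSI SQL uses double quotes

val url = "jdbc:mysql://localhost/test"        // hypothetical connection URL
val columns = Seq("from" -> "DECIMAL(20,0)", "id" -> "BIGINT")
val ddl = columns
  .map { case (name, dataType) => s"${quoteIdentifier(url, name)} $dataType" }
  .mkString("CREATE TABLE t (", ", ", ")")
// ddl == CREATE TABLE t (`from` DECIMAL(20,0), `id` BIGINT)
{code}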
[jira] [Commented] (SPARK-13297) [SQL] Backticks cannot be escaped in column names
[ https://issues.apache.org/jira/browse/SPARK-13297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145566#comment-15145566 ] Xiu (Joe) Guo commented on SPARK-13297: --- Looks like in the current [master branch|https://github.com/apache/spark/tree/42d656814f756599a2bc426f0e1f32bd4cc4470f], this problem is fixed. {code} scala> val columnName = "col`s" columnName: String = col`s scala> val rows = List(Row("foo"), Row("bar")) rows: List[org.apache.spark.sql.Row] = List([foo], [bar]) scala> val schema = StructType(Seq(StructField(columnName, StringType))) schema: org.apache.spark.sql.types.StructType = StructType(StructField(col`s,StringType,true)) scala> val rdd = sc.parallelize(rows) rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at :28 scala> val df = sqlContext.createDataFrame(rdd, schema) df: org.apache.spark.sql.DataFrame = [col`s: string] scala> val selectingColumnName = "`" + columnName.replace("`", "``") + "`" selectingColumnName: String = `col``s` scala> selectingColumnName res0: String = `col``s` scala> val selectedDf = df.selectExpr(selectingColumnName) selectedDf: org.apache.spark.sql.DataFrame = [col`s: string] scala> selectedDf.show +-+ |col`s| +-+ | foo| | bar| +-+ {code} > [SQL] Backticks cannot be escaped in column names > - > > Key: SPARK-13297 > URL: https://issues.apache.org/jira/browse/SPARK-13297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Grzegorz Chilkiewicz >Priority: Minor > > We want to use backticks to escape spaces & minus signs in column names. > Are we unable to escape backticks when a column name is surrounded by > backticks? > It is not documented in: > http://spark.apache.org/docs/latest/sql-programming-guide.html > In MySQL there is a way: double the backticks, but this trick doesn't work in > Spark-SQL. > Am I correct or just missing something? Is there a way to escape backticks > inside a column name when it is surrounded by backticks? > Code to reproduce the problem: > https://github.com/grzegorz-chilkiewicz/SparkSqlEscapeBacktick -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF
[ https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145173#comment-15145173 ] Xiu (Joe) Guo commented on SPARK-13301: --- Hi Simone: How long is the string length for each row in col1? Can you do a: myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show(3, False) > PySpark Dataframe return wrong results with custom UDF > -- > > Key: SPARK-13301 > URL: https://issues.apache.org/jira/browse/SPARK-13301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: PySpark in yarn-client mode - CDH 5.5.1 >Reporter: Simone >Priority: Critical > > Using a User Defined Function in PySpark inside the withColumn() method of > Dataframe, gives wrong results. > Here an example: > from pyspark.sql import functions > import string > myFunc = functions.udf(lambda s: string.lower(s)) > myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show() > |col1| col2|col3| > |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...| > |1D94AB4F75C83B51E...| Raffaele|4f00dcf6422100c0e...| > |4F008903600A0133E...| Cristina|4f008903600a0133e...| > The results are wrong and seem to be random: some record are OK (for example > the third) some others NO (for example the first 2). > The problem seems not occur with Spark built-in functions: > from pyspark.sql.functions import * > myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show() > Without the withColumn() method, results seems to be always correct: > myDF.select("col1", "col2", myFunc(myDF["col1"])).show() > This can be considered only in part a workaround because you have to list > each time all column of your Dataframe. > Also in Scala/Java the problems seems not occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9414) HiveContext:saveAsTable creates wrong partition for existing hive table(append mode)
[ https://issues.apache.org/jira/browse/SPARK-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131054#comment-15131054 ] Xiu (Joe) Guo commented on SPARK-9414: -- With the current master [b938301|https://github.com/apache/spark/commit/b93830126cc59a26e2cfb5d7b3c17f9cfbf85988], I could not reproduce this issue by doing: >From Hive 1.2.1 CLI: {code} create table test4DimBySpark (mydate int, hh int, x int, y int, height float, u float, v float, w float, ph float, phb float, p float, pb float, qva float, por float, qgraup float, qnice float, qnrain float, tke_pbl float, el_pbl float) partitioned by (zone int, z int, year int, month int); {code} In Spark-shell, use the first block of scala code from description to insert data. I see correct partition directories in /user/hive/warehouse and Hive can read the data back fine. Can you check with the newer versions of the code? It's probably fixed. > HiveContext:saveAsTable creates wrong partition for existing hive > table(append mode) > > > Key: SPARK-9414 > URL: https://issues.apache.org/jira/browse/SPARK-9414 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hadoop 2.6, Spark 1.4.0, Hive 0.14.0. >Reporter: Chetan Dalal >Priority: Critical > > Raising this bug because I found this issue was ready reported on Apache mail > archive and I am facing a similar issue. > ---original-- > I am using spark 1.4 and HiveContext to append data into a partitioned > hive table. I found that the data insert into the table is correct, but the > partition(folder) created is totally wrong. > {code} > val schemaString = "zone z year month date hh x y height u v w ph phb > p pb qvapor qgraup qnice qnrain tke_pbl el_pbl" > val schema = > StructType( > schemaString.split(" ").map(fieldName => > if (fieldName.equals("zone") || fieldName.equals("z") || > fieldName.equals("year") || fieldName.equals("month") || > fieldName.equals("date") || fieldName.equals("hh") || > fieldName.equals("x") || fieldName.equals("y")) > StructField(fieldName, IntegerType, true) > else > StructField(fieldName, FloatType, true) > )) > val pairVarRDD = > sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(), > 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(), > 0.0.floatValue(),0.1.floatValue(),0.0.floatValue())) > )) > val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema) > partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource") > .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark") > {code} > - > The table contains 23 columns (longer than Tuple maximum length), so I > use Row Object to store raw data, not Tuple. 
> Here is some message from spark when it saved data>> > {code} > > 15/06/16 10:39:22 INFO metadata.Hive: Renaming > src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest: > hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true > > 15/06/16 10:39:22 INFO metadata.Hive: New loading path = > hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0 > with partSpec {zone=13195, z=0, year=0, month=0} > > From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month = > 3. But spark created a partition {zone=13195, z=0, year=0, month=0}. (x) > > When I queried from hive>> > > hive> select * from test4dimBySpark; > OK > 242200931.00.0218.0365.09989.497 > 29.62711319.0717930.11982734-3174.681297735.2 16.389032 > -96.6289125135.3652.6476808E-50.0 13195000 > hive> select zone, z, year, month from test4dimBySpark; > OK > 13195000 > hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*; > Found 2 items > -rw-r--r-- 3 patcharee hdfs 1411 2015-06-16 10:39 > /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1 > > The data stored in the table is correct zone = 2, z = 42, year = 2009, > month = 3, but the partition created was wrong > "zone=13195/z=0/
[jira] [Commented] (SPARK-12262) describe extended doesn't return table on detail info tabled stored as PARQUET format
[ https://issues.apache.org/jira/browse/SPARK-12262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099007#comment-15099007 ] Xiu (Joe) Guo commented on SPARK-12262: --- You might want to check out this JIRA: https://issues.apache.org/jira/browse/SPARK-6413 > describe extended doesn't return table on detail info tabled stored as > PARQUET format > - > > Key: SPARK-12262 > URL: https://issues.apache.org/jira/browse/SPARK-12262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: pin_zhang > > 1. start hive server with start-thriftserver.sh > 2. create table table1 (id int) ; > create table table2(id int) STORED AS PARQUET; > 3. describe extended table1 ; > return detailed info > 4. describe extended table2 ; > result has no detailed info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12521) DataFrame Partitions in java does not work
[ https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071703#comment-15071703 ] Xiu (Joe) Guo commented on SPARK-12521: --- Thanks [~hvanhovell] for clearing this up. Maybe it is a good idea to make the doc clearer here and explicitly mention that the bounds are not supposed to be filters? [https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/sql/DataFrameReader.html#jdbc(java.lang.String,%20java.lang.String,%20java.lang.String,%20long,%20long,%20int,%20java.util.Properties)] > DataFrame Partitions in java does not work > -- > > Key: SPARK-12521 > URL: https://issues.apache.org/jira/browse/SPARK-12521 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.5.2 >Reporter: Sergey Podolsky > > Hello, > Partition does not work in Java interface of the DataFrame: > {code} > SQLContext sqlContext = new SQLContext(sc); > Map options = new HashMap<>(); > options.put("driver", ORACLE_DRIVER); > options.put("url", ORACLE_CONNECTION_URL); > options.put("dbtable", > "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt"); > options.put("lowerBound", "2704225000"); > options.put("upperBound", "2704226000"); > options.put("partitionColumn", "ID"); > options.put("numPartitions", "10"); > DataFrame jdbcDF = sqlContext.load("jdbc", options); > List jobsRows = jdbcDF.collectAsList(); > System.out.println(jobsRows.size()); > {code} > gives while expected 1000. Is it because of big decimal of boundaries or > partitioins does not work at all in Java? > Thanks. > Sergey -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
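To make the semantics concrete, here is a hedged Scala sketch of what the bounds actually control (the numbers mirror the report; the per-partition predicates are paraphrased, not copied from Spark's source):
{code}
import java.util.Properties

// lowerBound/upperBound only decide how values of the partition column are
// split across partitions; they are NOT filters, so every row is still read.
val props = new Properties()
val url = "jdbc:oracle:thin:@//dbhost:1521/service"   // hypothetical URL
val df = sqlContext.read.jdbc(
  url = url,
  table = "JOBS",
  columnName = "ID",
  lowerBound = 2704225000L,
  upperBound = 2704226000L,
  numPartitions = 10,
  connectionProperties = props)

// Conceptually, Spark issues one query per partition, shaped roughly like:
//   partition 0: SELECT ... WHERE ID < 2704225100
//   partition 1: SELECT ... WHERE ID >= 2704225100 AND ID < 2704225200
//   ...
//   partition 9: SELECT ... WHERE ID >= 2704225900
// Rows below lowerBound or above upperBound still land in the first or last
// partition, so df.count() equals the full row count of the table.
{code}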
[jira] [Commented] (SPARK-12521) DataFrame Partitions in java does not work
[ https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071347#comment-15071347 ] Xiu (Joe) Guo commented on SPARK-12521: --- In 1.5.2 {code}sqlContext.load(){code} is deprecated, but I can still reproduce with:{code}sqlContext.read.jdbc(){code} I don't think it is the size of your numbers. I can reproduce with small integers given as lowerBound/upperBound with my setup. Can you maybe try adding "L" at the end of your number to verify that it still gives wrong results? I think the problem is the lowerBound and upperBound are not honored here, Spark just retrieves every row instead of 1001 rows bounded in your case. > DataFrame Partitions in java does not work > -- > > Key: SPARK-12521 > URL: https://issues.apache.org/jira/browse/SPARK-12521 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.5.2 >Reporter: Sergey Podolsky > > Hello, > Partition does not work in Java interface of the DataFrame: > {code} > SQLContext sqlContext = new SQLContext(sc); > Map options = new HashMap<>(); > options.put("driver", ORACLE_DRIVER); > options.put("url", ORACLE_CONNECTION_URL); > options.put("dbtable", > "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt"); > options.put("lowerBound", "2704225000"); > options.put("upperBound", "2704226000"); > options.put("partitionColumn", "ID"); > options.put("numPartitions", "10"); > DataFrame jdbcDF = sqlContext.load("jdbc", options); > List jobsRows = jdbcDF.collectAsList(); > System.out.println(jobsRows.size()); > {code} > gives while expected 1000. Is it because of big decimal of boundaries or > partitioins does not work at all in Java? > Thanks. > Sergey -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12262) describe extended doesn't return table on detail info tabled stored as PARQUET format
[ https://issues.apache.org/jira/browse/SPARK-12262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070503#comment-15070503 ] Xiu (Joe) Guo commented on SPARK-12262: --- The property {code}spark.sql.hive.convertMetastoreParquet{code} makes Spark access Parquet tables with its built-in Parquet support instead of going through the Hive metastore route, so I am thinking the `describe extended` behavior here may be intended. A workaround (or the intended usage) would be: {code} set spark.sql.hive.convertMetastoreParquet=false; {code} before describing a Parquet table. > describe extended doesn't return table on detail info tabled stored as > PARQUET format > - > > Key: SPARK-12262 > URL: https://issues.apache.org/jira/browse/SPARK-12262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: pin_zhang > > 1. start hive server with start-thriftserver.sh > 2. create table table1 (id int) ; > create table table2(id int) STORED AS PARQUET; > 3. describe extended table1 ; > return detailed info > 4. describe extended table2 ; > result has no detailed info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
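For reference, the same workaround can be applied from a Scala shell (a minimal sketch; table2 is the table name from the report, and the amount of detail returned still depends on the Hive setup):
{code}
// Disable the built-in Parquet conversion so DESCRIBE EXTENDED goes through
// the Hive metastore path and can return the detailed table information.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
sqlContext.sql("DESCRIBE EXTENDED table2").show(100, false)
{code}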
[jira] [Commented] (SPARK-9701) allow not automatically using HiveContext with spark-shell when hive support built in
[ https://issues.apache.org/jira/browse/SPARK-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030645#comment-15030645 ] Xiu(Joe) Guo commented on SPARK-9701: - [~yhuai][~lian cheng] Would you mind reviewing my PR for SPARK-11562 and give me some feedback? Thanks! > allow not automatically using HiveContext with spark-shell when hive support > built in > - > > Key: SPARK-9701 > URL: https://issues.apache.org/jira/browse/SPARK-9701 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Thomas Graves > > I build the spark jar with hive support as most of our grids have Hive. We > were bringing up a new YARN cluster that didn't have hive installed on it yet > which results in the spark-shell failing to launch: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:374) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:116) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:168) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > It would be nice to have a config or something to tell it not to instantiate > a HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9701) allow not automatically using HiveContext with spark-shell when hive support built in
[ https://issues.apache.org/jira/browse/SPARK-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030645#comment-15030645 ] Xiu(Joe) Guo edited comment on SPARK-9701 at 11/28/15 7:44 PM: --- [~yhuai], [~lian cheng] Would you mind reviewing my PR for SPARK-11562 and give me some feedback? Thanks! was (Author: xguo27): [~yhuai][~lian cheng] Would you mind reviewing my PR for SPARK-11562 and give me some feedback? Thanks! > allow not automatically using HiveContext with spark-shell when hive support > built in > - > > Key: SPARK-9701 > URL: https://issues.apache.org/jira/browse/SPARK-9701 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Thomas Graves > > I build the spark jar with hive support as most of our grids have Hive. We > were bringing up a new YARN cluster that didn't have hive installed on it yet > which results in the spark-shell failing to launch: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:374) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:116) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:168) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > It would be nice to have a config or something to tell it not to instantiate > a HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9701) allow not automatically using HiveContext with spark-shell when hive support built in
[ https://issues.apache.org/jira/browse/SPARK-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030641#comment-15030641 ] Xiu(Joe) Guo commented on SPARK-9701: - I think this is the same issue as SPARK-11562. > allow not automatically using HiveContext with spark-shell when hive support > built in > - > > Key: SPARK-9701 > URL: https://issues.apache.org/jira/browse/SPARK-9701 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Thomas Graves > > I build the spark jar with hive support as most of our grids have Hive. We > were bringing up a new YARN cluster that didn't have hive installed on it yet > which results in the spark-shell failing to launch: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:374) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:116) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:168) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at > org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > It would be nice to have a config or something to tell it not to instantiate > a HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030615#comment-15030615 ] Xiu(Joe) Guo commented on SPARK-12030: -- I tried your scenario with some TPCDS table last night, joined on integer columns, but could not reproduce incorrect results. Does your table have very large integer values which might overflow? > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Priority: Critical > > I have following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2, id2, 0, size1, 200).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both table are cached, so results should be the same on every > query. > Then I did come counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here magic begins - I counted distinct id1 from joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results varies *(are different on every run)* between 5899000 and > 590 but never are equal to 5900729. > In addition. I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query return *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
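One quick way to check the overflow question raised above (a diagnostic sketch only; t1, t2, id1 and id2 are the names from the report, written here in Scala rather than PySpark):
{code}
import org.apache.spark.sql.functions.{min, max}

// If the extremes are close to Int.MaxValue (2147483647), 32-bit overflow in
// downstream arithmetic becomes a plausible suspect.
t1.agg(min("id1"), max("id1")).show()
t2.agg(min("id2"), max("id2")).show()
{code}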
[jira] [Commented] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029226#comment-15029226 ] Xiu(Joe) Guo commented on SPARK-6644: - With the current master branch code line (1.6.0-snapshot), this issue cannot be reproduced anymore. {panel} scala> sqlContext.sql("DROP TABLE IF EXISTS table_with_partition ") res6: org.apache.spark.sql.DataFrame = [] scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING)") res7: org.apache.spark.sql.DataFrame = [result: string] scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData") res8: org.apache.spark.sql.DataFrame = [] scala> sqlContext.sql("select * from table_with_partition") res9: org.apache.spark.sql.DataFrame = [key: int, value: string, ds: string] scala> sqlContext.sql("select * from table_with_partition").show |key|value| ds| | 1|1| 1| | 2|2| 1| scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)") res11: org.apache.spark.sql.DataFrame = [result: string] scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)") res12: org.apache.spark.sql.DataFrame = [result: string] scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData") res13: org.apache.spark.sql.DataFrame = [] scala> sqlContext.sql("SELECT * FROM table_with_partition").show |key|value|key1|destlng| ds| | 1|1|test| 1.11| 1| | 2|2|test| 1.11| 1| {panel} > After adding new columns to a partitioned table and inserting data to an old > partition, data of newly added columns are all NULL > > > Key: SPARK-6644 > URL: https://issues.apache.org/jira/browse/SPARK-6644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: dongxu > > In Hive, the schema of a partition may differ from the table schema. For > example, we may add new columns to the table after importing existing > partitions. When using {{spark-sql}} to query the data in a partition whose > schema is different from the table schema, problems may arise. Part of them > have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. > However, after adding new column(s) to the table, when inserting data into > old partitions, values of newly added columns are all {{NULL}}. 
> The following snippet can be used to reproduce this issue: > {code} > case class TestData(key: Int, value: String) > val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => > TestData(i, i.toString))).toDF() > testData.registerTempTable("testData") > sql("DROP TABLE IF EXISTS table_with_partition ") > sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) > PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'") > sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT > key, value FROM testData") > // Add new columns to the table > sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)") > sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)") > sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT > key, value, 'test', 1.11 FROM testData") > sql("SELECT * FROM table_with_partition WHERE ds = > '1'").collect().foreach(println) > {code} > Actual result: > {noformat} > [1,1,null,null,1] > [2,2,null,null,1] > {noformat} > Expected result: > {noformat} > [1,1,test,1.11,1] > [2,2,test,1.11,1] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11628) spark-sql do not support for column datatype of CHAR
[ https://issues.apache.org/jira/browse/SPARK-11628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999742#comment-14999742 ] Xiu(Joe) Guo commented on SPARK-11628: -- Hi Shunyu: I think you are right about the parser part, but on top of the parser, quite a few other places also need new code to handle the 'CHAR' type. I have been looking at the entire stack to make sure nothing is missed. Please take a look at my proposed change and see whether it looks good to you. Thanks! > spark-sql do not support for column datatype of CHAR > > > Key: SPARK-11628 > URL: https://issues.apache.org/jira/browse/SPARK-11628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: zhangshunyu > Labels: features > > In spark-sql when we create a table using the command as follwing: >"create table tablename(col char(5));" > Hive will support for creating the table, but when we desc the table: >"desc tablename" > spark will report the error: >“org.apache.spark.sql.types.DataTypeException: Unsupported dataType: > char(5). If you have a struct and a field name of it has any special > characters, please use backticks (`) to quote that field name, e.g. `x+y`. > Please note that backtick itself is not supported in a field name.” -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
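The failure is straightforward to reproduce from a HiveContext-backed shell (a sketch; the table name char_tab is made up):
{code}
// Creating the table goes through Hive and succeeds, but describing it fails
// when Spark tries to parse the column type back into a Catalyst DataType:
// org.apache.spark.sql.types.DataTypeException: Unsupported dataType: char(5)...
sqlContext.sql("CREATE TABLE char_tab (col CHAR(5))")
sqlContext.sql("DESC char_tab").show()
{code}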
[jira] [Commented] (SPARK-11631) DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no corresponding "Starting"
[ https://issues.apache.org/jira/browse/SPARK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999101#comment-14999101 ] Xiu(Joe) Guo commented on SPARK-11631: -- I am looking at it, will submit a PR shortly. > DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no > corresponding "Starting" > > > Key: SPARK-11631 > URL: https://issues.apache.org/jira/browse/SPARK-11631 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 > Environment: Spark sources as of today - revision {{5039a49}} >Reporter: Jacek Laskowski >Priority: Trivial > > At stop, DAGScheduler prints out {{INFO DAGScheduler: Stopping > DAGScheduler}}, but there's no corresponding Starting INFO message. It can be > surprising. > I think Spark should have a change and pick one: > 1. {{INFO DAGScheduler: Stopping DAGScheduler}} should be DEBUG at the most > (or even TRACE) > 2. {{INFO DAGScheduler: Stopping DAGScheduler}} should have corresponding > {{INFO DAGScheduler: Starting DAGScheduler}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org