RE: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Nicolas Tallineau
But it forces you to create your own SparkContext, which I’d rather not do. Also it doesn’t seem to allow me to directly create a table from a DataFrame, as follows: TestHive.createDataFrame[MyType](rows).write.saveAsTable("a_table") From: Xin Wu [mailto:xwu0...@gmail.com] Sent: 13 January 2017

filter rows based on all columns

2017-01-13 Thread Xiaomeng Wan
I need to filter out outliers from a DataFrame on all columns. I can manually list all columns like: df.filter(x=>math.abs(x.get(0).toString().toDouble-means(0))<=3*stddevs(0)) .filter(x=>math.abs(x.get(1).toString().toDouble-means(1))<=3*stddevs(1)) ... But I want to turn it into a
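
A minimal sketch of how those per-column filters could be folded into one generic filter, assuming means and stddevs are sequences of doubles aligned with the column order of df and that every column parses as a Double (names taken from the snippet above):

    import org.apache.spark.sql.{DataFrame, Row}

    // Keep only rows where every column lies within 3 standard deviations of its mean.
    def filterOutliers(df: DataFrame, means: Seq[Double], stddevs: Seq[Double]): DataFrame =
      df.filter { row: Row =>
        (0 until row.length).forall { i =>
          math.abs(row.get(i).toString.toDouble - means(i)) <= 3 * stddevs(i)
        }
      }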

Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
In terms of the NullPointerException, I think it is a bug: since the test data directories might have been moved already, it failed to load the test data to create the test tables. You may create a JIRA for this. On Fri, Jan 13, 2017 at 11:44 AM, Xin Wu wrote: > If you are using

Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
If you are using spark-shell, you already have the instance "sc" as the initialized SparkContext. If you are writing your own application, you need to create a SparkSession, which comes with a SparkContext, so you can reference it like sparkSession.sparkContext. In terms of creating a table from
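
A short sketch of that in a standalone application, assuming Hive support is available on the classpath; MyType and the sample rows are hypothetical, and "a_table" is the name from the earlier message:

    import org.apache.spark.sql.SparkSession

    case class MyType(id: Int, name: String) // hypothetical schema, for illustration only

    val spark = SparkSession.builder()
      .appName("testhive-example")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    // The SparkContext comes with the session; no need to build one yourself.
    val sc = spark.sparkContext

    val rows = Seq(MyType(1, "a"), MyType(2, "b"))
    spark.createDataFrame(rows).write.saveAsTable("a_table")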

Re: [Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Xin Wu
I used the following: val testHive = new org.apache.spark.sql.hive.test.TestHiveContext(sc, false) val hiveClient = testHive.sessionState.metadataHive hiveClient.runSqlHive(“….”) On Fri, Jan 13, 2017 at 6:40 AM, Nicolas Tallineau < nicolas.tallin...@ubisoft.com> wrote: > I get a
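
A slightly fuller version of the same approach, as a sketch: it assumes sc is an existing SparkContext and that the spark-hive test artifact is on the classpath; passing false skips loading the built-in test tables, and the DDL string is just a placeholder.

    import org.apache.spark.sql.hive.test.TestHiveContext

    // false disables loading of the bundled test tables that trigger the NPE.
    val testHive = new TestHiveContext(sc, false)
    val hiveClient = testHive.sessionState.metadataHive
    hiveClient.runSqlHive("CREATE TABLE IF NOT EXISTS a_table (id INT)") // placeholder DDL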

Debugging a PythonException with no details

2017-01-13 Thread Nicholas Chammas
I’m looking for tips on how to debug a PythonException that’s very sparse on details. The full exception is below, but the only interesting bits appear to be the following lines: org.apache.spark.api.python.PythonException: ... py4j.protocol.Py4JError: An error occurred while calling

Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Tathagata Das
Structured Streaming has a foreach sink, where you can essentially do what you want with your data. It's easy to create a Kafka producer and write the data out to Kafka. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach On Fri, Jan 13, 2017 at 8:28 AM,
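
A rough sketch of such a foreach sink, assuming Spark 2.x Structured Streaming and the standard Kafka Java client; the broker address and topic name are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.{ForeachWriter, Row}

    // Sends every streaming row to a Kafka topic; one producer is opened per partition/epoch.
    class KafkaSink(brokers: String, topic: String) extends ForeachWriter[Row] {
      @transient private var producer: KafkaProducer[String, String] = _

      override def open(partitionId: Long, version: Long): Boolean = {
        val props = new Properties()
        props.put("bootstrap.servers", brokers)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        producer = new KafkaProducer[String, String](props)
        true
      }

      override def process(row: Row): Unit =
        producer.send(new ProducerRecord(topic, row.mkString(",")))

      override def close(errorOrNull: Throwable): Unit =
        if (producer != null) producer.close()
    }

    // Usage: df.writeStream.foreach(new KafkaSink("broker:9092", "my_topic")).start()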

Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Koert Kuipers
How do you do this with Structured Streaming? I see no mention of writing to Kafka. On Fri, Jan 13, 2017 at 10:30 AM, Peyman Mohajerian wrote: > Yes, it is called Structured Streaming: https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html >

Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-13 Thread shyla deshpande
Hello, my Spark Streaming app that reads Kafka topics and prints the DStream works fine on my laptop, but on an AWS cluster it produces no output and no errors. Please help me debug. I am using Spark 2.0.2 and kafka-0-10. Thanks. The following is the output of the Spark Streaming app... 17/01/14
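
For reference, a minimal sketch of such an app with the kafka-0-10 integration; brokers, group id and topic are placeholders. Note that DStream.print() writes to the driver's stdout, which on a cluster is not the same place as the executor logs.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("kafka-print")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

    // Prints each batch's values on the driver.
    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()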

[Spark SQL - Scala] TestHive not working in Spark 2

2017-01-13 Thread Nicolas Tallineau
I get a NullPointerException as soon as I try to execute a TestHive.sql(...) statement since migrating to Spark 2, because it's trying to load non-existent "test tables". I couldn't find a way to set the loadTestTables variable to false. Caused by: sbt.ForkMain$ForkError:

Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Senthil Kumar
Hi Team, Sorry if this question was already asked in this forum. Can we ingest data into an Apache Kafka topic from a Spark SQL DataFrame? Here is my code, which reads a Parquet file: val sqlContext = new org.apache.spark.sql.SQLContext(sc); val df =
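
In the absence of a built-in Kafka sink for batch DataFrames, one common approach is a foreachPartition loop with a plain Kafka producer. A sketch, with the Parquet path, broker and topic as placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-to-kafka").getOrCreate()
    val df = spark.read.parquet("/path/to/file.parquet") // placeholder path

    // Serialize rows as JSON and send each partition with its own producer.
    df.toJSON.foreachPartition { rows: Iterator[String] =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      rows.foreach(r => producer.send(new ProducerRecord("my_topic", r)))
      producer.close()
    }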

Re: Running a spark code using submit job in google cloud platform

2017-01-13 Thread Anahita Talebi
Hi, Thanks for your answer. I have chosen "Spark" in the "job type". There isn't any option where we can choose the version. How can I choose a different version? Thanks, Anahita On Thu, Jan 12, 2017 at 6:39 PM, A Shaikh wrote: > You may have tested this code on Spark

Re: Running a spark code using submit job in google cloud platform

2017-01-13 Thread Dinko Srkoč
On 13 January 2017 at 13:55, Anahita Talebi wrote: > Hi, > > Thanks for your answer. > > I have chosen "Spark" in the "job type". There isn't any option where we can > choose the version. How can I choose a different version? There's "Preemptible workers, bucket,

Re: Spark in docker over EC2

2017-01-13 Thread Teng Qiu
Hi, you can take a look at this project. It is a distributed HA Spark cluster for an AWS environment using Docker; we put the Spark EC2 instances in an ELB and use this code snippet to get the instance IPs: https://github.com/zalando-incubator/spark-appliance/blob/master/utils.py#L49-L56

Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Peyman Mohajerian
Yes, it is called Structured Streaming: https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html On Fri, Jan 13, 2017 at 3:32 AM, Senthil Kumar wrote: > Hi Team , > >

Re: Running a spark code using submit job in google cloud platform

2017-01-13 Thread Anahita Talebi
Hello, Thanks a lot Dinko. Yes, now it is working perfectly. Cheers, Anahita On Fri, Jan 13, 2017 at 2:19 PM, Dinko Srkoč wrote: > On 13 January 2017 at 13:55, Anahita Talebi > wrote: > > Hi, > > > > Thanks for your answer. > > > > I have

Re: Schema evolution in tables

2017-01-13 Thread sim
There is no automated solution right now. You have to issue manual ALTER TABLE commands, which work for adding top-level columns but get tricky if you are adding a field in a deeply nested struct. Hopefully, the issue will be fixed in 2.2 because work has started on
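
For the simple top-level case, the manual step is Hive-style DDL along these lines (table and column names are hypothetical; depending on the Spark version, the statement may need to be run against Hive directly rather than through spark.sql):

    // Assumes a Hive-enabled SparkSession; adds one new top-level column.
    spark.sql("ALTER TABLE my_table ADD COLUMNS (new_col STRING)")
    // Adding a field inside a deeply nested struct is not covered by this DDL,
    // which is the tricky part mentioned above.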