[Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on a k8s cluster?

2018-10-28 Thread Zhang, Yuqi
Hello, I am Yuqi from Teradata Tokyo. Sorry to disturb you, but I have a problem using Spark 2.4's client-mode support on a Kubernetes cluster, and I would like to ask whether there is a solution. The problem occurs when I try to run spark-shell on Kubernetes v1.11.3
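For reference, a minimal Spark 2.4 client-mode launch against a Kubernetes API server might look like the sketch below; the API server URL, image name, and driver address are placeholders, and the exact settings depend on the cluster:

    $ spark-shell \
        --master k8s://https://<api-server>:6443 \
        --deploy-mode client \
        --conf spark.kubernetes.container.image=<registry>/spark:2.4.0 \
        --conf spark.kubernetes.namespace=default \
        --conf spark.executor.instances=2 \
        --conf spark.driver.host=<address-reachable-from-executors> \
        --conf spark.driver.port=7078

In client mode the driver runs inside the spark-shell JVM itself, so the executor pods must be able to connect back to spark.driver.host:spark.driver.port; when launching from inside a pod, a headless service pointing at the driver pod is the usual way to make that address routable.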

[GraphX] - OOM Java Heap Space

2018-10-28 Thread Thodoris Zois
Hello, I have the edges of a graph stored as Parquet files (about 3 GB). I am loading the graph and trying to compute the total number of triplets and triangles. Here is my code: val edges_parq = sqlContext.read.option("header","true").parquet(args(0) + "/year=" + year) val edges:
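Since the quoted code is cut off, here is a minimal sketch of the rest of that computation using GraphX's built-in TriangleCount; the column names ("src", "dst") and their types are assumptions:

    import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

    // Assumed schema: two long columns, "src" and "dst".
    val edgeRDD = edges_parq.rdd.map(r =>
      Edge(r.getAs[Long]("src"), r.getAs[Long]("dst"), ()))

    // partitionBy was required before triangleCount in older releases and
    // generally reduces shuffle pressure.
    val graph = Graph.fromEdges(edgeRDD, defaultValue = ())
      .partitionBy(PartitionStrategy.RandomVertexCut)

    val numTriplets = graph.triplets.count()
    // triangleCount returns per-vertex counts; each triangle is seen by 3 vertices.
    val numTriangles = graph.triangleCount().vertices.map(_._2.toLong).reduce(_ + _) / 3

For the heap-space error itself, the usual first steps are more partitions on the edge RDD, more executor memory, and a memory-and-disk storage level instead of the memory-only default, since triangleCount materializes large intermediate adjacency sets.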

Number of rows divided by rowsPerBlock cannot exceed maximum integer

2018-10-28 Thread Soheil Pourbafrani
Hi, while doing a Cartesian multiplication on a matrix, I got the error: pyspark.sql.utils.IllegalArgumentException: requirement failed: Number of rows divided by rowsPerBlock cannot exceed maximum integer. Here is the code: normalizer = Normalizer(inputCol="feature", outputCol="norm") data =
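If this comes from converting an IndexedRowMatrix to a BlockMatrix, the failing requirement is that numRows() divided by rowsPerBlock fit in an Int, and numRows() is derived from the largest row index plus one rather than the actual row count, so sparse or ID-derived indices can trip it. Below is a minimal sketch of the usual fix, re-indexing densely with zipWithIndex, written in Scala since the PySpark original is truncated; the name "normalized" stands in for the Normalizer output:

    import org.apache.spark.ml.linalg.{Vector => MLVector}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // "normalized" stands in for normalizer.transform(data).
    val rows = normalized.rdd.map(r => Vectors.fromML(r.getAs[MLVector]("norm")))

    // zipWithIndex assigns dense 0..n-1 indices, so numRows() equals the real
    // row count rather than (largest index + 1).
    val mat = new IndexedRowMatrix(
      rows.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })

    val blockMat = mat.toBlockMatrix(1024, 1024)

Raising rowsPerBlock is the other lever, but dense indices usually make that unnecessary.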

Re: Processing Flexibility Between RDD and Dataframe API

2018-10-28 Thread Adrienne Kole
Thanks for bringing this issue to the mailing list. As an addition, I would ask the same question about the DStreams and Structured Streaming APIs. Structured Streaming is high level, which makes it difficult to express all business logic in it, although Databricks is pushing it and recommending
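As a counterpoint, Structured Streaming does keep a typed escape hatch for per-key business logic that the built-in aggregations cannot express: mapGroupsWithState / flatMapGroupsWithState. A minimal sketch, with a hypothetical Event type and an assumed streaming source:

    import org.apache.spark.sql.streaming.GroupState
    import spark.implicits._

    // Hypothetical event type; the per-key state here is just a running count.
    case class Event(key: String, value: Long)

    def countPerKey(key: String, events: Iterator[Event],
                    state: GroupState[Long]): (String, Long) = {
      val total = state.getOption.getOrElse(0L) + events.size
      state.update(total)
      (key, total)
    }

    // events: a streaming Dataset[Event] read from some source.
    val counts = events.groupByKey(_.key).mapGroupsWithState(countPerKey _)

Whether that is expressive enough to replace DStreams for every workload is exactly the question being raised here.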

Processing Flexibility Between RDD and Dataframe API

2018-10-28 Thread Soheil Pourbafrani
Hi, there are functions like map, flatMap, reduce, and so on that form the basic data-processing operations in big data (and Apache Spark). But in recent versions Spark has introduced the high-level DataFrame API and recommends using it, even though there are no such functions in the DataFrame API
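Strictly speaking, those operators do survive in the typed half of the API: a DataFrame is a Dataset[Row], and .as[T] restores map, flatMap, and reduce. A minimal sketch (the input path is a placeholder):

    import spark.implicits._

    // spark.read.text yields a DataFrame with a single "value" column;
    // .as[String] converts it to a typed Dataset[String].
    val words = spark.read.text("input.txt")
      .as[String]
      .flatMap(_.split("\\s+"))
      .map(_.toLowerCase)

    // reduce is available on the typed Dataset too.
    val totalChars = words.map(_.length.toLong).reduce(_ + _)

The trade-off is that typed lambdas are opaque to the Catalyst optimizer, which is why the untyped, column-based API is what the project recommends for performance.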