Does the DataFrame Spark API write/create a single file instead of a directory as a result of a write operation?

2020-02-21 Thread Kshitij
Hi,

There is no DataFrame Spark API that writes/creates a single file instead
of a directory as the result of a write operation.

Both of the options below will create a directory containing a part file with a random name.

df.coalesce(1).write.csv()

df.write.csv()


Instead of creating a directory with the standard marker files (_SUCCESS,
_committed, _started), I want a single file with the file name I specify.
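
The closest workaround I know of is a sketch only, not a built-in API: write a
single partition to a temporary directory and then rename the lone part file
with the Hadoop FileSystem API. The paths, the header option, and the
SparkSession `spark` below are placeholders for illustration.

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: write one partition to a temporary directory, then move the lone
// part file to the desired name and drop the directory with its marker files.
val tmpDir = "/tmp/df_output_tmp"
df.coalesce(1).write.option("header", "true").csv(tmpDir)

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(tmpDir + "/part-*.csv"))(0).getPath
fs.rename(partFile, new Path("/tmp/desired_file_name.csv"))
fs.delete(new Path(tmpDir), true)  // remove the temp directory and _SUCCESS marker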


Thanks


PowerIterationClustering

2020-02-21 Thread Monish R
Hi guys,
I am new to MLlib and trying out PowerIterationClustering as per the example
mentioned below:

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaPowerIterationClusteringExample.java


I am having trouble understanding how the output is produced.
For instance, if I change the input as shown below, I would like to
understand how the algorithm arrives at grouping 0 and 2 in one cluster
while keeping the rest in another (a small sketch to reproduce this follows
the output below).

k = 2.

Input :
  new Tuple3<>(0L, 1L, 0.9),
  new Tuple3<>(1L, 2L, 0.7),
  new Tuple3<>(2L, 3L, 0.3),
  new Tuple3<>(3L, 4L, 0.5),
  new Tuple3<>(4L, 5L, 0.2)));

Output :
4 -> 0
0 -> 1
1 -> 0
3 -> 0
5 -> 0
2 -> 1
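
For reference, a minimal Scala sketch (assuming an existing SparkContext `sc`)
of running the same similarity graph through mllib's PowerIterationClustering,
which should reproduce assignments like the ones above:

import org.apache.spark.mllib.clustering.PowerIterationClustering

// The same graph as the Java example, as (srcId, dstId, similarity) tuples.
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9),
  (1L, 2L, 0.7),
  (2L, 3L, 0.3),
  (3L, 4L, 0.5),
  (4L, 5L, 0.2)))

val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(10)
  .run(similarities)

// Prints "id -> cluster" pairs such as "0 -> 1" and "2 -> 1".
model.assignments.collect().foreach(a => println(s"${a.id} -> ${a.cluster}"))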

Kindly guide me if you have any info on using the algorithm, or point me to
some materials suitable for beginners on this topic.



Regards.


Re: Serialization error when using scala kernel with Jupyter

2020-02-21 Thread Apostolos N. Papadopoulos
collect() returns the contents of the RDD back to the driver as a local
collection. Where is the local variable that holds the result?


Try

val result = rdd.map(x => x + 1).collect()
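
For example, a minimal sketch (assuming a SparkSession named `spark`) where the
collected result is held in a local Array on the driver:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 4))
val result: Array[Int] = rdd.map(x => x + 1).collect()  // Array(2, 3, 5) on the driver
result.foreach(println)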

regards,

Apostolos



On 21/2/20 21:28, Nikhil Goyal wrote:

Hi all,
I am trying to use the almond Scala kernel to run a Spark session on
Jupyter. I am using Scala version 2.12.8. I am creating the Spark session
with master set to YARN.

This is the code:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 4))
rdd.map(x => x + 1).collect()

Exception:
java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in 
instance of org.apache.spark.rdd.MapPartitionsRDD


I was wondering if anyone has seen this before.

Thanks
Nikhil


--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol



Serialization error when using scala kernel with Jupyter

2020-02-21 Thread Nikhil Goyal
Hi all,
I am trying to use the almond Scala kernel to run a Spark session on Jupyter. I
am using Scala version 2.12.8. I am creating the Spark session with master set
to YARN.
This is the code:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 4))
rdd.map(x => x + 1).collect()

Exception:

java.lang.ClassCastException: cannot assign instance of
java.lang.invoke.SerializedLambda to field
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in
instance of org.apache.spark.rdd.MapPartitionsRDD


I was wondering if anyone has seen this before.

Thanks
Nikhil


Spark RDD output path for data lineage

2020-02-21 Thread ard3nte
Hi, I am trying to do data lineage, so I need to extract the output path from an
RDD job (for example someRDD.saveAsTextFile("/path/")) using a SparkListener. How
can I do that?
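
As far as I know, the standard listener events do not carry the output path
directly; the job's call site (e.g. "saveAsTextFile at ...") is the closest
readily available hint. Below is a minimal sketch, assuming Spark's internal
"callSite.short" local-property key and an existing SparkContext `sc`; it is a
starting point, not a complete lineage solution.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Minimal listener sketch: logs each job's call site, e.g.
// "saveAsTextFile at App.scala:42". The "callSite.short" property key is an
// assumption about Spark's internal local properties.
class LineageListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val callSite = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("callSite.short")))
      .getOrElse("unknown")
    println(s"Job ${jobStart.jobId} started at call site: $callSite")
  }
}

// Register it before running the job, e.g.:
// sc.addSparkListener(new LineageListener)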




