Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, this might work then: df.coalesce(1).write.option("header","true").mode("overwrite").text("output") Regards, Snehasish

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to this

Job never finishing

2018-02-20 Thread Nikhil Goyal
Hi guys, I have a job which gets stuck if a couple of tasks get killed due to an OOM exception. Spark doesn't kill the job and it keeps on running for hours. Ideally I would expect Spark to kill the job or restart the killed executors, but nothing seems to be happening. Anybody got an idea about this?

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, even text won't work. You may try this: df.coalesce(1).write.option("header","true").mode("overwrite").save("output", format="text") Else, convert to an RDD and use saveAsTextFile. Regards, Snehasish
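
A minimal pyspark sketch of that RDD fallback (the DataFrame name df and the output path are assumptions; str() is used because it also stringifies Vector values):

    # Render every row as one comma-separated text line; str() handles Vectors too.
    df.rdd.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("output")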

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread vermanurag
If your dataframe has column types like vector, then you cannot save as csv/text, as there is no direct equivalent supported by flat formats like csv/text. You may need to convert the column type appropriately (e.g. convert the incompatible column to StringType) before saving the output as csv.
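
A hedged sketch of that conversion in pyspark (the column name "features" and the output path are assumptions, not from the thread):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Stringify the Vector column so the flat CSV writer can serialize it.
    vector_to_str = udf(lambda v: str(v), StringType())
    (df.withColumn("features", vector_to_str("features"))
       .coalesce(1).write.option("header", "true").mode("overwrite").csv("output"))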

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi Snehasish, Unfortunately, none of the solutions worked. Regards, Mina

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to the below code it works. However, I don't believe it is the solution I am looking for. I want to be able to do it in raw SQL, and moreover, if a user gives a big chained raw Spark SQL join query, I am not even sure how to make copies of the dataframe to achieve the self-join. Is
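
The working variant referenced above is truncated away, but as a hedged sketch (written in pyspark for brevity; the key column is an assumption), a raw SQL self-join over the registered view would look like the following. Whether Spark 2.3.0 plans this correctly on a streaming source is exactly what the thread is asking:

    # Self-join the temp view under two aliases on an assumed key column.
    result = spark.sql("""
        SELECT a.key, a.value, b.value AS other_value
        FROM table a JOIN table b ON a.key = b.key
    """)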

Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-20 Thread Felix Cheung
No, it does not support bidirectional edges as of now.

Re: Can spark handle this scenario?

2018-02-20 Thread Lian Jiang
Thanks Vijay! This is very clear.

what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
Hi All, I have the following code:

    import org.apache.spark.sql.streaming.Trigger

    val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "join_test").option("startingOffsets", "earliest").load();

    jdf.createOrReplaceTempView("table")

Re: Job never finishing

2018-02-20 Thread Femi Anthony
You can use Spark speculation as a way to get around the problem. Here is a useful link: http://asyncified.io/2016/08/13/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/
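
A hedged sketch of enabling speculation when building the session (the values are illustrative, not tuned recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.speculation", "true")           # re-launch suspiciously slow tasks
             .config("spark.speculation.multiplier", "3")   # "slow" = 3x the median task time
             .config("spark.speculation.quantile", "0.9")   # check only after 90% of tasks finish
             .getOrCreate())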

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi, I was hoping that there is a method for casting a vector into String (instead of writing my own UDF), so that it can then be serialized into a csv/text file. Best regards, Mina

Re: sqoop import job not working when spark thrift server is running.

2018-02-20 Thread akshay naidu
hello vijay, appreciate your reply.

> what was the error when you are trying to run mapreduce import job when the thrift server is running.

it didn't throw any error, it just gets stuck at "INFO mapreduce.Job: Running job: job_151911053" and resumes the moment I kill Thrift. thanks

Re: Can spark handle this scenario?

2018-02-20 Thread vijay.bvp
I am assuming the pullSymbolFromYahoo function opens a connection to the Yahoo API with some token passed. In the code provided so far, if you have 2000 symbols, it will make 2000 new connections!! and 2000 API calls. Connection objects can't/shouldn't be serialized and sent to executors, they should
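
A common shape for that fix, sketched in pyspark (open_yahoo_connection, pull_symbol, symbols_rdd, and token are hypothetical stand-ins for whatever the original code uses):

    def pull_partition(symbols):
        # One connection per partition, created on the executor, not the driver.
        conn = open_yahoo_connection(token)        # hypothetical helper
        try:
            for symbol in symbols:
                yield pull_symbol(conn, symbol)    # hypothetical per-symbol fetch
        finally:
            conn.close()

    results = symbols_rdd.mapPartitions(pull_partition)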

Re: sqoop import job not working when spark thrift server is running.

2018-02-20 Thread vijay.bvp
What was the error when you are trying to run the mapreduce import job when the thrift server is running? Is this the only config changed? What was the config before... Also share the Spark Thrift Server job config, such as no. of executors, cores, memory etc. My guess is your mapreduce job is unable to
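
If that guess is right and the Thrift Server is holding all the YARN resources, one hedged mitigation sketch is to cap its footprint when starting it (start-thriftserver.sh accepts spark-submit options; the flag values below are illustrative):

    ./sbin/start-thriftserver.sh \
      --master yarn \
      --num-executors 2 \
      --executor-cores 2 \
      --executor-memory 4g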

Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-20 Thread Ramon Bejar
But is it not possible to compute with both directions of an edge, like it happens with GraphX? On 02/20/2018 03:01 AM, Felix Cheung wrote: Generally that would be the approach. But since you effectively have double the number of edges, this will likely affect the scale at which your job will run.
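
A minimal pyspark sketch of the edge-doubling approach Felix describes (the vertices and edges DataFrames are assumptions; GraphFrames expects src/dst columns on edges):

    from graphframes import GraphFrame

    # Emulate undirected edges by unioning each edge with its reverse.
    reversed_edges = edges.selectExpr("dst AS src", "src AS dst")
    g = GraphFrame(vertices, edges.union(reversed_edges))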

Re: The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Sorry, please ignore. It works now!

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-20 Thread LongVehicle
Hi Vijay, thanks for the follow-up. The reason we have 90 HDFS files (causing a parallelism of 90 for the HDFS read stage) is that we load the same HDFS data in different jobs, and these jobs have parallelisms (executors x cores) of 9, 18, and 30. The uneven assignment problem that we had
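
A hedged illustration of the arithmetic being described (the path is a stand-in): 90 input files give the read stage 90 tasks, and 90 divides evenly by 9, 18, and 30, so each core in every job gets the same number of read tasks:

    df = spark.read.parquet("hdfs:///shared/input")   # 90 files -> ~90 read tasks
    print(df.rdd.getNumPartitions())                  # expect roughly 90
    # 90 % 9 == 90 % 18 == 90 % 30 == 0, so the tasks pack evenly
    # across jobs with executors * cores of 9, 18, or 30.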

The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Hi All, I am reading records from Kafka using Spark 2.2.0 Structured Streaming. I can see my Dataframe has a schema like below. The timestamp column seems to be the same for every record and I am not sure why. Am I missing something (did I fail to configure something)? Thanks! Column Type key

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, this might help: df.coalesce(1).write.option("header","true").mode("overwrite").csv("output") Regards, Snehasish

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi Snehasish, Using df.coalesce(1).write.option("header","true").mode("overwrite").csv("output") throws java.lang.UnsupportedOperationException: CSV data source does not support struct<...> data type. Regards, Mina

Save the date: ApacheCon North America, September 24-27 in Montréal

2018-02-20 Thread Rich Bowen
Dear Apache Enthusiast, (You’re receiving this message because you’re subscribed to a user@ or dev@ list of one or more Apache Software Foundation projects.) We’re pleased to announce the upcoming ApacheCon [1] in Montréal, September 24-27. This event is all about you — the Apache project

Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi, I would like to serialize a dataframe with vector values into a text/csv in pyspark. Using the below line, I can write the dataframe (e.g. df) as parquet; however, I cannot open it in excel/as text. df.coalesce(1).write.option("header","true").mode("overwrite").save("output") Best regards, Mina

Write a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi, I would like to write a dataframe with vector values into a text/csv file. Using the below line, I can write it as parquet; however, I cannot open it in excel/as text. df.coalesce(1).write.option("header","true").mode("overwrite").save("stage-s3logs-model") Wondering how to save the result of a