Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
 Hi Mina,
This might work then

df.coalesce(1).write.option("header","true").mode("overwrite").text("output")

Regards,
Snehasish

On Wed, Feb 21, 2018 at 3:21 AM, Mina Aslani  wrote:

> Hi Snehasish,
>
> Using df.coalesce(1).write.option("header","true").mode("overwrite
> ").csv("output") throws
>
> java.lang.UnsupportedOperationException: CSV data source does not support
> struct<...> data type.
>
>
> Regards,
> Mina
>
>
>
>
> On Tue, Feb 20, 2018 at 4:36 PM, SNEHASISH DUTTA  > wrote:
>
>> Hi Mina,
>> This might help
>> df.coalesce(1).write.option("header","true").mode("overwrite
>> ").csv("output")
>>
>> Regards,
>> Snehasish
>>
>> On Wed, Feb 21, 2018 at 1:53 AM, Mina Aslani 
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to serialize a dataframe with vector values into a text/csv
>>> in pyspark.
>>>
>>> Using below line, I can write the dataframe(e.g. df) as parquet, however
>>> I cannot open it in excel/as text.
>>> df.coalesce(1).write.option("header","true").mode("overwrite
>>> ").save("output")
>>>
>>> Best regards,
>>> Mina
>>>
>>>
>>
>


Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to this




On Tue, Feb 20, 2018 at 7:52 PM, kant kodali  wrote:

> Hi All,
>
> I have the following code
>
> import org.apache.spark.sql.streaming.Trigger
>
> val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe", "join_test").option("startingOffsets", 
> "earliest").load();
>
> jdf.createOrReplaceTempView("table")
>
> val resultdf = spark.sql("select * from table as x inner join table as y on 
> x.offset=y.offset")
>
> resultdf.writeStream.outputMode("update").format("console").option("truncate",
>  false).trigger(Trigger.ProcessingTime(1000)).start()
>
> and I get the following exception.
>
> org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`' given 
> input columns: [x.value, x.offset, x.key, x.timestampType, x.topic, 
> x.timestamp, x.partition]; line 1 pos 50;
> 'Project [*]
> +- 'Join Inner, ('x.offset = 'y.offset)
>:- SubqueryAlias x
>:  +- SubqueryAlias table
>: +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
>  -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
> offset#32L, timestamp#33, timestampType#34]
>+- SubqueryAlias y
>   +- SubqueryAlias table
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
>  -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
> offset#32L, timestamp#33, timestampType#34]
>
> any idea whats wrong here?
>
> Thanks!
>
>
>
>
>
>
>


Job never finishing

2018-02-20 Thread Nikhil Goyal
Hi guys,

I have a job which gets stuck if a couple of tasks get killed due to an OOM
exception. Spark doesn't kill the job, and it keeps on running for hours.
Ideally I would expect Spark to kill the job or restart the killed
executors, but nothing seems to be happening. Does anybody have an idea about this?

Thanks
Nikhil


Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina,

Even text won't work. You may try this: df.coalesce(1).write.option("header","true").mode("overwrite").save("output", format="text")
Otherwise, convert to an RDD and use saveAsTextFile.

Regards,
Snehasish

On Wed, Feb 21, 2018 at 3:38 AM, SNEHASISH DUTTA 
wrote:

> Hi Mina,
> This might work then
>
> df.coalesce(1).write.option("header","true").mode("overwrite
> ").text("output")
>
> Regards,
> Snehasish
>
> On Wed, Feb 21, 2018 at 3:21 AM, Mina Aslani  wrote:
>
>> Hi Snehasish,
>>
>> Using df.coalesce(1).write.option("header","true").mode("overwrite
>> ").csv("output") throws
>>
>> java.lang.UnsupportedOperationException: CSV data source does not
>> support struct<...> data type.
>>
>>
>> Regards,
>> Mina
>>
>>
>>
>>
>> On Tue, Feb 20, 2018 at 4:36 PM, SNEHASISH DUTTA <
>> info.snehas...@gmail.com> wrote:
>>
>>> Hi Mina,
>>> This might help
>>> df.coalesce(1).write.option("header","true").mode("overwrite
>>> ").csv("output")
>>>
>>> Regards,
>>> Snehasish
>>>
>>> On Wed, Feb 21, 2018 at 1:53 AM, Mina Aslani 
>>> wrote:
>>>
 Hi,

 I would like to serialize a dataframe with vector values into a
 text/csv in pyspark.

 Using below line, I can write the dataframe(e.g. df) as parquet,
 however I cannot open it in excel/as text.
 df.coalesce(1).write.option("header","true").mode("overwrite
 ").save("output")

 Best regards,
 Mina


>>>
>>
>


Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread vermanurag
If your dataframe has column types like vector, then you cannot save it as
csv/text, because there is no direct equivalent supported by flat formats like
csv/text. You need to convert the column type appropriately (e.g. convert the
incompatible column to StringType) before saving the output as csv. You may
need to write a UDF to convert the column type.
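
A minimal pyspark sketch of that UDF approach (the column name "features", the
paths, and the app name are assumptions for illustration only, not part of the
original question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("vector-to-csv-sketch").getOrCreate()

# Turn an ML Vector into a plain comma-separated string so the CSV writer accepts it.
vector_to_string = udf(
    lambda v: None if v is None else ",".join(str(x) for x in v.toArray()),
    StringType())

df = spark.read.parquet("output")   # hypothetical dataframe with a vector column
df = df.withColumn("features", vector_to_string("features"))

df.coalesce(1).write.option("header", "true").mode("overwrite").csv("output_csv")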



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi Snehasish,

Unfortunately, none of the solutions worked.

Regards,
Mina

On Tue, Feb 20, 2018 at 5:12 PM, SNEHASISH DUTTA 
wrote:

> Hi Mina,
>
> Even text won't work you may try this  df.coalesce(1).write.option("h
> eader","true").mode("overwrite").save("output",format=text)
> Else convert to an rdd and use saveAsTextFile
>
> Regards,
> Snehasish
>
> On Wed, Feb 21, 2018 at 3:38 AM, SNEHASISH DUTTA  > wrote:
>
>> Hi Mina,
>> This might work then
>>
>> df.coalesce(1).write.option("header","true").mode("overwrite
>> ").text("output")
>>
>> Regards,
>> Snehasish
>>
>> On Wed, Feb 21, 2018 at 3:21 AM, Mina Aslani 
>> wrote:
>>
>>> Hi Snehasish,
>>>
>>> Using df.coalesce(1).write.option("header","true").mode("overwrite
>>> ").csv("output") throws
>>>
>>> java.lang.UnsupportedOperationException: CSV data source does not
>>> support struct<...> data type.
>>>
>>>
>>> Regards,
>>> Mina
>>>
>>>
>>>
>>>
>>> On Tue, Feb 20, 2018 at 4:36 PM, SNEHASISH DUTTA <
>>> info.snehas...@gmail.com> wrote:
>>>
 Hi Mina,
 This might help
 df.coalesce(1).write.option("header","true").mode("overwrite
 ").csv("output")

 Regards,
 Snehasish

 On Wed, Feb 21, 2018 at 1:53 AM, Mina Aslani 
 wrote:

> Hi,
>
> I would like to serialize a dataframe with vector values into a
> text/csv in pyspark.
>
> Using below line, I can write the dataframe(e.g. df) as parquet,
> however I cannot open it in excel/as text.
> df.coalesce(1).write.option("header","true").mode("overwrite
> ").save("output")
>
> Best regards,
> Mina
>
>

>>>
>>
>


Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to the code below, it works. However, I don't believe it is
the solution I am looking for. I want to be able to do it in raw SQL, and
moreover, if a user gives a big chained raw Spark SQL join query, I am not
even sure how to make copies of the dataframe to achieve the self-join. Is
there any other way here?



import org.apache.spark.sql.streaming.Trigger

val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("subscribe",
"join_test").option("startingOffsets", "earliest").load();
val jdf1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("subscribe",
"join_test").option("startingOffsets", "earliest").load();

jdf.createOrReplaceTempView("table")
jdf1.createOrReplaceTempView("table1")

val resultdf = spark.sql("select * from table inner join table1 on table.offset=table1.offset")

resultdf.writeStream.outputMode("append").format("console").option("truncate",
false).trigger(Trigger.ProcessingTime(1000)).start()


On Tue, Feb 20, 2018 at 8:16 PM, kant kodali  wrote:

> If I change it to this
>
>
>
>
> On Tue, Feb 20, 2018 at 7:52 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I have the following code
>>
>> import org.apache.spark.sql.streaming.Trigger
>>
>> val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
>> "localhost:9092").option("subscribe", "join_test").option("startingOffsets", 
>> "earliest").load();
>>
>> jdf.createOrReplaceTempView("table")
>>
>> val resultdf = spark.sql("select * from table as x inner join table as y on 
>> x.offset=y.offset")
>>
>> resultdf.writeStream.outputMode("update").format("console").option("truncate",
>>  false).trigger(Trigger.ProcessingTime(1000)).start()
>>
>> and I get the following exception.
>>
>> org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`' given 
>> input columns: [x.value, x.offset, x.key, x.timestampType, x.topic, 
>> x.timestamp, x.partition]; line 1 pos 50;
>> 'Project [*]
>> +- 'Join Inner, ('x.offset = 'y.offset)
>>:- SubqueryAlias x
>>:  +- SubqueryAlias table
>>: +- StreamingRelation 
>> DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
>>  -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
>> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
>> offset#32L, timestamp#33, timestampType#34]
>>+- SubqueryAlias y
>>   +- SubqueryAlias table
>>  +- StreamingRelation 
>> DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
>>  -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
>> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
>> offset#32L, timestamp#33, timestampType#34]
>>
>> any idea whats wrong here?
>>
>> Thanks!
>>
>>
>>
>>
>>
>>
>>
>


Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

2018-02-20 Thread Felix Cheung
No, it does not support bidirectional edges as of now.
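
For reference, the workaround discussed in the quoted thread below (materializing
one edge per direction) could look roughly like the following pyspark sketch; the
sample data is made up, and only the GraphFrames convention of "src"/"dst" edge
columns is assumed:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("bidirectional-edges-sketch").getOrCreate()

# Directed edges, one row per relationship.
edges = spark.createDataFrame(
    [("a", "b", "friend"), ("b", "c", "friend")],
    ["src", "dst", "relationship"])

# Union the edges with their reversed copies; as noted below, this doubles the edge count.
bidirectional_edges = edges.union(
    edges.select(col("dst").alias("src"), col("src").alias("dst"), col("relationship")))

# A GraphFrame would then be built as GraphFrame(vertices, bidirectional_edges),
# assuming the graphframes package and a matching vertices DataFrame (not shown here).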

_
From: xiaobo 
Sent: Tuesday, February 20, 2018 4:35 AM
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships
To: Felix Cheung , 


So the question comes to does graphframes support bidirectional relationship 
natively with only one edge?



-- Original --
From: Felix Cheung 
Date: Tue,Feb 20,2018 10:01 AM
To: xiaobo , user@spark.apache.org 
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

Generally that would be the approach.
But since you have effectively doubled the number of edges, this will likely 
affect the scale at which your job will run.


From: xiaobo 
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject: [graphframes]how Graphframes Deal With Bidirectional Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert two edges 
for the vertex pair; my question is, do the algorithms of graphframes still 
work when we do this?

Thanks





Re: Can spark handle this scenario?

2018-02-20 Thread Lian Jiang
Thanks Vijay! This is very clear.

On Tue, Feb 20, 2018 at 12:47 AM, vijay.bvp  wrote:

> I am assuming pullSymbolFromYahoo functions opens a connection to yahoo API
> with some token passed, in the code provided so far if you have 2000
> symbols, it will make 2000 new connections!! and 2000 API calls
> connection objects can't/shouldn't be serialized and send to executors,
> they
> should rather be created at executors.
>
> the philosophy given below is nicely documented on Spark Streaming, look at
> Design Patterns for using foreachRDD
> https://spark.apache.org/docs/latest/streaming-programming-
> guide.html#output-operations-on-dstreams
>
>
> case class Symbol(symbol: String, sector: String)
> case class Tick(symbol: String, sector: String, open: Double, close:
> Double)
> //assume symbolDs is rdd of symbol and tick dataset/dataframe can be
> converted to RDD
> symbolRdd.foreachPartition(partition => {
>//this code runs at executor
>   //open a connection here -
>   val connectionToYahoo = new HTTPConnection()
>
>   partition.foreach(k => {
>   pullSymbolFromYahoo(k.symbol, k.sector,connectionToYahoo)
>   }
> }
>
> with the above code if the dataset has 10 partitions (2000 symbols), only
> 10
> connections will be opened though it makes 2000 API calls.
> you should also be looking at sending and receiving results for large
> number
> of symbols, because of the amount of parallelism that spark provides you
> might run into rate limit on the APIs. if you are bulk sending symbols
> above
> pattern also very much useful
>
> thanks
> Vijay
>
>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
Hi All,

I have the following code

import org.apache.spark.sql.streaming.Trigger

val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("subscribe",
"join_test").option("startingOffsets", "earliest").load();

jdf.createOrReplaceTempView("table")

val resultdf = spark.sql("select * from table as x inner join table as y on x.offset=y.offset")

resultdf.writeStream.outputMode("update").format("console").option("truncate",
false).trigger(Trigger.ProcessingTime(1000)).start()

and I get the following exception.

org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`'
given input columns: [x.value, x.offset, x.key, x.timestampType,
x.topic, x.timestamp, x.partition]; line 1 pos 50;
'Project [*]
+- 'Join Inner, ('x.offset = 'y.offset)
   :- SubqueryAlias x
   :  +- SubqueryAlias table
   : +- StreamingRelation
DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
-> earliest, subscribe -> join_test, kafka.bootstrap.servers ->
localhost:9092),None), kafka, [key#28, value#29, topic#30,
partition#31, offset#32L, timestamp#33, timestampType#34]
   +- SubqueryAlias y
  +- SubqueryAlias table
 +- StreamingRelation
DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
-> earliest, subscribe -> join_test, kafka.bootstrap.servers ->
localhost:9092),None), kafka, [key#28, value#29, topic#30,
partition#31, offset#32L, timestamp#33, timestampType#34]

any idea what's wrong here?

Thanks!


Re: Job never finishing

2018-02-20 Thread Femi Anthony
You can use spark speculation as a way to get around the problem.

Here is a useful link:

http://asyncified.io/2016/08/13/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/
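
As a rough illustration only, speculation is turned on through Spark configuration;
a minimal pyspark sketch, where the threshold values are illustrative assumptions
rather than tuned recommendations for this particular job:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-sketch")
         .config("spark.speculation", "true")           # re-launch suspiciously slow tasks
         .config("spark.speculation.multiplier", "2")   # "slow" = slower than 2x the median task
         .config("spark.speculation.quantile", "0.75")  # only check after 75% of tasks have finished
         .getOrCreate())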

Sent from my iPhone

> On Feb 20, 2018, at 5:52 PM, Nikhil Goyal  wrote:
> 
> Hi guys,
> 
> I have a job which gets stuck if a couple of tasks get killed due to OOM 
> exception. Spark doesn't kill the job and it keeps on running for hours. 
> Ideally I would expect Spark to kill the job or restart the killed executors 
> but nothing seems to be happening. Anybody got idea about this?
> 
> Thanks
> Nikhil


Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi,

I was hoping that there is a built-in method to cast a vector into a String
(instead of writing my own UDF), so that it can then be serialized into a csv/text file.

Best regards,
Mina

On Tue, Feb 20, 2018 at 6:52 PM, vermanurag 
wrote:

> If your dataframe has columns types like vector then you cannot save as
> csv/
> text as there are no direct equivalent supported by flat formats like csv/
> text. You may need to convert the column type appropriately (eg. convert
> the
> incompatible column to StringType before saving the output as csv. You may
> need to write a UDF to convert the column Type.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: sqoop import job not working when spark thrift server is running.

2018-02-20 Thread akshay naidu
Hello Vijay,
I appreciate your reply.

> what was the error when you are trying to run mapreduce import job when the
> thrift server is running.


It didn't throw any error; it just gets stuck at
INFO mapreduce.Job: Running job: job_151911053

and resumes the moment I kill the Thrift server.

Thanks

On Tue, Feb 20, 2018 at 1:48 PM, vijay.bvp  wrote:

> what was the error when you are trying to run mapreduce import job when the
> thrift server is running.
> this is only config changed? what was the config before...
> also share the spark thrift server job config such as no of executors,
> cores
> memory etc.
>
> My guess is your mapreduce job is unable to get sufficient resources,
> container couldn't be launched and so failing to start, this could either
> because of non availability sufficient cores or RAM
>
> 9 worker nodes 12GB RAM each with 6 cores (max allowed cores 4 per
> container)
> you have to keep some room for operation system and other daemons.
>
> if thrift server is setup to have 11 executors with 3 cores each = 33 cores
> for workers and 1 for driver so 34 cores required for spark job and rest
> for
> any other jobs.
>
> spark driver and worker memory is ~9GB
> with 9 12 GB RAM worker nodes not sure how much you can allocate.
>
> thanks
> Vijay
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Can spark handle this scenario?

2018-02-20 Thread vijay.bvp
I am assuming the pullSymbolFromYahoo function opens a connection to the Yahoo API
with some token passed. In the code provided so far, if you have 2000 symbols, it
will make 2000 new connections and 2000 API calls. Connection objects can't/shouldn't
be serialized and sent to executors; they should rather be created at the executors.

This philosophy is nicely documented in the Spark Streaming guide; look at
"Design Patterns for using foreachRDD":
https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams


case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)
// assume symbolRdd is an RDD of Symbol; a Dataset/DataFrame can be converted to an RDD
symbolRdd.foreachPartition(partition => {
  // this code runs at the executor
  // open a connection here
  val connectionToYahoo = new HTTPConnection()

  partition.foreach(k => {
    pullSymbolFromYahoo(k.symbol, k.sector, connectionToYahoo)
  })
})

With the above code, if the dataset has 10 partitions (2000 symbols), only 10
connections will be opened even though it makes 2000 API calls.
You should also look at how you send and receive results for a large number
of symbols; because of the amount of parallelism that Spark provides, you
might run into rate limits on the APIs. If you are bulk-sending symbols, the
above pattern is also very useful.

thanks
Vijay







--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: sqoop import job not working when spark thrift server is running.

2018-02-20 Thread vijay.bvp
What was the error when you tried to run the mapreduce import job while the
thrift server was running?
Is this the only config that changed? What was the config before?
Also share the Spark Thrift Server job config, such as the number of executors,
cores, memory, etc.

My guess is that your mapreduce job is unable to get sufficient resources: a
container couldn't be launched and so the job fails to start. This could be
because of non-availability of sufficient cores or RAM.

You have 9 worker nodes with 12 GB RAM each and 6 cores (max allowed cores 4
per container), and you have to keep some room for the operating system and
other daemons.

If the thrift server is set up to have 11 executors with 3 cores each, that is
33 cores for workers plus 1 for the driver, so 34 cores are required for the
Spark job, leaving the rest for any other jobs.

Spark driver and worker memory is ~9 GB; with 9 worker nodes of 12 GB RAM each,
I am not sure how much more you can allocate.

thanks
Vijay



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-20 Thread Ramon Bejar
But is it not possible to compute with both directions of an edge, as happens
with GraphX?



On 02/20/2018 03:01 AM, Felix Cheung wrote:

Generally that would be the approach.
But since you have effectively doubled the number of edges, this will 
likely affect the scale at which your job will run.



*From:* xiaobo 
*Sent:* Monday, February 19, 2018 3:22:02 AM
*To:* user@spark.apache.org
*Subject:* [graphframes]how Graphframes Deal With Bidirectional 
Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert 
two edges for the vertex pair; my question is, do the algorithms of 
graphframes still work when we do this?


Thanks





Re: The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Sorry, please ignore. It works now!

On Tue, Feb 20, 2018 at 5:41 AM, kant kodali  wrote:

> Hi All,
>
> I am reading records from Kafka using Spark 2.2.0 Structured Streaming. I
> can see my Dataframe has a schema like below. The timestamp column seems to
> be same for every record and I am not sure why? am I missing something (did
> I fail to configure something)?
>
> Thanks!
>
>
>
> Column Type
> key binary
> value binary
> topic string
> partition int
> offset long
> timestamp long
> timestampType int
>


Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-20 Thread LongVehicle
Hi Vijay,

Thanks for the follow-up.

The reason we have 90 HDFS files (causing the parallelism of 90 for the HDFS
read stage) is that we load the same HDFS data in different jobs, and these
jobs have parallelisms (executors x cores) of 9, 18, and 30. The uneven
assignment problem that we had before could not be explained by the modulo
operation/remainder, because we sometimes had only 2 executors active out of
9 (while the remaining 7 would stay completely idle).

We tried to repartition the Kafka stream to 90 partitions, but it led to an
even worse imbalance in the load. It seems that keeping the number of
partitions equal to executors x cores reduces the chance of uneven
assignment.

We also tried to repartition the HDFS data to 9 partitions, but it did not
help, because repartition takes into account the initial locality of the data,
so the 9 partitions may end up on 9 different cores. We also tried to set
spark.shuffle.reduceLocality.enabled=false, but it did not help. Last but
not least, we want to avoid coalesce, because then the partitions would depend
on the HDFS block distribution, so they would not be hash partitioned (which
we need for the join).
Please find below the relevant UI snapshots:


 
 

The snapshots refer to the batch in which the RDD is reloaded (WholeStageCodegen
1147 is gray except in the batch at reload time, which happens every 30
minutes).

Thanks a lot!



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Hi All,

I am reading records from Kafka using Spark 2.2.0 Structured Streaming. I
can see my Dataframe has a schema like the one below. The timestamp column
seems to be the same for every record and I am not sure why. Am I missing
something (did I fail to configure something)?

Thanks!



Column          Type
key             binary
value           binary
topic           string
partition       int
offset          long
timestamp       long
timestampType   int


Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina,
This might help
df.coalesce(1).write.option("header","true").mode("overwrite").csv("output")

Regards,
Snehasish

On Wed, Feb 21, 2018 at 1:53 AM, Mina Aslani  wrote:

> Hi,
>
> I would like to serialize a dataframe with vector values into a text/csv
> in pyspark.
>
> Using below line, I can write the dataframe(e.g. df) as parquet, however I
> cannot open it in excel/as text.
> df.coalesce(1).write.option("header","true").mode("overwrite
> ").save("output")
>
> Best regards,
> Mina
>
>


Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi Snehasish,

Using df.coalesce(1).write.option("header","true").mode("overwrite").csv("output") throws

java.lang.UnsupportedOperationException: CSV data source does not support
struct<...> data type.


Regards,
Mina




On Tue, Feb 20, 2018 at 4:36 PM, SNEHASISH DUTTA 
wrote:

> Hi Mina,
> This might help
> df.coalesce(1).write.option("header","true").mode("overwrite
> ").csv("output")
>
> Regards,
> Snehasish
>
> On Wed, Feb 21, 2018 at 1:53 AM, Mina Aslani  wrote:
>
>> Hi,
>>
>> I would like to serialize a dataframe with vector values into a text/csv
>> in pyspark.
>>
>> Using below line, I can write the dataframe(e.g. df) as parquet, however
>> I cannot open it in excel/as text.
>> df.coalesce(1).write.option("header","true").mode("overwrite
>> ").save("output")
>>
>> Best regards,
>> Mina
>>
>>
>


Save the date: ApacheCon North America, September 24-27 in Montréal

2018-02-20 Thread Rich Bowen

Dear Apache Enthusiast,

(You’re receiving this message because you’re subscribed to a user@ or 
dev@ list of one or more Apache Software Foundation projects.)


We’re pleased to announce the upcoming ApacheCon [1] in Montréal, 
September 24-27. This event is all about you — the Apache project community.


We’ll have four tracks of technical content this time, as well as lots 
of opportunities to connect with your project community, hack on the 
code, and learn about other related (and unrelated!) projects across the 
foundation.


The Call For Papers (CFP) [2] and registration are now open. Register 
early to take advantage of the early bird prices and secure your place 
at the event hotel.


Important dates
March 30: CFP closes
April 20: CFP notifications sent
August 24: Hotel room block closes (please do not wait until the last minute)


Follow @ApacheCon on Twitter to be the first to hear announcements about 
keynotes, the schedule, evening events, and everything you can expect to 
see at the event.


See you in Montréal!

Sincerely, Rich Bowen, V.P. Events,
on behalf of the entire ApacheCon team

[1] http://www.apachecon.com/acna18
[2] https://cfp.apachecon.com/conference.html?apachecon-north-america-2018

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi,

I would like to serialize a dataframe with vector values into a text/csv file
in pyspark.

Using the line below, I can write the dataframe (e.g. df) as parquet; however,
I cannot open it in Excel or as text.
df.coalesce(1).write.option("header","true").mode("overwrite").save("output")

Best regards,
Mina


Write a DataFrame with Vector values into text/csv file

2018-02-20 Thread Mina Aslani
Hi,

I would like to write a dataframe with vector values into a text/csv file.

Using the line below, I can write it as parquet; however, I cannot open it in
Excel or as text.
df.coalesce(1).write.option("header","true").mode("overwrite").save("stage-s3logs-model")

I am wondering how to save the result of an MLlib transformation function (e.g.
OneHotEncoder), which generates vectors, into a file.

Best regards,
Mina