Re: Spark inserting into parquet files with different schema

2015-08-08 Thread sim
Adam, did you find a solution for this?






Re: Schema change on Spark Hive (Parquet file format) table not working

2015-08-08 Thread sim
Yes, I've found a number of problems with metadata management in Spark SQL. 

One core issue is SPARK-9764. Related issues are SPARK-9342, SPARK-9761, and
SPARK-9762.

I've also observed a case where, after an exception in ALTER TABLE, Spark
SQL thought a table had 0 rows while, in fact, all the data was still there.
I was not able to reproduce this one reliably so I did not create a JIRA
issue for it.

Let's vote for these issues and get them resolved.






Re: Spark master driver UI: How to keep it after process finished?

2015-08-08 Thread Andrew Or
Hi Saif,

You need to run your application with `spark.eventLog.enabled` set to true.
Then if you are using standalone mode, you can view the Master UI at port
8080. Otherwise, you may start a history server through
`sbin/start-history-server.sh`, which by default starts the history UI at
port 18080.

For more information on how to set this up, visit:
http://spark.apache.org/docs/latest/monitoring.html
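
For example, a minimal spark-defaults.conf sketch (the HDFS log directory is
illustrative and must already exist):

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-event-logs
spark.history.fs.logDirectory    hdfs:///spark-event-logs

With that in place, finished applications remain browsable in the history UI
after the driver exits.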

-Andrew


2015-08-07 13:16 GMT-07:00 François Pelletier <
newslett...@francoispelletier.org>:

>
> Look at spark.history.ui.port if you use standalone, or
> spark.yarn.historyServer.address if you use YARN,
> in your Spark config file.
>
> Mine is located at
> /etc/spark/conf/spark-defaults.conf
>
> If you use Apache Ambari you can find these settings in the Spark / Configs
> / Advanced spark-defaults tab
>
> François
>
>
> On 2015-08-07 15:58, saif.a.ell...@wellsfargo.com wrote:
>
> Hello, thank you, but that port is unreachable for me. Can you please
> tell me where I can find the equivalent port in my environment?
>
>
>
> Thank you
>
> Saif
>
>
>
> *From:* François Pelletier [mailto:newslett...@francoispelletier.org
> ]
> *Sent:* Friday, August 07, 2015 4:38 PM
> *To:* user@spark.apache.org
> *Subject:* Re: Spark master driver UI: How to keep it after process
> finished?
>
>
>
> Hi, all spark processes are saved in the Spark History Server
>
> look at your host on port 18080 instead of 4040
>
> François
>
> On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
>
> Hi,
>
>
>
> A silly question here. The Driver Web UI dies when the spark-submit
> program finishes. I would like some time to analyze it after the program
> ends; the page does not refresh itself, and when I hit F5 I lose all the
> info.
>
>
>
> Thanks,
>
> Saif


How to create DataFrame from a binary file?

2015-08-08 Thread unk1102
Hi, how do we create a DataFrame from a binary file stored in HDFS? I was
thinking of using

JavaPairRDD<String, PortableDataStream> pairRdd =
javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
JavaRDD<PortableDataStream> javardd = pairRdd.values();

I can see that PortableDataStream has a method called toArray which can
convert it into a byte array. I was wondering, if I have a JavaRDD<byte[]>,
can I call the following and get a DataFrame?

DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd, Byte.class);

Please guide me; I am new to Spark. I have my own custom binary format, and
I was thinking that if I can convert it into a DataFrame using binary
operations, then I don't need to create my own custom Hadoop input format.
Am I on the right track? Will reading binary data into a DataFrame scale?
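
For what it is worth, a Scala sketch of one way this could look, assuming
each binary file fits comfortably in memory (PortableDataStream.toArray
materializes the whole file) and with an illustrative path and column names:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BinaryType, StringType, StructField, StructType}

// One row per file: (file path, raw bytes)
val rows = sc.binaryFiles("/hdfs/path/to/binfile")
  .map { case (path, stream) => Row(path, stream.toArray()) }

val schema = StructType(Seq(
  StructField("path", StringType, nullable = false),
  StructField("bytes", BinaryType, nullable = false)))

val binDataFrame = sqlContext.createDataFrame(rows, schema)

Parsing the custom format into real columns would then be a further map over
the bytes column; whether it scales depends mostly on individual file sizes,
since binaryFiles does not split files.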






Spark SQL jobs and their partitions

2015-08-08 Thread Raghavendra Pandey
I have complex transformation requirements that I am implementing using
DataFrames. They involve a lot of joins, including with a Cassandra table.
I was wondering how I can debug the jobs and stages queued by Spark SQL the
way I can for RDDs.

In one case, Spark SQL created more than 17 lakh (1.7 million) tasks for
2 GB of data, even though I have set the SQL shuffle partitions to 32.
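
A sketch of the usual inspection hooks, assuming the DataFrame is named df:

sqlContext.setConf("spark.sql.shuffle.partitions", "32") // post-shuffle partition count
df.explain(true) // prints the logical and physical plans behind the queued stages

The resulting jobs and stages then appear in the driver web UI on port 4040,
the same as for RDD jobs.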

Raghav


Re: DataFrame column structure change

2015-08-08 Thread Raghavendra Pandey
You can use the struct function of the org.apache.spark.sql.functions object
to combine two columns into a single struct column.
Something like:
val nestedCol = struct(df("d"), df("e"))
df.select(df("a"), df("b"), df("c"), nestedCol)
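
To get the exact target schema, the struct column also needs a name; a
fuller sketch (assuming Spark 1.4+, where struct is available):

import org.apache.spark.sql.functions.struct

val newDF = df.select(df("a"), df("b"), df("c"),
  struct(df("d"), df("e")).as("newCol"))
newDF.printSchema()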
On Aug 7, 2015 3:14 PM, "Rishabh Bhardwaj"  wrote:

> I am doing it by creating a new data frame out of the fields to be nested
> and then joining with the original DF.
> Looking for a more optimized solution here.
>
> On Fri, Aug 7, 2015 at 2:06 PM, Rishabh Bhardwaj 
> wrote:
>
>> Hi all,
>>
>> I want to create some nested structure from the existing columns of
>> the dataframe.
>> For that, I am trying to transform a DF in the following way, but
>> couldn't do it.
>>
>> scala> df.printSchema
>> root
>>  |-- a: string (nullable = true)
>>  |-- b: string (nullable = true)
>>  |-- c: string (nullable = true)
>>  |-- d: string (nullable = true)
>>  |-- e: string (nullable = true)
>>  |-- f: string (nullable = true)
>>
>> *To*
>>
>> scala> newDF.printSchema
>> root
>>  |-- a: string (nullable = true)
>>  |-- b: string (nullable = true)
>>  |-- c: string (nullable = true)
>>  |-- newCol: struct (nullable = true)
>>  ||-- d: string (nullable = true)
>>  ||-- e: string (nullable = true)
>>
>>
>> help me.
>>
>> Regards,
>> Rishabh.
>>
>
>


Re: Spark on YARN

2015-08-08 Thread Shushant Arora
Which scheduler does your cluster use? Check the scheduler tab on the RM UI
and look at your user and that user's max vcore limit. If the user's other
applications have already occupied the user's max vcores, that could be the
reason no vcores are being allocated to this user, while the same
application runs fine for another user whose max vcore limit has not been
reached.
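
If the cluster runs the Capacity Scheduler, the per-user caps live in
capacity-scheduler.xml; a sketch (the queue name "default" is illustrative):

<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.minimum-user-limit-percent</name>
  <value>100</value>
</property>

Raising user-limit-factor above 1 lets a single user exceed the queue's
configured capacity when resources are free.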

On Sat, Aug 8, 2015 at 10:07 PM, Jem Tucker  wrote:

> Hi dustin,
>
> Yes there are enough resources available, the same application run with a
> different user works fine so I think it is something to do with permissions
> but I can't work out where.
>
> Thanks ,
>
> Jem
>
> On Sat, 8 Aug 2015 at 17:35, Dustin Cote  wrote:
>
>> Hi Jem,
>>
>> In the top of the RM web UI, do you see any available resources to spawn
>> the application master container?
>>
>>
>> On Sat, Aug 8, 2015 at 4:37 AM, Jem Tucker  wrote:
>>
>>> Hi Sandy,
>>>
>>> The application doesn't fail, it gets accepted by yarn but the
>>> application master never starts and the application state never changes to
>>> running. I have checked in the resource manager and node manager logs and
>>> nothing jumps out.
>>>
>>> Thanks
>>>
>>> Jem
>>
>>
>>> On Sat, 8 Aug 2015 at 09:20, Sandy Ryza  wrote:
>>>
 Hi Jem,

 Do they fail with any particular exception?  Does YARN just never end
 up giving them resources?  Does an application master start?  If so, what
 are in its logs?  If not, anything suspicious in the YARN ResourceManager
 logs?

 -Sandy

 On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker 
 wrote:

> Hi,
>
> I am running spark on YARN on the CDH5.3.2 stack. I have created a new
> user to own and run a testing environment, however when using this user
> applications I submit to yarn never begin to run, even if they are the
> exact same application that is successful with another user?
>
> Has anyone seen anything like this before?
>
> Thanks,
>
> Jem
>



Re: java.lang.ClassNotFoundException

2015-08-08 Thread Yasemin Kaya
Thanks Ted, I solved it :)

2015-08-08 14:07 GMT+03:00 Ted Yu :

> Have you tried including package name in the class name ?
>
> Thanks
>
>
>
> On Aug 8, 2015, at 12:00 AM, Yasemin Kaya  wrote:
>
> Hi,
>
> I have a little spark program and i am getting an error why i dont
> understand.
> My code is https://gist.github.com/yaseminn/522a75b863ad78934bc3.
> I am using spark 1.3
> Submitting : bin/spark-submit --class MonthlyAverage --master local[4]
> weather.jar
>
>
> error:
>
> ~/spark-1.3.1-bin-hadoop2.4$ bin/spark-submit --class MonthlyAverage
> --master local[4] weather.jar
> java.lang.ClassNotFoundException: MonthlyAverage
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
>
>
> Please help me Asap..
>
> yasemin
> --
> hiç ender hiç
>
>


-- 
hiç ender hiç


Re: Spark on YARN

2015-08-08 Thread Jem Tucker
Hi Dustin,

Yes, there are enough resources available. The same application run with a
different user works fine, so I think it is something to do with
permissions, but I can't work out where.

Thanks ,

Jem
On Sat, 8 Aug 2015 at 17:35, Dustin Cote  wrote:

> Hi Jem,
>
> In the top of the RM web UI, do you see any available resources to spawn
> the application master container?
>
>
> On Sat, Aug 8, 2015 at 4:37 AM, Jem Tucker  wrote:
>
>> Hi Sandy,
>>
>> The application doesn't fail, it gets accepted by yarn but the
>> application master never starts and the application state never changes to
>> running. I have checked in the resource manager and node manager logs and
>> nothing jumps out.
>>
>> Thanks
>>
>> Jem
>
>
>> On Sat, 8 Aug 2015 at 09:20, Sandy Ryza  wrote:
>>
>>> Hi Jem,
>>>
>>> Do they fail with any particular exception?  Does YARN just never end up
>>> giving them resources?  Does an application master start?  If so, what are
>>> in its logs?  If not, anything suspicious in the YARN ResourceManager logs?
>>>
>>> -Sandy
>>>
>>> On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker  wrote:
>>>
 Hi,

 I am running spark on YARN on the CDH5.3.2 stack. I have created a new
 user to own and run a testing environment, however when using this user
 applications I submit to yarn never begin to run, even if they are the
 exact same application that is successful with another user?

 Has anyone seen anything like this before?

 Thanks,

 Jem



Re: java.lang.ClassNotFoundException

2015-08-08 Thread Ted Yu
Have you tried including the package name in the class name?
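
For example (a sketch; the package name is illustrative), if MonthlyAverage
is declared in package com.example.weather, the class must be fully
qualified on the command line:

bin/spark-submit --class com.example.weather.MonthlyAverage --master local[4] weather.jar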

Thanks



> On Aug 8, 2015, at 12:00 AM, Yasemin Kaya  wrote:
> 
> Hi,
> 
> I have a little spark program and i am getting an error why i dont 
> understand. 
> My code is https://gist.github.com/yaseminn/522a75b863ad78934bc3.
> I am using spark 1.3 
> Submitting : bin/spark-submit --class MonthlyAverage --master local[4] 
> weather.jar
> 
> 
> error: 
> 
> ~/spark-1.3.1-bin-hadoop2.4$ bin/spark-submit --class MonthlyAverage --master 
> local[4] weather.jar
> java.lang.ClassNotFoundException: MonthlyAverage
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:274)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 
> 
> Please help me Asap..
> 
> yasemin
> -- 
> hiç ender hiç


Pagination on big table, splitting joins

2015-08-08 Thread Gaspar Muñoz
Hi,

I have two different parts in my system.

1. A batch application that every X minutes runs SQL queries joining several
tables containing millions of rows to compose an entity, and sends those
entities to Kafka.
2. A streaming application that processes the data from Kafka.

Now I have the entire system working, but I want to improve the performance
of the batch part, because if I have 100 million entities I send them all to
Kafka in one foreach, back to back, which makes no sense for the downstream
streaming application. I want to send 10 million events to Kafka at a time,
for example.

I have a query, for example:

*select ... from table 1 left outer join table 2 on ... left outer join
table 3 on ... left outer join table 4 on ...*

My goal is to do *pagination* on table 1: take 10 million rows into a
separate RDD, do the joins and send them to Kafka, then take the next 10
million and do the same. I have all tables in Parquet format in HDFS.

I am thinking of using the *toLocalIterator* method, something like the code
below, but I have doubts about memory and parallelism, and surely there is a
better way to do it.

rdd.toLocalIterator.grouped(1000).foreach { seq =>
  val batch: RDD[(String, Int)] = sc.parallelize(seq)
  // do the joins and send this batch to Kafka
}
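
An alternative sketch that keeps the work distributed instead of pulling
rows to the driver (assuming Spark 1.4+ for DataFrame.randomSplit; df1/df2
and the join key "id" are illustrative):

import org.apache.spark.sql.DataFrame

def sendToKafka(batch: DataFrame): Unit =
  batch.foreachPartition { rows =>
    // open one Kafka producer per partition, send the rows, close the producer
  }

// Split table 1 into roughly ten equal slices, then join and ship each in turn.
val slices: Array[DataFrame] = df1.randomSplit(Array.fill(10)(1.0))
slices.foreach { slice =>
  sendToKafka(slice.join(df2, slice("id") === df2("id"), "left_outer"))
}

Each slice stays distributed, so the joins run in parallel and only one
slice's worth of output reaches Kafka at a time.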

What do you think?

Regards.

-- 

Gaspar Muñoz
@gmunozsoria



Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd *


Re: Spark on YARN

2015-08-08 Thread Jem Tucker
Hi Sandy,

The application doesn't fail: it gets accepted by YARN, but the application
master never starts and the application state never changes to RUNNING. I
have checked the resource manager and node manager logs and nothing jumps
out.

Thanks

Jem
On Sat, 8 Aug 2015 at 09:20, Sandy Ryza  wrote:

> Hi Jem,
>
> Do they fail with any particular exception?  Does YARN just never end up
> giving them resources?  Does an application master start?  If so, what are
> in its logs?  If not, anything suspicious in the YARN ResourceManager logs?
>
> -Sandy
>
> On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker  wrote:
>
>> Hi,
>>
>> I am running spark on YARN on the CDH5.3.2 stack. I have created a new
>> user to own and run a testing environment, however when using this user
>> applications I submit to yarn never begin to run, even if they are the
>> exact same application that is successful with another user?
>>
>> Has anyone seen anything like this before?
>>
>> Thanks,
>>
>> Jem
>>
>
>


Re: Spark on YARN

2015-08-08 Thread Sandy Ryza
Hi Jem,

Do they fail with any particular exception?  Does YARN just never end up
giving them resources?  Does an application master start?  If so, what are
in its logs?  If not, anything suspicious in the YARN ResourceManager logs?

-Sandy

On Fri, Aug 7, 2015 at 1:48 AM, Jem Tucker  wrote:

> Hi,
>
> I am running spark on YARN on the CDH5.3.2 stack. I have created a new
> user to own and run a testing environment, however when using this user
> applications I submit to yarn never begin to run, even if they are the
> exact same application that is successful with another user?
>
> Has anyone seen anything like this before?
>
> Thanks,
>
> Jem
>


java.lang.ClassNotFoundException

2015-08-08 Thread Yasemin Kaya
Hi,

I have a little Spark program and I am getting an error that I don't
understand.
My code is https://gist.github.com/yaseminn/522a75b863ad78934bc3.
I am using Spark 1.3.
Submitting: bin/spark-submit --class MonthlyAverage --master local[4]
weather.jar


error:

~/spark-1.3.1-bin-hadoop2.4$ bin/spark-submit --class MonthlyAverage
--master local[4] weather.jar
java.lang.ClassNotFoundException: MonthlyAverage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties


Please help me ASAP.

yasemin
-- 
hiç ender hiç