Re: Could any one please tell me why this takes forever to finish?

2017-05-01 Thread Yan Facai
Hi,
10.x.x.x is a private network address, see https://en.wikipedia.org/wiki/IP_address.
You should use the public IP of your AWS instance.
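
For example, a minimal sketch (the hostname below is a placeholder for your instance's
public DNS, and the security-group note is an assumption worth checking, not something
confirmed in this thread):

import org.apache.spark.SparkConf

// Placeholder public endpoint: substitute your master's public DNS or elastic IP.
// Also make sure the master port (7077) and the driver/executor callback ports are
// open in the AWS security group; otherwise the "Initial job has not accepted any
// resources" warning can keep repeating while the job never runs.
val sparkConf = new SparkConf()
  .setAppName("Spark Pi")
  .setMaster("spark://ec2-xx-xx-xx-xx.compute.amazonaws.com:7077")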

On Sat, Apr 29, 2017 at 6:35 AM, Yuan Fang 
wrote:

>
> object SparkPi {
>   private val logger = Logger(this.getClass)
>
>   val sparkConf = new SparkConf()
> .setAppName("Spark Pi")
> .setMaster("spark://10.100.103.192:7077")
>
>   lazy val sc = new SparkContext(sparkConf)
>   sc.addJar("/Users/yfang/workspace/mcs/target/scala-2.11/root-assembly-0.1.0.jar")
>
>   def main(args: Array[String]) {
> val x = (1 to 4)
> val a = sc.parallelize(x)
> val mean = a.mean()
> print(mean)
>   }
> }
>
>
> spark://10.100.103.192:7077 is a remote standalone cluster I created on
> AWS.
> I ran it locally from IntelliJ.
> I can see the job is submitted, but the calculation never finishes.
>
> The log shows:
> 15:34:21.674 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl
> - Initial job has not accepted any resources; check your cluster UI to
> ensure that workers are registered and have sufficient resources
>
> Any help will be highly appreciated!
>
> Thanks!
>
> Yuan
>
>


Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-01 Thread Yanbo Liang
Hi Tim,

The Spark ML API doesn't currently support setting an initial model for GMM. I hope
we can get this feature into Spark 2.3.

Thanks
Yanbo

On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith  wrote:

> Hi,
>
> I am trying to figure out the API to initialize a Gaussian mixture model
> using either centroids created by K-means or a previously calculated GMM
> model (I am aware that you can "save" a model and "load" it later, but I am
> not interested in saving a model to a filesystem).
>
> The Spark MLlib API lets you do this using setInitialModel:
> https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture
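
For reference, a minimal sketch of that MLlib route (RDD-based API; seeding from k-means
with identity covariances and uniform weights is an illustrative choice, not prescribed
by the docs, and `data` is assumed to be an existing RDD[Vector] of features):

import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel, KMeans}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

val k = 3
val kmeansModel = KMeans.train(data, k, 20)      // centroids used to seed the GMM
val d = kmeansModel.clusterCenters.head.size
val gaussians = kmeansModel.clusterCenters.map { c =>
  new MultivariateGaussian(c, Matrices.eye(d))   // identity covariance as a starting point
}
val initial = new GaussianMixtureModel(Array.fill(k)(1.0 / k), gaussians)

val gmm = new GaussianMixture()
  .setK(k)
  .setInitialModel(initial)                      // the MLlib-only entry point
  .run(data)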
>
> However, I cannot figure out how to do this using Spark ML API. Can anyone
> please point me in the right direction? I've tried reading the Spark ML
> code and was wondering if the "set" call lets you do that?
>
> --
> Thanks,
>
> Tim
>


RE: Spark-SQL Query Optimization: overlapping ranges

2017-05-01 Thread Lavelle, Shawn
Jacek,

   Thanks for your help.  I didn't want to file a bug/enhancement request unless it was
warranted.

~ Shawn

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Thursday, April 27, 2017 8:39 AM
To: Lavelle, Shawn 
Cc: user 
Subject: Re: Spark-SQL Query Optimization: overlapping ranges

Hi Shawn,

If you're asking me whether Spark SQL should optimize such queries, I don't know.

If you're asking me whether it's possible to convince Spark SQL to do so, I'd say:
sure, it is. Write your own optimization rule and attach it to the Optimizer (using the
extraOptimizations extension point).
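
For example, a minimal sketch of wiring a rule in through that extension point (the rule
body here is a placeholder; the actual range-merging logic is left to you):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: a real implementation would pattern-match on Filter conditions
// and merge overlapping BETWEEN ranges into a single predicate.
object MergeOverlappingRanges extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p => p
  }
}

val spark = SparkSession.builder.getOrCreate()
spark.experimental.extraOptimizations ++= Seq(MergeOverlappingRanges)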


Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Thu, Apr 27, 2017 at 3:22 PM, Lavelle, Shawn wrote:
Hi Jacek,

 I know that it is not currently doing so, but should it be?  The algorithm 
isn’t complicated and could be applied to both OR and AND logical operators 
with comparison operators as children.
 My users write programs to generate queries that aren’t checked for this 
sort of thing. We’re probably going to write our own 
org.apache.spark.sql.catalyst.rules.Rule to handle it.

~ Shawn

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Wednesday, April 26, 2017 2:55 AM
To: Lavelle, Shawn
Cc: user
Subject: Re: Spark-SQL Query Optimization: overlapping ranges

explain it and you'll know what happens under the covers.

i.e. Use explain on the Dataset.
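
For instance, something like this (a sketch; the table and column names are placeholders
borrowed from the example further down in the thread):

// Assumes a table registered as my_table; inspect the optimized logical plan to see
// whether the two BETWEEN predicates were merged.
val q = spark.sql(
  "SELECT * FROM my_table WHERE col BETWEEN 3 AND 7 OR col BETWEEN 5 AND 9")
q.explain(true)   // prints the parsed, analyzed, optimized and physical plans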

Jacek

On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" wrote:
Hello Spark Users!

   Does the Spark Optimization engine reduce overlapping column ranges?  If so, 
should it push this down to a Data Source?

  Example,
This:  Select * from table where col between 3 and 7 OR col between 5 and 9
Reduces to:  Select * from table where col between 3 and 9


  Thanks for your insight!

~ Shawn M Lavelle



Shawn Lavelle
Software Development

4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Fax: 763 551 0750
Email: shawn.lave...@osii.com
Website: www.osii.com





Loading postgresql table to spark SyntaxError

2017-05-01 Thread Saulo Ricci
Hi,

the following code reads a table from my PostgreSQL database, following directions
I've found online:

val txs = spark.read.format("jdbc").options(Map(
  "driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://host/dbname",
  "dbtable" -> "(select field1 from t) as table1",
  "user" -> "username",
  "password" -> "password")).load()

I'm running this in a Jupyter notebook, but I only get a SyntaxError message with no
clue about the stack trace.

I assume the query is right, as I've seen similar queries working for other people.

Any help understanding how to fix this issue is appreciated.

-- 
Saulo


Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
Oh, and if you want a default other than null:

import org.apache.spark.sql.functions._
df.withColumn("address", coalesce($"address", lit())

On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust 
wrote:

> The following should work:
>
> val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
> spark.read.schema(schema).parquet("data.parquet").as[Course]
>
> Note this will only work for nullable fields (i.e. if you add a primitive
> like Int you need to make it an Option[Int])
>
> On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler <
> rotationsymmetr...@gmail.com> wrote:
>
>> Hi Spark Users,
>>
>> Suppose I have some data (stored in parquet for example) generated as
>> below:
>>
>> package com.company.entity.old
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String)
>>
>> Then usually I can access the data by
>>
>> spark.read.parquet("data.parquet").as[Course]
>>
>> Now I want to add a new field `address` to Student:
>>
>> package com.company.entity.new
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String, address: String)
>>
>> Then obviously running `spark.read.parquet("data.parquet").as[Course]`
>> on data generated by the old entity/schema will fail because `address`
>> is missing.
>>
>> In this case, what is the best practice to read data generated with
>> the old entity/schema to the new entity/schema, with the missing field
>> set to some default value? I know I can manually write a function to
>> do the transformation from the old to the new. But it is kind of
>> tedious. Any automatic methods?
>>
>> Thanks,
>>
>> Mike
>>
>>
>>
>


Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
The following should work:

val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
spark.read.schema(schema).parquet("data.parquet").as[Course]

Note this will only work for nullable fields (i.e. if you add a primitive
like Int you need to make it an Option[Int])
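
To make that concrete, a sketch with a hypothetical primitive field (`year` is
illustrative, not from the original question):

import spark.implicits._   // spark: the active SparkSession

// Old parquet files have no `year` column, so the added field must be nullable;
// a primitive Int can't hold null in a Dataset, hence Option[Int].
case class Student(name: String, address: String, year: Option[Int])
case class Course(id: Int, students: List[Student])

val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
val courses = spark.read.schema(schema).parquet("data.parquet").as[Course]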

On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler 
wrote:

> Hi Spark Users,
>
> Suppose I have some data (stored in parquet for example) generated as
> below:
>
> package com.company.entity.old
> case class Course(id: Int, students: List[Student])
> case class Student(name: String)
>
> Then usually I can access the data by
>
> spark.read.parquet("data.parquet").as[Course]
>
> Now I want to add a new field `address` to Student:
>
> package com.company.entity.new
> case class Course(id: Int, students: List[Student])
> case class Student(name: String, address: String)
>
> Then obviously running `spark.read.parquet("data.parquet").as[Course]`
> on data generated by the old entity/schema will fail because `address`
> is missing.
>
> In this case, what is the best practice to read data generated with
> the old entity/schema to the new entity/schema, with the missing field
> set to some default value? I know I can manually write a function to
> do the transformation from the old to the new. But it is kind of
> tedious. Any automatic methods?
>
> Thanks,
>
> Mike
>
>
>


Re: Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread vincent gromakowski
Use cache or persist. The DataFrame will be materialized when the first
action is called and then reused from memory for each subsequent use.
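
For example, a sketch built on the code quoted below (connection details are the
placeholders from the question):

// cache() pins the materialized rows, so later actions reuse the same snapshot
// instead of re-running the JDBC read against a table that keeps growing.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql:host/database")
  .option("dbtable", "tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
  .cache()

val n1 = df.count()   // first action: runs the JDBC read and fills the cache
val n2 = df.count()   // served from the cached data, so it matches n1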

On 1 May 2017 at 4:51 PM, "Saulo Ricci"  wrote:

> Hi,
>
>
> I have the following code that reads a table into an Apache Spark
> DataFrame:
>
>  val df = spark.read.format("jdbc")
>  .option("url","jdbc:postgresql:host/database")
>  .option("dbtable","tablename").option("user","username")
>  .option("password", "password")
>  .load()
>
> When I first invoke df.count() I get a smaller number than the next time
> I invoke the same count method.
>
> Why does this happen?
>
> Doesn't Spark load a snapshot of my table into a DataFrame on my Spark
> cluster when I first read that table?
>
> My table on Postgres keeps being fed, and it seems my DataFrame is
> reflecting this behavior.
>
> How can I load just a static snapshot of my table into Spark's
> DataFrame, as of the time the `read` method was invoked?
>
>
> Any help is appreciated,
>
> --
> Saulo
>


Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread Saulo Ricci
Hi,


I have the following code that reads a table into an Apache Spark
DataFrame:

 val df = spark.read.format("jdbc")
 .option("url","jdbc:postgresql:host/database")
 .option("dbtable","tablename").option("user","username")
 .option("password", "password")
 .load()

When I first invoke df.count() I get a smaller number than the next time I
invoke the same count method.

Why does this happen?

Doesn't Spark load a snapshot of my table into a DataFrame on my Spark
cluster when I first read that table?

My table on Postgres keeps being fed, and it seems my DataFrame is
reflecting this behavior.

How can I load just a static snapshot of my table into Spark's
DataFrame, as of the time the `read` method was invoked?


Any help is appreciated,

-- 
Saulo


Re: removing columns from file

2017-05-01 Thread Steve Loughran

On 28 Apr 2017, at 16:10, Anubhav Agarwal wrote:

Are you using Spark's textFiles method? If so, go through this blog :-
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219


old/dated blog post.

If you get the Hadoop 2.8 binaries on your classpath, s3a does a full directory 
tree listing if you give it a simple path like "s3a://bucket/events". The 
example in that post was using a complex wildcard which hasn't yet been speeded 
up as it's pretty hard to do it in a way which works effectively everywhere.

Having all your data in 1 dir works nicely.


Anubhav

On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia wrote:
Hi there,

I have a process that downloads thousands of files from an S3 bucket, removes a
set of columns from each, and uploads them back to S3.

S3 is currently not the bottleneck; having a single-master-node Spark instance is the
bottleneck. One approach is to distribute the files on multiple Spark Master Node
workers, which would make it faster.

yes, >1 worker, and if the work can be partitioned


Question:

1.   Is there a way to utilize master/slave nodes in Spark to distribute this
downloading and processing of files, so it can, say, do 10 files at a time?


yes, they are called RDDs/Dataframes & Datasets


If you are doing all the processing on the Spark driver, then you aren't really
using Spark much; you're more just processing the files in Scala.

To get a dataframe:

val df = spark.read.format("csv").load("s3a://bucket/data")   // spark: the active SparkSession

You now have a dataset on all files in the directory /data in the bucket, which 
will be partitioned how spark decides (which depends on: # of workers, 
compression format used and its splittability). Assuming you can configure the 
dataframe with the column structure, you can filter aggressively by selecting 
only those columns you want

val filteredDf = df.select("rental", "start_time")
filteredDf.write.save("hdfs://final/processed")

then, once you've got all the data done, copy them up to S3 via distcp

I'd recommend you start doing this with a small number of files locally, 
getting the code working, then see if you can use it with s3 as the source/dest 
of data, again, locally if you want (it's just slow), then move to in-EC2 for 
the bandwidth.

Bandwidth-wise, there are some pretty major performance issues with the s3n
connector. S3A in Hadoop 2.7+ works, with Hadoop 2.8 having a lot more
speedups, especially when using ORC and Parquet as a source, where there's a
special "random access mode".

Further reading:
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-spark/index.html

https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-performance/index.html


2.   Is there a way to scale workers with Spark downloading and processing
files, even if they are all on a single master node?



I think there may be some terminology confusion here. You are going to have to
have one process which is the Spark driver: either on your client machine, deployed
somewhere in the cluster via YARN/Mesos, or running in a static location within a
Spark standalone cluster. Everything other than the driver process is a worker, which
will do the work.