Re: How to control the number of files for dynamic partition in Spark SQL?

2016-01-30 Thread Deenar Toraskar
The following should work as long as your tables are created using Spark SQL

event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event")

If you want to stick to using "insert overwrite" for Hive compatibility,
then you can repartition the DataFrame explicitly to 2 partitions, instead of
setting the global spark.sql.shuffle.partitions parameter:

val eventwk = sqlContext.sql("some joins") // this uses the global spark.sql.shuffle.partitions setting
val eventwkRepartitioned = eventwk.repartition(2)
eventwkRepartitioned.registerTempTable("event_wk_repartitioned")
and use this temp table in your insert statement.
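
For example, a minimal sketch that reuses the insert overwrite statement from the original question below, with the repartitioned temp table substituted in:

sqlContext.sql("""
  |insert overwrite table event
  |partition(eventDate)
  |select user, detail, eventDate
  |from event_wk_repartitioned
""".stripMargin)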

Registering a temp table is cheap.

HTH


On 29 January 2016 at 20:26, Benyi Wang  wrote:

> I want to insert into a partition table using dynamic partition, but I
> don’t want to have 200 files for a partition because the files will be
> small for my case.
>
> sqlContext.sql(  """
> |insert overwrite table event
> |partition(eventDate)
> |select
> | user,
> | detail,
> | eventDate
> |from event_wk
>   """.stripMargin)
>
> The table “event_wk” is created, via registerTempTable, from a DataFrame that
> is built with some joins. If I set spark.sql.shuffle.partitions=2, the join’s
> performance will be bad because that property seems to be global.
>
> I can do something like this:
>
> event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").save(path)
>
> but I have to handle adding partitions by myself.
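>
> (For illustration, assuming a Hive-compatible table named "event" partitioned by
> eventDate, the missing partitions could be registered afterwards with Hive DDL
> along these lines; the partition value is a placeholder.)
>
> sqlContext.sql("ALTER TABLE event ADD IF NOT EXISTS PARTITION (eventDate='2016-01-30')")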
>
> Is there a way you can control the number of files just for this last
> insert step?
>


Product similarity with TF/IDF and Cosine similarity (DIMSUM)

2016-01-30 Thread Alan Prando
Hi Folks!

I am trying to implement a Spark job to calculate the similarity of my database
products, using only their names and descriptions.
I would like to use TF-IDF to represent my text data and cosine similarity to
calculate all similarities.

My goal is, after the job completes, to get all similarities as a list.
For example:
Prod1 = ((Prod2, 0.98), (Prod3, 0.88))
Prod2 = ((Prod1, 0.98), (Prod4, 0.53))
Prod3 = ((Prod1, 0.98))
Prod4 = ((Prod1, 0.53))

However, I am new to Spark and I am having trouble understanding what the
cosine similarity computation returns!

My code:
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

// One document per line: split each product's name + description into terms.
val documents: RDD[Seq[String]] = sc.textFile(filename).map(_.split(" ").toSeq)

// Term frequencies via the hashing trick.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()

// Rescale by inverse document frequency (ignoring terms seen in fewer than 2 docs).
val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

val mat = new RowMatrix(tfidf)

// Compute similar columns perfectly, with brute force.
val exact = mat.columnSimilarities()

// Compute similar columns with estimation using DIMSUM.
val approx = mat.columnSimilarities(0.1)

val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }

The file just contains a product's name and description on each row.

The result I got:
approxEntries.first()
res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)
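
For reference, a minimal sketch (using only the RowMatrix and entries RDD defined above) that prints the matrix dimensions next to a few entries; columnSimilarities pairs columns of mat, so i and j range over the hashed term positions rather than the rows:

println(s"rows (documents): ${mat.numRows()}, columns (hashed terms): ${mat.numCols()}")
approxEntries.take(3).foreach { case ((i, j), v) =>
  println(s"columns $i and $j have cosine similarity $v") // i, j < mat.numCols()
}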

How can I figure out which rows (products) this result refers to?

Thanks in advance! =]






deep learning with heterogeneous cloud computing using spark

2016-01-30 Thread Abid Malik
Dear all;


Is there any work in this area?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/deep-learning-with-heterogeneous-cloud-computing-using-spark-tp26109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: can't kill spark job in supervise mode

2016-01-30 Thread Tim Chen
Hi Duc,

Are you running Spark on Mesos in cluster mode? What does your cluster-mode
submission look like, and which version of Spark are you running?

Tim

On Sat, Jan 30, 2016 at 8:19 AM, PhuDuc Nguyen wrote:

> I have a spark job running on Mesos in multi-master and supervise mode. If
> I kill it, it is resilient as expected and respawns on another node.
> However, I cannot kill it when I need to. I have tried 2 methods:
>
> 1) ./bin/spark-class org.apache.spark.deploy.Client kill
>  
>
> 2) ./bin/spark-submit --master mesos:// --kill 
>
> Method 2 accepts the kill request, but the job is respawned on another node.
> Ultimately, I can't get either method to kill the job. I suspect I have the
> wrong port for the master URL during the kill request for method 1. I've
> tried every combination of IP and port I can think of; is there one I am
> missing?
>
> Ports I've tried:
> 5050 = mesos UI
> 8080 = marathon
> 7077 = spark dispatcher
> 8081 = spark drivers UI
> 4040 = spark job UI
>
> thanks,
> Duc
>


Re: deep learning with heterogeneous cloud computing using spark

2016-01-30 Thread Christopher Nguyen
Thanks Nick :)


Abid, you may also want to check out
http://conferences.oreilly.com/strata/big-data-conference-ny-2015/public/schedule/detail/43484,
which describes our work on a combination of Spark and Tachyon for Deep
Learning. We found significant gains in using Tachyon (with co-processing)
for the "descent" step while Spark computes the gradients. The video was
recently uploaded here http://bit.ly/1JnvQAO.


Regards,
-- 

Algorithms of the Mind: http://bit.ly/1ReQvEW

Christopher Nguyen
CEO & Co-Founder
www.Arimo.com (née Adatao)
linkedin.com/in/ctnguyen


Re: can't kill spark job in supervise mode

2016-01-30 Thread PhuDuc Nguyen
Hi Tim,

Yes, we are running Spark on Mesos in cluster mode with the supervise flag.
Submit script looks like this:

spark-submit \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+UseCompressedOops
-XX:-UseGCOverheadLimit" \
--supervise \
--deploy-mode cluster \
--class  \
--master mesos://:7077 

Mesos version = 0.26.0
Spark version = 1.5.2


thanks,
Duc

On Sat, Jan 30, 2016 at 9:48 AM, Tim Chen  wrote:

> Hi Duc,
>
> Are you running Spark on Mesos in cluster mode? What does your cluster-mode
> submission look like, and which version of Spark are you running?
>
> Tim
>
> On Sat, Jan 30, 2016 at 8:19 AM, PhuDuc Nguyen wrote:
>
>> I have a spark job running on Mesos in multi-master and supervise mode.
>> If I kill it, it is resilient as expected and respawns on another node.
>> However, I cannot kill it when I need to. I have tried 2 methods:
>>
>> 1) ./bin/spark-class org.apache.spark.deploy.Client kill
>>  
>>
>> 2) ./bin/spark-submit --master mesos:// --kill 
>>
>> Method 2 accepts the kill request, but the job is respawned on another node.
>> Ultimately, I can't get either method to kill the job. I suspect I have the
>> wrong port for the master URL during the kill request for method 1. I've
>> tried every combination of IP and port I can think of; is there one I am
>> missing?
>>
>> Ports I've tried:
>> 5050 = mesos UI
>> 8080 = marathon
>> 7077 = spark dispatcher
>> 8081 = spark drivers UI
>> 4040 = spark job UI
>>
>> thanks,
>> Duc
>>
>
>


Re: Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-30 Thread Nirav Patel
Thanks Ted. In my application jar there were no Spark 1.3.1 artifacts.
Anyhow, I got it working via the Oozie Spark action.

On Thu, Jan 28, 2016 at 7:42 PM, Ted Yu  wrote:

> Looks like '--properties-file' is no longer supported.
>
> Was it possible that a Spark 1.3.1 artifact / dependency leaked into your
> app?
>
> Cheers
>
> On Thu, Jan 28, 2016 at 7:36 PM, Nirav Patel wrote:
>
>> Hi, we were using Spark 1.3.1 and launching our Spark jobs in yarn-client
>> mode programmatically, by creating the SparkConf and SparkContext objects
>> manually. It was inspired by the Spark self-contained application example
>> here:
>>
>> https://spark.apache.org/docs/1.5.2/quick-start.html#self-contained-applications
>>
>> The only additional configuration we would provide was all YARN-related:
>> executor instances, cores, memory, extraJavaOptions, etc.
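>>
>> For illustration, a minimal sketch of this setup (the app name and the config
>> values below are placeholders):
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>>
>> val sparkConf = new SparkConf()
>>   .setAppName("MyApp")                      // placeholder name
>>   .setMaster("yarn-client")
>>   .set("spark.executor.instances", "4")     // the YARN-related settings mentioned above
>>   .set("spark.executor.cores", "2")
>>   .set("spark.executor.memory", "4g")
>>   .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
>>
>> val sparkContext = new SparkContext(sparkConf)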
>>
>> However, after upgrading to Spark 1.5.2, the above application breaks on the
>> line `val sparkContext = new SparkContext(sparkConf)`:
>>
>> 16/01/28 17:38:35 ERROR util.Utils: Uncaught exception in thread main
>>
>> java.lang.NullPointerException
>>
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
>>
>> at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
>>
>> at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
>>
>> at
>> org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
>>
>> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
>>
>> at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
>>
>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:593)
>>
>>
>> *In yarn container logs I see following:*
>>
>> 16/01/28 17:38:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
>>
>> Unknown/unsupported param List(--properties-file,
>> /tmp/hadoop-xactly/nm-local-dir/usercache/xactly/appcache/application_1453752281504_3427/container_1453752281504_3427_01_02/__spark_conf__/__spark_conf__.properties)
>>
>> Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
>> Options:
>>   --jar JAR_PATH   Path to your application's JAR file
>>   --class CLASS_NAME   Name of your application's main class
>>   --primary-py-file    A main Python file
>>   --py-files PY_FILES  Comma-separated list of .zip, .egg, or .py files to
>>place on the PYTHONPATH for Python apps.
>>   --args ARGS  Arguments to be passed to your application's main 
>> class.
>>Multiple invocations are possible, each will be 
>> passed in order.
>>   --num-executors NUMNumber of executors to start (Default: 2)
>>   --executor-cores NUM   Number of cores for the executors (Default: 1)
>>   --executor-memory MEM  Memory per executor (e.g. 1000M, 2G) (Default: 1G)
>>
>>
>>
>> So is this approach still supposed to work? Or must I use the SparkLauncher
>> class with Spark 1.5.2?
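>>
>> (For reference, an illustrative sketch of what the SparkLauncher route would
>> look like; the jar path, main class and memory setting are placeholders:)
>>
>> import org.apache.spark.launcher.SparkLauncher
>>
>> val process = new SparkLauncher()
>>   .setAppResource("/path/to/your-app.jar")   // placeholder
>>   .setMainClass("com.example.YourApp")       // placeholder
>>   .setMaster("yarn-client")
>>   .setConf("spark.executor.memory", "4g")    // placeholder
>>   .launch()
>> process.waitFor()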
>>
>> Thanks
>>
>> Nirav
>>
>>
>>
>>
>
>
>




Re: deep learning with heterogeneous cloud computing using spark

2016-01-30 Thread Nick Pentreath
Spark ML offers a multi-layer perceptron and has some machinery in place that 
will support development of further deep-learning models.
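
A minimal usage sketch of that estimator (the spark.ml API; `training` is an assumed DataFrame with "label" and "features" columns, and the layer sizes are illustrative):

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// layers: input size, two hidden layers, number of output classes
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(10, 8, 8, 3))
  .setMaxIter(100)

val model = mlp.fit(training)
val predictions = model.transform(training)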

There is also deeplearning4j, and some work on distributed TensorFlow on Spark
(https://spark-summit.org/east-2016/events/distributed-tensor-flow-on-spark-scaling-googles-deep-learning-library/)
as well as a few Caffe-on-Spark projects.

Sent from my iPhone

> On 30 Jan 2016, at 19:20, Abid Malik  wrote:
> 
> Dear all;
> 
> 
> Is there any work in this area?
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/deep-learning-with-heterogeneous-cloud-computing-using-spark-tp26109.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 


can't kill spark job in supervise mode

2016-01-30 Thread PhuDuc Nguyen
I have a spark job running on Mesos in multi-master and supervise mode. If
I kill it, it is resilient as expected and respawns on another node.
However, I cannot kill it when I need to. I have tried 2 methods:

1) ./bin/spark-class org.apache.spark.deploy.Client kill 


2) ./bin/spark-submit --master mesos:// --kill 

Method 2 accepts the kill request, but the job is respawned on another node.
Ultimately, I can't get either method to kill the job. I suspect I have the
wrong port for the master URL during the kill request for method 1. I've
tried every combination of IP and port I can think of; is there one I am
missing?

Ports I've tried:
5050 = mesos UI
8080 = marathon
7077 = spark dispatcher
8081 = spark drivers UI
4040 = spark job UI

thanks,
Duc