MesosClusterDispatcher problem : Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2016-06-27 Thread 杜宇軒
Hi all

I run Spark on a Mesos cluster and have hit a problem: when I submit 6 Spark
drivers *at the same time*, node3:8081 shows that there are 4 drivers under
"Launched Drivers" and 2 under "Queued Drivers". On mesos:5050 I can see 4
active tasks running, but each task logs the warning: "Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources". At this point I see that all Mesos
CPUs are used on node1:5050, and the tasks run forever until I kill one of
them.

My question is: should I control the cores of each Spark driver myself, or
can the MesosClusterDispatcher manage driver cores for me?

Could someone help me?

Thank you

spark-submit script :
bin/spark-submit --name org.apache.spark.examples.SparkPi --deploy-mode
cluster --supervise --master mesos://node3:7077 --driver-cores 1.0
--driver-memory 1024M --class org.apache.spark.examples.SparkPi
hdfs://node4:9000/spark/spark/lib/spark-examples-1.6.1-hadoop2.6.0.jar 10
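
One detail worth checking: in coarse-grained Mesos mode each driver's scheduler
backend will accept every CPU offer unless spark.cores.max is capped, so the
first drivers can starve the later ones' executors, which then log "Initial job
has not accepted any resources". A minimal sketch of capping cores per driver
(the value 2 is only an example; it can also be passed as
--conf spark.cores.max=2 on spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Assumption: with 6 concurrent drivers on a small Mesos cluster, each driver
// must leave CPUs free for the other drivers' executors.
val conf = new SparkConf()
  .setAppName("SparkPi")
  .set("spark.cores.max", "2")        // cap total executor cores for this driver
  .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)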

My environment :
mesos version 0.28.1 , node1 master, node2 slave
MesosClusterDispatcher, node3
hadoop cluster version 2.6.4, node4, node5

each node has 4cpu, 8G ram

config file :
spark-defaults-conf
spark.master                mesos://node3:7077
spark.executor.uri          hdfs://node4:9000/spark/spark-1.6.1-bin-hadoop2.6.tgz
spark.mesos.executor.home   /root/spark

*Stack Trace* :

I0627 22:18:41.414885  3422 sched.cpp:703] Framework registered with
24f88e71-0eee-4023-9e5d-2e4595e2c5b4-0002
16/06/27 22:18:41 INFO mesos.CoarseMesosSchedulerBackend: Registered
as framework ID 24f88e71-0eee-4023-9e5d-2e4595e2c5b4-0002
16/06/27 22:18:41 INFO util.Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port
48636.
16/06/27 22:18:41 INFO netty.NettyBlockTransferService: Server created on 48636
16/06/27 22:18:41 INFO storage.BlockManagerMaster: Trying to register
BlockManager
16/06/27 22:18:41 INFO storage.BlockManagerMasterEndpoint: Registering
block manager 192.168.1.46:48636 with 511.5 MB RAM,
BlockManagerId(driver, 192.168.1.46, 48636)
16/06/27 22:18:41 INFO storage.BlockManagerMaster: Registered BlockManager
16/06/27 22:18:41 INFO mesos.CoarseMesosSchedulerBackend:
SchedulerBackend is ready for scheduling beginning after reached
minRegisteredResourcesRatio: 0.0
16/06/27 22:18:42 INFO spark.SparkContext: Starting job: reduce at
SparkPi.scala:36
16/06/27 22:18:42 INFO scheduler.DAGScheduler: Got job 0 (reduce at
SparkPi.scala:36) with 10 output partitions
16/06/27 22:18:42 INFO scheduler.DAGScheduler: Final stage:
ResultStage 0 (reduce at SparkPi.scala:36)
16/06/27 22:18:42 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/27 22:18:42 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/27 22:18:42 INFO scheduler.DAGScheduler: Submitting ResultStage
0 (MapPartitionsRDD[1] at map at SparkPi.scala:32), which has no
missing parents
16/06/27 22:18:42 INFO storage.MemoryStore: Block broadcast_0 stored
as values in memory (estimated size 1904.0 B, free 1904.0 B)
16/06/27 22:18:42 INFO storage.MemoryStore: Block broadcast_0_piece0
stored as bytes in memory (estimated size 1218.0 B, free 3.0 KB)
16/06/27 22:18:42 INFO storage.BlockManagerInfo: Added
broadcast_0_piece0 in memory on 192.168.1.46:48636 (size: 1218.0 B,
free: 511.5 MB)
16/06/27 22:18:42 INFO spark.SparkContext: Created broadcast 0 from
broadcast at DAGScheduler.scala:1006
16/06/27 22:18:42 INFO scheduler.DAGScheduler: Submitting 10 missing
tasks from ResultStage 0 (MapPartitionsRDD[1] at map at
SparkPi.scala:32)
16/06/27 22:18:42 INFO scheduler.TaskSchedulerImpl: Adding task set
0.0 with 10 tasks
16/06/27 22:18:58 WARN scheduler.TaskSchedulerImpl: Initial job has
not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient resources
16/06/27 22:19:13 WARN scheduler.TaskSchedulerImpl: Initial job has
not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient resources
16/06/27 22:19:28 WARN scheduler.TaskSchedulerImpl: Initial job has
not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient resources


[ANNOUNCE] Announcing Spark 1.6.2

2016-06-27 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.2! This maintenance
release includes fixes across several areas of Spark. You can find the list
of changes here: https://s.apache.org/spark-1.6.2

And download the release here: http://spark.apache.org/downloads.html


Re: Running into issue using SparkIMain

2016-06-27 Thread Jayant Shekhar
I tried setting the classpath explicitly in the settings. The classpath gets
printed properly and contains the Scala jars, such as
scala-compiler-2.10.4.jar and scala-library-2.10.4.jar.

It did not help. It still runs great with IntelliJ, but runs into issues when
running from the command line.

val cl = this.getClass.getClassLoader

val urls = cl match {
  case cl: java.net.URLClassLoader => cl.getURLs.toList
  case a => sys.error("oops: I was expecting an URLClassLoader, found a " + a.getClass)
}

val classpath = urls map { _.toString }

println("classpath=" + classpath)

settings.classpath.value = classpath.distinct.mkString(java.io.File.pathSeparator)

settings.embeddedDefaults(cl)
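
A possible variant to try, sketched under the assumption that the
scala-library and scala-compiler jars are actually visible on java.class.path
when launching with java -jar (with a Spring Boot style nested-jar layout they
may not be, in which case they would need to be supplied via -cp or unpacked):

import scala.tools.nsc.Settings

val settings = new Settings()
settings.usejavacp.value = true
// feed the compiler the JVM classpath explicitly instead of relying on the
// context classloader being a plain URLClassLoader over the scala jars
settings.classpath.value = System.getProperty("java.class.path")
settings.embeddedDefaults(Thread.currentThread().getContextClassLoader)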


-Jayant


On Mon, Jun 27, 2016 at 3:19 PM, Jayant Shekhar 
wrote:

> Hello,
>
> I'm trying to run Scala code in a Web Application.
>
> It runs great when I run it in IntelliJ,
> but I run into an error when I run it from the command line.
>
> Command used to run
> --
>
> java -Dscala.usejavacp=true  -jar target/XYZ.war 
> --spring.config.name=application,db,log4j
> --spring.config.location=file:./conf/history
>
> Error
> ---
>
> Failed to initialize compiler: object scala.runtime in compiler mirror not
> found.
>
> ** Note that as of 2.8 scala does not assume use of the java classpath.
>
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>
> ** object programatically, settings.usejavacp.value = true.
>
> 16/06/27 15:12:02 WARN SparkIMain: Warning: compiler accessed before init
> set up.  Assuming no postInit code.
>
>
> I'm also setting the following:
> 
>
> val settings = new Settings()
>
>  settings.embeddedDefaults(Thread.currentThread().getContextClassLoader())
>
>  settings.usejavacp.value = true
>
> Any pointers to the solution would be great.
>
> Thanks,
> Jayant
>
>


unsubscribe

2016-06-27 Thread Thomas Ginter


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Running into issue using SparkIMain

2016-06-27 Thread Jayant Shekhar
Hello,

I'm trying to run Scala code in a Web Application.

It runs great when I run it in IntelliJ,
but I run into an error when I run it from the command line.

Command used to run
--

java -Dscala.usejavacp=true  -jar target/XYZ.war
--spring.config.name=application,db,log4j
--spring.config.location=file:./conf/history

Error
---

Failed to initialize compiler: object scala.runtime in compiler mirror not
found.

** Note that as of 2.8 scala does not assume use of the java classpath.

** For the old behavior pass -usejavacp to scala, or if using a Settings

** object programatically, settings.usejavacp.value = true.

16/06/27 15:12:02 WARN SparkIMain: Warning: compiler accessed before init
set up.  Assuming no postInit code.


I'm also setting the following:


val settings = new Settings()

 settings.embeddedDefaults(Thread.currentThread().getContextClassLoader())

 settings.usejavacp.value = true

Any pointers to the solution would be great.

Thanks,
Jayant


Re: What is the explanation of "ConvertToUnsafe" in "Physical Plan"

2016-06-27 Thread Xinh Huynh
I guess it has to do with the Tungsten explicit memory management that
builds on sun.misc.Unsafe. The "ConvertToUnsafe" class converts
Java-object-based rows into UnsafeRows, which use Spark's internal,
memory-efficient binary format.

Here is the related code in 1.6:

ConvertToUnsafe is defined in:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/rowFormatConverters.scala

/**
* Converts Java-object-based rows into [[UnsafeRow]]s.
*/
case class ConvertToUnsafe(child: SparkPlan) extends UnaryNode

And, UnsafeRow is defined in:
https://github.com/apache/spark/blob/branch-1.6/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java

/**
* An Unsafe implementation of Row which is backed by raw memory instead of
Java objects.
*
* Each tuple has three parts: [null bit set] [values] [variable length
portion]
*
* The bit set is used for null tracking and is aligned to 8-byte word
boundaries. It stores
* one bit per field.
*
* In the `values` region, we store one 8-byte word per field. For fields
that hold fixed-length
* primitive types, such as long, double, or int, we store the value
directly in the word. For
* fields with non-primitive or variable-length values, we store a relative
offset (w.r.t. the
* base address of the row) that points to the beginning of the
variable-length field, and length
* (they are combined into a long).
*
* Instances of `UnsafeRow` act as pointers to row data stored in this
format.
*/
public final class UnsafeRow extends MutableRow implements Externalizable,
KryoSerializable

Xinh

On Sun, Jun 26, 2016 at 1:11 PM, Mich Talebzadeh 
wrote:

>
> Hi,
>
> In Spark's Physical Plan what is the explanation for ConvertToUnsafe?
>
> Example:
>
> scala> sorted.filter($"prod_id" ===13).explain
> == Physical Plan ==
> Filter (prod_id#10L = 13)
> +- Sort [prod_id#10L ASC,cust_id#11L ASC,time_id#12 ASC,channel_id#13L
> ASC,promo_id#14L ASC], true, 0
>+- ConvertToUnsafe
>   +- Exchange rangepartitioning(prod_id#10L ASC,cust_id#11L
> ASC,time_id#12 ASC,channel_id#13L ASC,promo_id#14L ASC,200), None
>  +- HiveTableScan
> [prod_id#10L,cust_id#11L,time_id#12,channel_id#13L,promo_id#14L],
> MetastoreRelation oraclehadoop, sales2, None
>
>
> My inclination is that it is a temporary construct like tempTable created
> as part of Physical Plan?
>
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


JavaSparkContext: dependency on ui/

2016-06-27 Thread jay vyas
I notice that there is a dependency from the SparkContext on the
"createLiveUI" functionality.

Is that really required? Or is there a more minimal JavaSparkContext we
can create?

I'm packaging a jar with a Spark client and would rather avoid resource
dependencies, as they might be trickier to maintain than class dependencies
alone.


java.lang.Exception: Could not find resource path for Web UI:
org/apache/spark/ui/static
at
org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:182)
at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:73)
at org.apache.spark.ui.SparkUI.(SparkUI.scala:81)
at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
at org.apache.spark.SparkContext.(SparkContext.scala:445)
at
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
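
One thing that may help here: in SparkContext the live UI is only created when
spark.ui.enabled is true, so disabling it should avoid the
org/apache/spark/ui/static resource lookup entirely. A minimal sketch
(assuming the embedded client does not need the web UI):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("embedded-client")
  .set("spark.ui.enabled", "false")   // skip SparkUI.createLiveUI and its static resources

val sc = new SparkContext(conf)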

-- 
jay vyas


Re: Best practice for handing tables between pipeline components

2016-06-27 Thread Gene Pang
Yes, Alluxio (http://www.alluxio.org/) can be used to store data in-memory
between stages in a pipeline.

Here is more information about running Spark with Alluxio:
http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html

Hope that helps,
Gene
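
As a rough illustration of the hand-off, assuming an Alluxio master at
alluxio://master:19998 and the Alluxio client jar on the Spark classpath
(host, port, and paths below are placeholders):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                         // sc: an existing SparkContext
val df = sqlContext.read.json("s3a://some-bucket/input")    // hypothetical source

// the producing component writes once to Alluxio-managed storage
df.write.parquet("alluxio://master:19998/pipeline/stage1_output")

// any later component (Spark, or non-Spark via Alluxio's Hadoop-compatible FS)
// reads the same path without going back to S3/HDFS
val stage1 = sqlContext.read.parquet("alluxio://master:19998/pipeline/stage1_output")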

On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Alluxio off heap memory would help to share cached objects
>
> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson 
> wrote:
>
>> Hi,
>>
>> We have a pipeline of components strung together via Airflow running on
>> AWS. Some of them are implemented in Spark, but some aren't. Generally they
>> can all talk to a JDBC/ODBC end point or read/write files from S3.
>>
>> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS
>> or S3 and reading it back in, again, in every component, if it could stay
>> cached in memory in a Spark cluster.
>>
>> Our current investigation seems to lead us towards exploring if the
>> following things are possible:
>>
>>- Using a Hive metastore with S3 as its backing data store to try to
>>keep a mapping from table name to files on S3 (not sure if one can cache a
>>Hive table in Spark across contexts, though)
>>- Using something like the spark-jobserver to keep a Spark SQLContext
>>open across Spark components so they could avoid file I/O for cached 
>> tables
>>
>> What's the best practice for handing tables between Spark programs? What
>> about between Spark and non-Spark programs?
>>
>> Thanks!
>>
>> - Everett
>>
>>


MapWithState would not restore from checkpoint.

2016-06-27 Thread Sergey Zelvenskiy
MapWithState would not restore from a checkpoint. The MapWithStateRDD code
requires a non-empty SparkContext, while the context is empty.
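
For reference, the usual pattern for restoring a streaming job (including
mapWithState state) from a checkpoint is StreamingContext.getOrCreate, with
the whole DStream graph built inside the creating function — a minimal sketch,
with the checkpoint path and batch interval as placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("mapWithState-restore")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // build the full DStream graph here (sources, mapWithState, outputs),
  // so it can be reconstructed when the checkpoint is read back
  ssc
}

// returns the checkpointed context if one exists, otherwise calls createContext
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()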


ERROR 2016-06-27 11:06:33,236 0 org.apache.spark.streaming.StreamingContext
[run-main-0] Error starting the context, marking it as stopped
org.apache.spark.SparkException: RDD transformations and actions can only
be invoked by the driver, not inside of other transformations; for example,
rdd1.map(x => rdd2.values.count() * x) is invalid because the values
transformation and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.org 
$apache$spark$rdd$RDD$$sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at
org.apache.spark.rdd.PairRDDFunctions.partitionBy(PairRDDFunctions.scala:530)
at
org.apache.spark.streaming.rdd.MapWithStateRDD$.createFromPairRDD(MapWithStateRDD.scala:189)
at
org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:146)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at
org.apache.spark.streaming.dstream.MapWithStateDStreamImpl.compute(MapWithStateDStream.scala:65)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at
org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at

Spark ML - Java implementation of custom Transformer

2016-06-27 Thread Mehdi Meziane



Hi all, 


We have some problems implementing custom Transformers in Java (Spark 1.6.1).
We do override the copy method, but it crashes with an AbstractMethodError.

If we extend UnaryTransformer and do not override the copy method, it
works without any error.


We tried to write the copy like in these examples : 
https://github.com/apache/spark/blob/branch-2.0/mllib/src/test/java/org/apache/spark/ml/param/JavaTestParams.java
 
https://github.com/eBay/Spark/blob/branch-1.6/examples/src/main/java/org/apache/spark/examples/ml/JavaDeveloperApiExample.java
 


None of it worked. 


The copy is defined in the Params class as : 


/** 
* Creates a copy of this instance with the same UID and some extra params. 
* Subclasses should implement this method and set the return type properly. 
* 
* @see [[defaultCopy()]] 
*/ 
def copy(extra: ParamMap): Params 
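
For comparison, the usual Scala pattern is to delegate to defaultCopy
(UnaryTransformer itself does exactly that), and a Java override would
normally just return defaultCopy(extra). A minimal Scala sketch with a
hypothetical transformer, in case it helps narrow down the signature mismatch:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// hypothetical transformer that upper-cases a string column
class UpperCaser(override val uid: String)
    extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  override protected def createTransformFunc: String => String = _.toUpperCase

  override protected def outputDataType: DataType = StringType

  // the common copy implementation: same UID, extra params merged in
  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}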


Any idea? 
Thanks, 


Mehdi 

Re: Best practice for handing tables between pipeline components

2016-06-27 Thread Sathish Kumaran Vairavelu
Alluxio off heap memory would help to share cached objects
On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson 
wrote:

> Hi,
>
> We have a pipeline of components strung together via Airflow running on
> AWS. Some of them are implemented in Spark, but some aren't. Generally they
> can all talk to a JDBC/ODBC end point or read/write files from S3.
>
> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS
> or S3 and reading it back in, again, in every component, if it could stay
> cached in memory in a Spark cluster.
>
> Our current investigation seems to lead us towards exploring if the
> following things are possible:
>
>- Using a Hive metastore with S3 as its backing data store to try to
>keep a mapping from table name to files on S3 (not sure if one can cache a
>Hive table in Spark across contexts, though)
>- Using something like the spark-jobserver to keep a Spark SQLContext
>open across Spark components so they could avoid file I/O for cached tables
>
> What's the best practice for handing tables between Spark programs? What
> about between Spark and non-Spark programs?
>
> Thanks!
>
> - Everett
>
>


RE: Utils and Logging cannot be accessed in package ....

2016-06-27 Thread Paolo Patierno
Yes, I have just realized that the code I was reading was in the
org.apache.spark package, related to custom receiver implementations.

Thanks.

Paolo PatiernoSenior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoTMicrosoft Azure Advisor 
Twitter : @ppatierno
Linkedin : paolopatierno
Blog : DevExperience

From: yuzhih...@gmail.com
Date: Mon, 27 Jun 2016 10:28:50 -0700
Subject: Re: Utils and Logging cannot be accessed in package 
To: ppatie...@live.com
CC: user@spark.apache.org

AFAICT Utils is private:
private[spark] object Utils extends Logging {

So is Logging:
private[spark] trait Logging {

FYI
On Mon, Jun 27, 2016 at 8:20 AM, Paolo Patierno  wrote:



Hello,

I'm trying to use the Utils.createTempDir() method by importing
org.apache.spark.util.Utils, but the Scala compiler tells me that:

object Utils in package util cannot be accessed in package org.apache.spark.util

I'm facing the same problem with Logging.

My sbt file has the following dependency:

"org.apache.spark" %% "spark-core" % sparkVersion.value % "provided" classifier 
"tests"

where spark version is "2.0.0-SNAPSHOT".

Any ideas about this problem ?

Thanks,
Paolo.
  

  

Re: Utils and Logging cannot be accessed in package ....

2016-06-27 Thread Ted Yu
AFAICT Utils is private:

private[spark] object Utils extends Logging {

So is Logging:

private[spark] trait Logging {

FYI
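
A common workaround, given that both are private[spark], is to use the JDK /
SLF4J equivalents directly — a minimal sketch:

import java.nio.file.Files
import org.slf4j.LoggerFactory

// Utils.createTempDir() is private[spark]; the JDK offers the same capability
val tempDir = Files.createTempDirectory("my-spark-test").toFile
tempDir.deleteOnExit()

// Spark's Logging trait is also private[spark]; SLF4J (already on Spark's
// classpath) gives equivalent logging
val log = LoggerFactory.getLogger(getClass)
log.info("Created temp dir at {}", tempDir.getAbsolutePath)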

On Mon, Jun 27, 2016 at 8:20 AM, Paolo Patierno  wrote:

> Hello,
>
> I'm trying to use the Utils.createTempDir() method by importing
> org.apache.spark.util.Utils, but the Scala compiler tells me that:
>
> object Utils in package util cannot be accessed in package
> org.apache.spark.util
>
> I'm facing the same problem with Logging.
>
> My sbt file has the following dependency:
>
> "org.apache.spark" %% "spark-core" % sparkVersion.value % "provided"
> classifier "tests"
>
> where spark version is "2.0.0-SNAPSHOT".
>
> Any ideas about this problem ?
>
> Thanks,
> Paolo.
>


Re: Arrays in Datasets (1.6.1)

2016-06-27 Thread Ted Yu
Can you show the stack trace for encoding error(s) ?

Have you looked at the following test which involves NestedArray of
primitive type ?

./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala

Cheers
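
For reference, a minimal self-contained variant of that pattern which
typically compiles on 1.6.1 (assuming the SQLContext implicits are imported
for the encoders) — comparing against it may help narrow down where the
encoder error comes from:

import org.apache.spark.sql.SQLContext

case class InvertedIndex(partition: Int, docs: Array[Int],
                         indices: Array[Long], weights: Array[Double])

val sqlContext = new SQLContext(sc)      // sc: an existing SparkContext
import sqlContext.implicits._            // brings the case-class / tuple encoders into scope

val inv = sc.parallelize(Seq(
  InvertedIndex(0, Array(1, 2), Array(0L, 3L), Array(0.5, 1.5)),
  InvertedIndex(1, Array(3), Array(7L), Array(2.0))))

val invertedIndexDataset = sqlContext.createDataset(inv)

// mapGroups also needs an implicit encoder for its result type (here a tuple)
val sizes = invertedIndexDataset.groupBy(_.partition).mapGroups {
  (partition, group) => (partition, group.size)
}
sizes.show()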

On Mon, Jun 27, 2016 at 8:50 AM, Daniel Imberman 
wrote:

> Hi all,
>
> So I've been attempting to reformat a project I'm working on to use the
> Dataset API and have been having some issues with encoding errors. From
> what I've read, I think that I should be able to store Arrays of primitive
> values in a dataset. However, the following class gives me encoding errors:
>
> case class InvertedIndex(partition:Int, docs:Array[Int],
> indices:Array[Long], weights:Array[Double])
>
> val inv = RDD[InvertedIndex]
> val invertedIndexDataset = sqlContext.createDataset(inv)
> invertedIndexDataset.groupBy(x => x.partition).mapGroups {
> //...
> }
>
> Could someone please help me understand what the issue is here? Can
> Datasets not currently handle Arrays of primitives, or is there something
> extra that I need to do to make them work?
>
> Thank you
>
>


Best practice for handing tables between pipeline components

2016-06-27 Thread Everett Anderson
Hi,

We have a pipeline of components strung together via Airflow running on
AWS. Some of them are implemented in Spark, but some aren't. Generally they
can all talk to a JDBC/ODBC end point or read/write files from S3.

Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS or
S3 and reading it back in, again, in every component, if it could stay
cached in memory in a Spark cluster.

Our current investigation seems to lead us towards exploring if the
following things are possible:

   - Using a Hive metastore with S3 as its backing data store to try to
   keep a mapping from table name to files on S3 (not sure if one can cache a
   Hive table in Spark across contexts, though)
   - Using something like the spark-jobserver to keep a Spark SQLContext
   open across Spark components so they could avoid file I/O for cached tables

What's the best practice for handing tables between Spark programs? What
about between Spark and non-Spark programs?

Thanks!

- Everett


Arrays in Datasets (1.6.1)

2016-06-27 Thread Daniel Imberman
Hi all,

So I've been attempting to reformat a project I'm working on to use the
Dataset API and have been having some issues with encoding errors. From
what I've read, I think that I should be able to store Arrays of primitive
values in a dataset. However, the following class gives me encoding errors:

case class InvertedIndex(partition:Int, docs:Array[Int],
indices:Array[Long], weights:Array[Double])

val inv = RDD[InvertedIndex]
val invertedIndexDataset = sqlContext.createDataset(inv)
invertedIndexDataset.groupBy(x => x.partition).mapGroups {
//...
}

Could someone please help me understand what the issue is here? Can
Datasets not currently handle Arrays of primitives, or is there something
extra that I need to do to make them work?

Thank you


run spark sql with script transformation failed

2016-06-27 Thread linxi zeng
Hi, all:
Recently we have been trying to compare Spark SQL with Hive on MR. I have
tried to run Spark SQL (Spark 1.6 RC2) with a script transformation; the
Spark job failed with an error message like:

16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code
generated in 19.054534 ms

16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread:
/bin/bash: test.py: command not found



16/06/26 11:01:28 ERROR util.Utils: Uncaught exception in thread
Thread-ScriptTransformation-Feed

java.io.IOException: Stream closed

at 
java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)

at java.io.OutputStream.write(OutputStream.java:116)

at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)

at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)

at java.io.DataOutputStream.write(DataOutputStream.java:107)

at 
org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:277)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:255)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformation.scala:255)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformation.scala:244)

16/06/26 11:01:28 ERROR util.SparkUncaughtExceptionHandler: Uncaught
exception in thread Thread[Thread-ScriptTransformation-Feed,5,main]

java.io.IOException: Stream closed

at 
java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)

at java.io.OutputStream.write(OutputStream.java:116)

at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)

at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)

at java.io.DataOutputStream.write(DataOutputStream.java:107)

at 
org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:277)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(ScriptTransformation.scala:255)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformation.scala:255)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformation.scala:244)

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)

at 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformation.scala:244)


cmd is:

> spark-1.6/bin/spark-sql -f transform.sql


the sql and python script is:
transform.sql (which was executed successfully on hive) :

> add file /tmp/spark_sql_test/test.py;
> select transform(cityname) using 'test.py' as (new_cityname) from
> test.spark2_orc where dt='20160622' limit 5 ;

test.py:

> #!/usr/bin/env python
> #coding=utf-8
> import sys
> import string
> reload(sys)
> sys.setdefaultencoding('utf8')
> for line in sys.stdin:
>     cityname = line.strip("\n").split("\t")[0]
>     lt = []
>     lt.append(cityname + "_zlx")
>     print "\t".join(lt)


And after making two modifications:
(1) chmod +x test.py
(2) in transform.sql: using 'test.py'  ->  using './test.py'
the SQL executed successfully.
I was wondering whether Spark SQL with script transformation is supposed to
be run this way. Has anyone met the same problem?


Utils and Logging cannot be accessed in package ....

2016-06-27 Thread Paolo Patierno
Hello,

I'm trying to use the Utils.createTempDir() method by importing
org.apache.spark.util.Utils, but the Scala compiler tells me that:

object Utils in package util cannot be accessed in package org.apache.spark.util

I'm facing the same problem with Logging.

My sbt file has the following dependency:

"org.apache.spark" %% "spark-core" % sparkVersion.value % "provided" classifier 
"tests"

where spark version is "2.0.0-SNAPSHOT".

Any ideas about this problem ?

Thanks,
Paolo.
  

Unsubscribe

2016-06-27 Thread Steve Florence



Spark partition formula on standalone mode?

2016-06-27 Thread kali.tumm...@gmail.com
Hi All,

I have worked with Spark installed on a Hadoop cluster, but I have never
worked with Spark on a standalone cluster.

My question is: how do I set the number of partitions in Spark when it's
running on a standalone cluster?

With Spark on Hadoop I calculate my formula using the HDFS block size, but
how do I calculate it without an HDFS block size when Spark runs on a
standalone, non-Hadoop cluster?
 
Partition formula for a 100 GB file:
HDFS block size: 256 MB

Partitions: (100 * 1024) / 256 = 400

Executors: 100 / 4 = 25

Executor memory: 160 GB / 25 ≈ 7 GB
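
Without HDFS, the usual knobs are the minPartitions argument of the input API
and spark.default.parallelism — a minimal sketch (paths and numbers are
placeholders, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("partitioning-example")
  // common rule of thumb: 2-3 partitions per core available in the cluster
  .set("spark.default.parallelism", "400")

val sc = new SparkContext(conf)

// with no HDFS block size, ask the reader for a minimum number of partitions
val data = sc.textFile("file:///data/input-100gb.csv", 400)

// or reshape explicitly after loading
val repartitioned = data.repartition(400)
println(repartitioned.getNumPartitions)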







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-partition-formula-on-standalone-mode-tp27237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



GraphX :Running on a Cluster

2016-06-27 Thread isaranto
Hi,
I have been trying to run some algorithms i have implemented using GraphX
and Spark.
I have been running these algorithms locally by starting a local spark
instance through IntelliJ (in scala).
However when I try to run them on a cluster with 10 machines I get 

java.lang.ClassNotFoundException: org.centrality.spark.DegreeCentrality
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

when running 

spark-submit --master spark://sparkmaster:7077 --class
org.centrality.spark.DegreeCentrality target/spark-1.0-SNAPSHOT.jar

the jar is a fat jar which includes all the dependencies..

I also tried running each algorithm as a script loaded into the Spark shell
("spark-shell -i DegreeCentrality.scala"; I also include System.exit(0) at the
end). However, when I run the script it gets stuck while the graph is being
created: val graph = GraphLoader.edgeListFile(sc,
"hdfs://sparkmaster:7077/user/ilias/graphx/followers.txt"). I have been able
to run a simple WordCount example (reading from an HDFS file as well), but
have still not been able to create a graph from a text file.

Thanks a lot...
Ilias




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Running-on-a-Cluster-tp27236.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Subtract two DStreams

2016-06-27 Thread Marius Soutier
Can't you use `transform` instead of `foreachRDD`?
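
A sketch of what that could look like with transformWith, which pairs the two
streams up batch by batch and lets RDD.subtract do the diff (the helper name
is just for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import scala.reflect.ClassTag

// keeps the elements of `original` that the joinWithCassandraTable step
// dropped from `modified`, without resorting to foreachRDD
def diff[T: ClassTag](original: DStream[T], modified: DStream[T]): DStream[T] =
  original.transformWith(modified,
    (origRdd: RDD[T], modRdd: RDD[T]) => origRdd.subtract(modRdd))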


> On 15.06.2016, at 15:18, Matthias Niehoff  
> wrote:
> 
> Hi,
> 
> I want to subtract 2 DStreams (based on the same input stream) to get all 
> elements that exist in the original stream, but not in the modified stream 
> (the modified stream is changed using joinWithCassandraTable, which does an 
> inner join and because of this might remove entries).
> 
> Subtract is only possible on RDDs. So I could use a foreachRDD right at the 
> beginning of the stream processing and work on RDDs. I think it's quite ugly 
> to use the output op at the beginning and then implement a lot of 
> transformations in the foreachRDD. So can you think of different ways to do 
> an efficient diff between two DStreams?
> 
> Thank you
> 
> -- 
> Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
> codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
> tel: +49 (0) 721.9595-681  | fax: +49 (0) 
> 721.9595-666  | mobil: +49 (0) 
> 172.1702676 
> www.codecentric.de  | blog.codecentric.de 
>  | www.meettheexperts.de 
>  | www.more4fi.de  
> 
> Sitz der Gesellschaft: Solingen | HRB 25917| Amtsgericht Wuppertal
> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz
> 
> This e-mail, including any attached files, contains confidential and/or 
> legally protected information. If you are not the intended recipient or have 
> received this e-mail in error, please notify the sender immediately and 
> delete this e-mail and any attached files. Unauthorized copying, use or 
> opening of attached files, as well as unauthorized forwarding of this 
> e-mail, is not permitted.



Spark SQL poor join performance

2016-06-27 Thread Samo Sarajevo
I'm using Spark SQL to build a fact table out of 5 dimensions. I'm facing a 
performance issue (the job takes several hours to complete), and even after 
exhaustive googling I see no solution. These are the settings I have tried 
tuning, with no success. 

sqlContext.sql("set spark.sql.shuffle.partitions=10"); // varied between 10 and 5000 
sqlContext.sql("set spark.sql.autoBroadcastJoinThreshold=5"); // 500 MB, tried 1 GB 

Most of the RDDs are nicely partitioned (500 partitions each); however, the largest 
dimension is not partitioned at all (images: http://imgur.com/a/cUC3d). Below is 
the code I have used for making the fact table. 

resultDmn1.registerTempTable("Dmn1"); 
    resultDmn2.registerTempTable("Dmn2"); 
    resultDmn3.registerTempTable("Dmn3"); 
    resultDmn4.registerTempTable("Dmn4"); 
    resultDmn5.registerTempTable("Dmn5"); 

    DataFrame resultFact = sqlContext.sql("SELECT DISTINCT\n" + 
            "    0 AS FactId,\n" + 
            "    rs.c28 AS c28,\n" + 
            "    dop.DmnId AS dmn_id_dim4,\n" + 
            "    dh.DmnId AS dmn_id_dim5,\n" + 
            "    op.DmnId AS dmn_id_dim3,\n" + 
            "    du.DmnId AS dmn_id_dim2,\n" + 
            "    dc.DmnId AS dmn_id_dim1\n" + 
            "FROM\n" + 
            "    t10 rs\n" + 
            "        JOIN\n" + 
            "    t11 r ON rs.c29 = r.id\n" + 
            "        JOIN\n" + 
            "    Dmn4 dop ON dop.c26 = r.c25\n" + 
            "        JOIN\n" + 
            "    Dmn5 dh ON dh.Date = r.c27\n" + 
            "        JOIN\n" + 
            "    Dmn3 du ON du.c9 = r.c16\n" + 
            "        JOIN\n" + 
            "    t1 d ON r.c5 = d.id\n" + 
            "        JOIN\n" + 
            "    t2 di ON d.id = di.c5\n" + 
            "        JOIN\n" + 
            "    t3 s ON d.c6 = s.id\n" + 
            "        JOIN\n" + 
            "    t4 p ON s.c7 = p.id\n" + 
            "        JOIN\n" + 
            "    t5 o ON p.c8 = o.id\n" + 
            "        JOIN\n" + 
            "    Dmn1 op ON op.c1 = di.c1\n" + 
            "        JOIN\n" + 
            "    t9 ci ON ci.id = r.c24\n" + 
            "        JOIN\n" + 
            "    Dmn3 dc ON dc.c18 = ci.c23\n" + 
            "WHERE\n" + 
            "    op.c2 = di.c2\n" + 
            "        AND o.name = op.c30\n" + 
            "        AND di.c3 = op.c3\n" + 
            "        AND di.c4 = op.c4").toSchemaRDD(); 

     resultFact.count(); 
     resultFact.cache(); 

Dmn1 has 56 rows, dmn2 11, dmn3 10, dmn4 12, and dmn5 1275533 rows prior this 
join. Everything is running on AWS EMR cluster, with 3 m3.2xlarge nodes in 
cluster (master + 2 slaves). 

Here is result of explain: http://pastebin.com/ZRUdUuYT
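
One technique that often helps when tiny dimensions are joined against a much
larger side is to force broadcast joins explicitly rather than relying on
autoBroadcastJoinThreshold — a minimal, self-contained sketch (table and
column names are made up):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

val sqlContext = new SQLContext(sc)     // sc: an existing SparkContext
import sqlContext.implicits._

// toy stand-ins: a large fact-like side and a 56-row dimension
val fact = sc.parallelize(1 to 1000000).map(i => (i % 56, i)).toDF("dmnKey", "value")
val smallDmn = sc.parallelize(0 until 56).map(i => (i, s"name_$i")).toDF("dmnKey", "name")

// broadcast() ships the small table to every executor, so the join avoids
// shuffling the large side for this step
val joined = fact.join(broadcast(smallDmn), Seq("dmnKey"))
joined.explain()   // should show BroadcastHashJoin rather than a shuffle join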

Spark SQL poor join performance

2016-06-27 Thread vegass
I'm using Spark SQL to build a fact table out of 5 dimensions. I'm facing a
performance issue (the job takes several hours to complete), and even after
exhaustive googling I see no solution. These are the settings I have tried
tuning, with no success. 

sqlContext.sql("set spark.sql.shuffle.partitions=10"); // varied between 10 and 5000 
sqlContext.sql("set spark.sql.autoBroadcastJoinThreshold=5"); // 500 MB, tried 1 GB 

Most of the RDDs are nicely partitioned (500 partitions each); however, the largest
dimension is not partitioned at all (images).
Maybe this can lead to a solution? Below is the code I have used for making the fact
table. 

resultDmn1.registerTempTable("Dmn1"); 
resultDmn2.registerTempTable("Dmn2"); 
resultDmn3.registerTempTable("Dmn3"); 
resultDmn4.registerTempTable("Dmn4"); 
resultDmn5.registerTempTable("Dmn5"); 

DataFrame resultFact = sqlContext.sql("SELECT DISTINCT\n" + 
"0 AS FactId,\n" + 
"rs.c28 AS c28,\n" + 
"dop.DmnId AS dmn_id_dim4,\n" + 
"dh.DmnId AS dmn_id_dim5,\n" + 
"op.DmnId AS dmn_id_dim3,\n" + 
"du.DmnId AS dmn_id_dim2,\n" + 
"dc.DmnId AS dmn_id_dim1\n" + 
"FROM\n" + 
"t10 rs\n" + 
"JOIN\n" + 
"t11 r ON rs.c29 = r.id\n" + 
"JOIN\n" + 
"Dmn4 dop ON dop.c26 = r.c25\n" + 
"JOIN\n" + 
"Dmn5 dh ON dh.Date = r.c27\n" + 
"JOIN\n" + 
"Dmn3 du ON du.c9 = r.c16\n" + 
"JOIN\n" + 
"t1 d ON r.c5 = d.id\n" + 
"JOIN\n" + 
"t2 di ON d.id = di.c5\n" + 
"JOIN\n" + 
"t3 s ON d.c6 = s.id\n" + 
"JOIN\n" + 
"t4 p ON s.c7 = p.id\n" + 
"JOIN\n" + 
"t5 o ON p.c8 = o.id\n" + 
"JOIN\n" + 
"Dmn1 op ON op.c1 = di.c1\n" + 
"JOIN\n" + 
"t9 ci ON ci.id = r.c24\n" + 
"JOIN\n" + 
"Dmn3 dc ON dc.c18 = ci.c23\n" + 
"WHERE\n" + 
"op.c2 = di.c2\n" + 
"AND o.name = op.c30\n" + 
"AND di.c3 = op.c3\n" + 
"AND di.c4 = op.c4").toSchemaRDD(); 

 resultFact.count(); 
 resultFact.cache(); 

Dmn1 has 56 rows, dmn2 11, dmn3 10, dmn4 12, and dmn5 1275533 rows prior
this join. Everything is running on AWS EMR cluster, with 3 m3.2xlarge nodes
in cluster (master + 2 slaves). 

Here is result of explain: http://pastebin.com/ZRUdUuYT



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-poor-join-performance-tp27235.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Querying Hive tables from Spark

2016-06-27 Thread Mich Talebzadeh
Hi,

I have done some extensive tests with Spark querying Hive tables.

It appears to me that Spark does not rely on the statistics that Hive
collects on, say, ORC tables. It seems that Spark uses its own optimization
to query Hive tables, irrespective of what Hive has collected by way of
statistics, etc.?

Case in point I have a FACT table bucketed on 5 dimensional foreign keys
like below

 CREATE TABLE IF NOT EXISTS oraclehadoop.sales2
 (
  PROD_IDbigint   ,
  CUST_IDbigint   ,
  TIME_IDtimestamp,
  CHANNEL_ID bigint   ,
  PROMO_ID   bigint   ,
  QUANTITY_SOLD  decimal(10)  ,
  AMOUNT_SOLDdecimal(10)
)
CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.create.index"="true",
"orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1")

Table is sorted in the order of prod_id, cust_id,time_id, channel_id and
promo_id. It has 22 million rows.

A simple query like below:

val s = HiveContext.table("sales2")
  s.filter($"prod_id" ===13 && $"cust_id" === 50833 && $"time_id" ===
"2000-12-26 00:00:00" && $"channel_id" === 2 && $"promo_id" === 999
).explain
  s.filter($"prod_id" ===13 && $"cust_id" === 50833 && $"time_id" ===
"2000-12-26 00:00:00" && $"channel_id" === 2 && $"promo_id" === 999
).collect.foreach(println)

Shows the plan as

== Physical Plan ==
Filter (prod_id#10L = 13) && (cust_id#11L = 50833)) && (time_id#12 =
9777888)) && (channel_id#13L = 2)) && (promo_id#14L = 999))
+- HiveTableScan
[prod_id#10L,cust_id#11L,time_id#12,channel_id#13L,promo_id#14L,quantity_sold#15,amount_sold#16],
MetastoreRelation oraclehadoop, sales2, None

*Spark returns 24 rows pretty fast in 22 seconds.*

Running the same on Hive with Spark as execution engine shows:

STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
TableScan
  alias: sales2
  Filter Operator
predicate: (prod_id = 13) and (cust_id = 50833)) and
(UDFToString(time_id) = '2000-12-26 00:00:00')) and (channel_id = 2)) and
(promo_id = 999)) (type: boolean)
Select Operator
  expressions: 13 (type: bigint), 50833 (type: bigint),
2000-12-26 00:00:00.0 (type: timestamp), 2 (type: bigint), 999 (type:
bigint), quantity_sold (type: decimal(10,0)), amount_sold (type:
decimal(10,0))
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6
  ListSink

*And Hive on Spark returns the same 24 rows in 30 seconds*

OK, the Hive query is just slower with the Spark engine.

Assuming that the time taken is optimization time + query time, it appears
that in most cases the optimization time does not really have that much
impact on the overall performance?


Let me know your thoughts.


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


[Spark 1.6.1] Beeline cannot start on Windows7

2016-06-27 Thread Haopu Wang
I see below stack trace when trying to run beeline command. I'm using
JDK 7. 

Anything wrong? Much thanks!

 

==

D:\spark\download\spark-1.6.1-bin-hadoop2.4>bin\beeline

Beeline version 1.6.1 by Apache Hive

Exception in thread "main" java.lang.NoSuchMethodError:
org.fusesource.jansi.internal.Kernel32.GetConsoleOutputCP()I

at
jline.WindowsTerminal.getConsoleOutputCodepage(WindowsTerminal.java:293)

at
jline.WindowsTerminal.getOutputEncoding(WindowsTerminal.java:186)

at jline.console.ConsoleReader.(ConsoleReader.java:230)

at jline.console.ConsoleReader.(ConsoleReader.java:221)

at jline.console.ConsoleReader.(ConsoleReader.java:209)

at
org.apache.hive.beeline.BeeLine.getConsoleReader(BeeLine.java:834)

at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:770)

at
org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:48
4)

at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)



Last() Window Function

2016-06-27 Thread Anton Okolnychyi
Hi all!

I am learning Spark SQL and window functions.
The behavior of the last() window function was unexpected for me in one
case(for a person without any previous experience in the window functions).

I define my window specification as follows:
Window.partitionBy('transportType, 'route).orderBy('eventTime).

So, I have neither rowsBetween nor rangeBetween boundaries.
In this scenario, I expect to get the latest event (by time) in a group if
I apply the last('eventTime) window function over this window
specification.
However, this does not happen.

Looking at the code, I was able to figure out that if there are no
range/rows boundaries, the UnspecifiedFrame is assigned. Later, in
ResolveWindowFrame for the last() function, Spark assigns a default window
frame. The default frame depends on the presence of any order specification
(if one has an order specification, the default frame is RANGE BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW). That's why the last() window function
does not work as I expected in my case. There is a very helpful comment in
SpecifiedWindowFrame. I wish I could find it in the documentation.
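
For what it's worth, spelling the frame out makes last() return the latest
event per group — a minimal sketch (`events` is a hypothetical DataFrame with
the three columns used here):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// an explicit full-partition frame, instead of the default
// RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
val fullWindow = Window
  .partitionBy("transportType", "route")
  .orderBy("eventTime")
  .rowsBetween(Long.MinValue, Long.MaxValue)

val withLatest = events.withColumn("latestEventTime", last("eventTime").over(fullWindow))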

That's why I have 2 questions:
- Did I miss the place in the documentation where this behavior is
described? If no, would it be appropriate from my side to try to find where
this can be done?
- Would it be appropriate/useful to add some window function examples to
spark/examples? There are no such so far

Sincerely,
Anton Okolnychyi


Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-06-27 Thread Amit Sela
OK. I see that, but the current (provided) implementations are very naive -
Sum, Count, Average -let's take Max for example: I guess zero() would be
set to some value like Long.MIN_VALUE, but what if you trigger (I assume in
the future Spark streaming will support time-based triggers) for a result
and there are no events ?

And like I said, for a more general use case: What if my zero() function
depends on my input ?

I just don't see the benefit of this behaviour, though I realise this is
the implementation.

Thanks,
Amit
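
For illustration, one way to satisfy the `b + zero() = b` contract without a
null zero is to make the buffer an Option — a sketch against the 2.0
Aggregator API (Kryo encoders used only to keep it short):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// max over Ints; zero() is a real identity (None), so an empty group
// yields None instead of relying on a null buffer
object MaxAgg extends Aggregator[Int, Option[Int], Option[Int]] {
  def zero: Option[Int] = None
  def reduce(b: Option[Int], a: Int): Option[Int] = Some(math.max(b.getOrElse(a), a))
  def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = (b1, b2) match {
    case (Some(x), Some(y)) => Some(math.max(x, y))
    case (some, None)       => some
    case (None, some)       => some
  }
  def finish(reduction: Option[Int]): Option[Int] = reduction
  def bufferEncoder: Encoder[Option[Int]] = Encoders.kryo[Option[Int]]
  def outputEncoder: Encoder[Option[Int]] = Encoders.kryo[Option[Int]]
}

Used as MaxAgg.toColumn over a Dataset[Int], an empty input then yields None
rather than depending on how a null zero is handled.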

On Sun, Jun 26, 2016 at 2:09 PM Takeshi Yamamuro 
wrote:

> No, TypedAggregateExpression that uses Aggregator#zero is different
> between v2.0 and v1.6.
> v2.0:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L91
> v1.6:
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L115
>
> // maropu
>
>
> On Sun, Jun 26, 2016 at 8:03 PM, Amit Sela  wrote:
>
>> This "if (value == null)" condition you point to exists in 1.6 branch as
>> well, so that's probably not the reason.
>>
>> On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Whatever it is, this is expected; if an initial value is null, spark
>>> codegen removes all the aggregates.
>>> See:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199
>>>
>>> // maropu
>>>
>>> On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela  wrote:
>>>
 Not sure about what's the rule in case of `b + null = null` but the
 same code works perfectly in 1.6.1, just tried it..

 On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro 
 wrote:

> Hi,
>
> This behaviour seems to be expected because you must ensure `b +
> zero() = b`
> The your case `b + null = null` breaks this rule.
> This is the same with v1.6.1.
> See:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57
>
> // maropu
>
>
> On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela 
> wrote:
>
>> Sometimes, the BUF for the aggregator may depend on the actual
>> input.. and while this passes the responsibility to handle null in
>> merge/reduce to the developer, it sounds fine to me if he is the one who
>> put null in zero() anyway.
>> Now, it seems that the aggregation is skipped entirely when zero() =
>> null. Not sure if that was the behaviour in 1.6
>>
>> Is this behaviour wanted ?
>>
>> Thanks,
>> Amit
>>
>> Aggregator example:
>>
>> public static class Agg extends Aggregator, 
>> Integer, Integer> {
>>
>>   @Override
>>   public Integer zero() {
>> return null;
>>   }
>>
>>   @Override
>>   public Integer reduce(Integer b, Tuple2 a) {
>> if (b == null) {
>>   b = 0;
>> }
>> return b + a._2();
>>   }
>>
>>   @Override
>>   public Integer merge(Integer b1, Integer b2) {
>> if (b1 == null) {
>>   return b2;
>> } else if (b2 == null) {
>>   return b1;
>> } else {
>>   return b1 + b2;
>> }
>>   }
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>

>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Difference between Dataframe and RDD Persisting

2016-06-27 Thread Jörn Franke
DataFrame uses a more efficient binary representation to store and persist 
data. You should go for that one in most cases. RDD is slower.

> On 27 Jun 2016, at 07:54, Brandon White  wrote:
> 
> What is the difference between persisting a dataframe and a rdd? When I 
> persist my RDD, the UI says it takes 50G or more of memory. When I persist my 
> dataframe, the UI says it takes 9G or less of memory.
> 
> Does the dataframe not persist the actual content? Is it better / faster to 
> persist a RDD when doing a lot of filter, mapping, and collecting operations? 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Thrift Server Concurrency

2016-06-27 Thread Prabhu Joseph
Spark Thrift Server is started with

./sbin/start-thriftserver.sh --master yarn-client --hiveconf
hive.server2.thrift.port=10001 --num-executors 4 --executor-cores 2
--executor-memory 4G --conf spark.scheduler.mode=FAIR

20 parallel below queries are executed

select distinct val2 from philips1 where key>=1000 and key<=1500

And there is no issue with the backend Spark executors: the Spark jobs UI
shows all 20 queries launched and completed with the same duration, and all
20 queries are received by the Spark Thrift Server at the same time. But the
Spark driver inside the Spark Thrift Server looks overloaded, so the queries
are not parsed and submitted to the executors at the same time, which is why
the client sees the delay in query execution time.





On Thu, Jun 23, 2016 at 11:12 PM, Michael Segel 
wrote:

> Hi,
> There are  a lot of moving parts and a lot of unknowns from your
> description.
> Besides the version stuff.
>
> How many executors, how many cores? How much memory?
> Are you persisting (memory and disk) or just caching (memory)
>
> During the execution… same tables… are  you seeing a lot of shuffling of
> data for some queries and not others?
>
> It sounds like an interesting problem…
>
> On Jun 23, 2016, at 5:21 AM, Prabhu Joseph 
> wrote:
>
> Hi All,
>
>On submitting 20 parallel same SQL query to Spark Thrift Server, the
> query execution time for some queries are less than a second and some are
> more than 2seconds. The Spark Thrift Server logs shows all 20 queries are
> submitted at same time 16/06/23 12:12:01 but the result schema are at
> different times.
>
> 16/06/23 12:12:01 INFO SparkExecuteStatementOperation: Running query
> 'select distinct val2 from philips1 where key>=1000 and key<=1500
>
> 16/06/23 12:12:*02* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2110)
> 16/06/23 12:12:*03* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2182)
> 16/06/23 12:12:*04* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2344)
> 16/06/23 12:12:*05* INFO SparkExecuteStatementOperation: Result Schema:
> ArrayBuffer(val2#2362)
>
> There are sufficient executors running on YARN. The concurrency is
> affected by Single Driver. How to improve the concurrency and what are the
> best practices.
>
> Thanks,
> Prabhu Joseph
>
>
>