Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
If you're looking for some kind of instrumentation finer than at batch
boundaries, you'd have to do something with the individual messages
yourself.  You have full access to the individual messages including
offset.
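
For instance, a rough, untested sketch with the 0-10 direct stream (assuming a
DStream created via KafkaUtils.createDirectStream, called "stream" below); each
ConsumerRecord carries its own topic, partition, offset and timestamp:

  import org.apache.spark.streaming.kafka010._

  stream.foreachRDD { rdd =>
    // offsets at batch boundaries
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // per-record instrumentation: runs on the executors
    rdd.foreach { record =>
      println(s"${record.topic} ${record.partition} ${record.offset} ${record.timestamp}")
    }

    // commit the batch's ending offsets back to Kafka
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }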

On Thu, Apr 27, 2017 at 1:27 PM, Dominik Safaric
 wrote:
> Of course I am not asking to commit for every message. Instead, I am seeking
> to commit the last consumed offset at a given interval. For example, from the
> 1st until the 5th second, messages up to offset 100.000 of partition 10 were
> consumed; then from the 6th until the 10th second of execution, the last
> consumed offset of the same partition was 200.000 - and so forth. This is the
> information I seek to get.
>
>> On 27 Apr 2017, at 20:11, Cody Koeninger  wrote:
>>
>> Are you asking for commits for every message?  Because that will kill
>> performance.
>>
>> On Thu, Apr 27, 2017 at 11:33 AM, Dominik Safaric
>>  wrote:
>>> Indeed I have. But, even when storing the offsets in Spark and committing
>>> offsets upon completion of an output operation within the foreachRDD call
>>> (as pointed out in the example), the only offset that Spark’s Kafka
>>> implementation commits to Kafka is the offset of the last message. For
>>> example, if I have 100 million messages, then Spark will commit only the
>>> 100 millionth offset, and not the offsets of the intermediate batches -
>>> hence the question.
>>>
 On 26 Apr 2017, at 21:42, Cody Koeninger  wrote:

 have you read

 http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself

 On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
  wrote:
> The reason why I want to obtain this information, i.e. <topic, partition,
> offset, timestamp> tuples, is to relate the consumption with the
> production rates using the __consumer_offsets Kafka internal topic.
> Interestingly, Spark’s KafkaConsumer implementation does not auto
> commit the offsets upon offset commit expiration, because as seen in the
> logs, Spark overrides the enable.auto.commit property to false.
>
> Any idea on how to use the KafkaConsumer’s auto offset commits? Keep in
> mind that I do not care about exactly-once, hence having messages
> replayed is perfectly fine.
>
>> On 26 Apr 2017, at 19:26, Cody Koeninger  wrote:
>>
>> What is it you're actually trying to accomplish?
>>
>> You can get topic, partition, and offset bounds from an offset range like
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
>>
>> Timestamp isn't really a meaningful idea for a range of offsets.
>>
>>
>> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
>>  wrote:
>>> Hi all,
>>>
>>> Because the Spark Streaming direct Kafka consumer maps offsets for a 
>>> given
>>> Kafka topic and a partition internally while having enable.auto.commit 
>>> set
>>> to false, how can I retrieve the offset of each consumer poll call made,
>>> using the offset ranges of an RDD? More precisely, the information I seek
>>> to get after each poll call is the following: <topic, offset, timestamp,
>>> partition>.
>>>
>>> Thanks in advance,
>>> Dominik
>>>
>
>>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Why "Initial job has not accepted any resources"?

2017-04-27 Thread Yuan Fang
Here is my code. It works locally with setMaster("local[*]"),
but it does not work against my remote Spark cluster. I checked all the logs
and did not find any errors.
It only shows the following warning. Could you please help? Thank you very
much!

14:45:47.956 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl -
Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources.




import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

import scala.math.random

object SparkPi {
  // logger is assumed to be an SLF4J logger; created here for completeness
  private val logger = LoggerFactory.getLogger(getClass)

  val sparkConf = new SparkConf()
    .setAppName("Spark Pi")
    .setMaster("spark://10.100.103.25:7077")
    //.setMaster("local[*]")

  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]) {
    val slices = 2
    val n = math.min(1L * slices, Int.MaxValue).toInt // avoid overflow
    val count = sc.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    val pi = 4.0 * count / (n - 1)
    logger.warn(s"Pi is roughly $pi")
  }
}



Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
Of course I am not asking to commit for every message. Instead, I am seeking
to commit the last consumed offset at a given interval. For example, from the
1st until the 5th second, messages up to offset 100.000 of partition 10 were
consumed; then from the 6th until the 10th second of execution, the last
consumed offset of the same partition was 200.000 - and so forth. This is the
information I seek to get.

> On 27 Apr 2017, at 20:11, Cody Koeninger  wrote:
> 
> Are you asking for commits for every message?  Because that will kill
> performance.
> 
> On Thu, Apr 27, 2017 at 11:33 AM, Dominik Safaric
>  wrote:
>> Indeed I have. But, even when storing the offsets in Spark and committing
>> offsets upon completion of an output operation within the foreachRDD call
>> (as pointed out in the example), the only offset that Spark’s Kafka
>> implementation commits to Kafka is the offset of the last message. For
>> example, if I have 100 million messages, then Spark will commit only the
>> 100 millionth offset, and not the offsets of the intermediate batches -
>> hence the question.
>> 
>>> On 26 Apr 2017, at 21:42, Cody Koeninger  wrote:
>>> 
>>> have you read
>>> 
>>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
>>> 
>>> On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
>>>  wrote:
 The reason why I want to obtain this information, i.e. <topic, partition,
 offset, timestamp> tuples, is to relate the consumption with the production
 rates using the __consumer_offsets Kafka internal topic. Interestingly,
 Spark’s KafkaConsumer implementation does not auto commit the offsets upon
 offset commit expiration, because as seen in the logs, Spark overrides the
 enable.auto.commit property to false.
 
 Any idea on how to use the KafkaConsumer’s auto offset commits? Keep in
 mind that I do not care about exactly-once, hence having messages replayed
 is perfectly fine.
 
> On 26 Apr 2017, at 19:26, Cody Koeninger  wrote:
> 
> What is it you're actually trying to accomplish?
> 
> You can get topic, partition, and offset bounds from an offset range like
> 
> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
> 
> Timestamp isn't really a meaningful idea for a range of offsets.
> 
> 
> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
>  wrote:
>> Hi all,
>> 
>> Because the Spark Streaming direct Kafka consumer maps offsets for a 
>> given
>> Kafka topic and a partition internally while having enable.auto.commit 
>> set
>> to false, how can I retrieve the offset of each consumer poll call made,
>> using the offset ranges of an RDD? More precisely, the information I seek
>> to get after each poll call is the following: <topic, offset, timestamp,
>> partition>.
>> 
>> Thanks in advance,
>> Dominik
>> 
 
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
Are you asking for commits for every message?  Because that will kill
performance.

On Thu, Apr 27, 2017 at 11:33 AM, Dominik Safaric
 wrote:
> Indeed I have. But, even when storing the offsets in Spark and committing
> offsets upon completion of an output operation within the foreachRDD call (as
> pointed out in the example), the only offset that Spark’s Kafka implementation
> commits to Kafka is the offset of the last message. For example, if I have
> 100 million messages, then Spark will commit only the 100 millionth offset,
> and not the offsets of the intermediate batches - hence the question.
>
>> On 26 Apr 2017, at 21:42, Cody Koeninger  wrote:
>>
>> have you read
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
>>
>> On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
>>  wrote:
>>> The reason why I want to obtain this information, i.e. <topic, partition,
>>> offset, timestamp> tuples, is to relate the consumption with the production
>>> rates using the __consumer_offsets Kafka internal topic. Interestingly,
>>> Spark’s KafkaConsumer implementation does not auto commit the offsets upon
>>> offset commit expiration, because as seen in the logs, Spark overrides the
>>> enable.auto.commit property to false.
>>>
>>> Any idea on how to use the KafkaConsumer’s auto offset commits? Keep in
>>> mind that I do not care about exactly-once, hence having messages replayed
>>> is perfectly fine.
>>>
 On 26 Apr 2017, at 19:26, Cody Koeninger  wrote:

 What is it you're actually trying to accomplish?

 You can get topic, partition, and offset bounds from an offset range like

 http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets

 Timestamp isn't really a meaningful idea for a range of offsets.


 On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
  wrote:
> Hi all,
>
> Because the Spark Streaming direct Kafka consumer maps offsets for a given
> Kafka topic and a partition internally while having enable.auto.commit set
> to false, how can I retrieve the offset of each consumer poll call made,
> using the offset ranges of an RDD? More precisely, the information I seek
> to get after each poll call is the following: <topic, offset, timestamp,
> partition>.
>
> Thanks in advance,
> Dominik
>
>>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-04-27 Thread Tim Smith
Hi,

I am trying to figure out the API to initialize a Gaussian mixture model
using either centroids created by K-means or a previously calculated GMM
model (I am aware that you can "save" a model and "load" it later, but I am
not interested in saving a model to a filesystem).

The Spark MLlib API lets you do this using setInitialModel:
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture

However, I cannot figure out how to do this using the Spark ML API. Can anyone
please point me in the right direction? I've tried reading the Spark ML
code and was wondering if the "set" call lets you do that.
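
For reference, with the RDD-based MLlib API the kind of initialization I am
after looks roughly like this (sketch only; "data" is an RDD[Vector] and
"initialModel" a previously computed GaussianMixtureModel):

  import org.apache.spark.mllib.clustering.GaussianMixture

  // Seed EM with an existing model instead of a random start.
  val gmm = new GaussianMixture()
    .setK(initialModel.k)
    .setInitialModel(initialModel)
    .run(data)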

--
Thanks,

Tim


Re: Calculate mode separately for multiple columns in row

2017-04-27 Thread Everett Anderson
For the curious, I played around with a UDAF for this (shown below). On the
downside, it assembles a Map of all possible values of the column that'll
need to be stored in memory somewhere.

I suspect some kind of sorted groupByKey + cogroup could stream values
through, though it might not support partial aggregation. Will try that
next.

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

/**
  * [[UserDefinedAggregateFunction]] for computing the mode of a string column.
  *
  * WARNING: This will assemble a Map of all possible values in memory.
  *
  * It'll ignore null values and return null if all values are null.
  */
class ModeAggregateFunction extends UserDefinedAggregateFunction {

  override def inputSchema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  override def bufferSchema: StructType = StructType(
    StructField("values", MapType(StringType, LongType, valueContainsNull = false)) :: Nil)

  override def dataType: DataType = StringType

  override def deterministic: Boolean = true

  // This is the initial value for your buffer schema.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Map[String, Long]()
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (input == null || input.getString(0) == null) {
      return
    }

    val value = input.getString(0)
    val frequencies = buffer.getAs[Map[String, Long]](0)
    val count = frequencies.getOrElse(value, 0L)

    buffer(0) = frequencies + (value -> (count + 1L))
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val frequencies1: Map[String, Long] = buffer1.getAs[Map[String, Long]](0)
    val frequencies2: Map[String, Long] = buffer2.getAs[Map[String, Long]](0)

    buffer1(0) = frequencies1 ++ frequencies2.map({
      case (k: String, v: Long) => k -> (v + frequencies1.getOrElse(k, 0L))
    })
  }

  override def evaluate(buffer: Row): Any = {
    val frequencies = buffer.getAs[Map[String, Long]](0)
    if (frequencies.isEmpty) {
      return null
    }
    frequencies.maxBy(_._2)._1
  }
}
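
Usage against the contrived example quoted below would then look something
like this (sketch):

  import org.apache.spark.sql.functions.col

  val mode = new ModeAggregateFunction()

  // One mode per column, per creature.
  val compacted = input
    .groupBy("creature")
    .agg(mode(col("color")).as("color"), mode(col("feature")).as("feature"))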




On Wed, Apr 26, 2017 at 10:21 AM, Everett Anderson  wrote:

> Hi,
>
> One common situation I run across is that I want to compact my data and
> select the mode (most frequent value) in several columns for each group.
>
> Even calculating mode for one column in SQL is a bit tricky. The ways I've
> seen usually involve a nested sub-select with a group by + count and then a
> window function using rank().
>
> However, what if you want to calculate the mode for several columns,
> producing a new row with the results? And let's say the set of columns is
> only known at runtime.
>
> In Spark SQL, I start going down a road of many self-joins. The more
> efficient way leads me to either RDD[Row] or Dataset[Row] where I could do
> a groupByKey + flatMapGroups, keeping state as I iterate over the Rows in
> each group.
>
> What's the best way?
>
> Here's a contrived example:
>
> val input = spark.sparkContext.parallelize(Seq(
> ("catosaur", "black", "claws"),
> ("catosaur", "orange", "scales"),
> ("catosaur", "black", "scales"),
> ("catosaur", "orange", "scales"),
> ("catosaur", "black", "spikes"),
> ("bearcopter", "gray", "claws"),
> ("bearcopter", "black", "fur"),
> ("bearcopter", "gray", "flight"),
> ("bearcopter", "gray", "flight")))
> .toDF("creature", "color", "feature")
>
> +--+--+---+
> |creature  |color |feature|
> +--+--+---+
> |catosaur  |black |claws  |
> |catosaur  |orange|scales |
> |catosaur  |black |scales |
> |catosaur  |orange|scales |
> |catosaur  |black |spikes |
> |bearcopter|gray  |claws  |
> |bearcopter|black |fur|
> |bearcopter|gray  |flight |
> |bearcopter|gray  |flight |
> +--+--+---+
>
> val expectedOutput = spark.sparkContext.parallelize(Seq(
> ("catosaur", "black", "scales"),
> ("bearcopter", "gray", "flight")))
> .toDF("creature", "color", "feature")
>
> +--+-+---+
> |creature  |color|feature|
> +--+-+---+
> |catosaur  |black|scales |
> |bearcopter|gray |flight |
> +--+-+---+
>
>
>
>
>
>


Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
Indeed I have. But, even when storing the offsets in Spark and committing
offsets upon completion of an output operation within the foreachRDD call (as
pointed out in the example), the only offset that Spark’s Kafka implementation
commits to Kafka is the offset of the last message. For example, if I have 100
million messages, then Spark will commit only the 100 millionth offset, and not
the offsets of the intermediate batches - hence the question.

> On 26 Apr 2017, at 21:42, Cody Koeninger  wrote:
> 
> have you read
> 
> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself
> 
> On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
>  wrote:
>> The reason why I want to obtain this information, i.e. <topic, partition,
>> offset, timestamp> tuples, is to relate the consumption with the production
>> rates using the __consumer_offsets Kafka internal topic. Interestingly, Spark’s
>> KafkaConsumer implementation does not auto commit the offsets upon offset
>> commit expiration, because as seen in the logs, Spark overrides the
>> enable.auto.commit property to false.
>> 
>> Any idea on how to use the KafkaConsumer’s auto offset commits? Keep in
>> mind that I do not care about exactly-once, hence having messages replayed
>> is perfectly fine.
>> 
>>> On 26 Apr 2017, at 19:26, Cody Koeninger  wrote:
>>> 
>>> What is it you're actually trying to accomplish?
>>> 
>>> You can get topic, partition, and offset bounds from an offset range like
>>> 
>>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
>>> 
>>> Timestamp isn't really a meaningful idea for a range of offsets.
>>> 
>>> 
>>> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
>>>  wrote:
 Hi all,
 
 Because the Spark Streaming direct Kafka consumer maps offsets for a given
 Kafka topic and a partition internally while having enable.auto.commit set
 to false, how can I retrieve the offset of each consumer poll call made,
 using the offset ranges of an RDD? More precisely, the information I seek
 to get after each poll call is the following: <topic, offset, timestamp,
 partition>.
 
 Thanks in advance,
 Dominik
 
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to create SparkSession using SparkConf?

2017-04-27 Thread kant kodali
Actually, one more question along the same line. This one is about
.getOrCreate().

JavaStreamingContext doesn't seem to have a way to accept a SparkSession
object, so does that mean a streaming context is not required? If so, how do
I pass a lambda to .getOrCreate using SparkSession? I mean the lambda that we
normally pass when we call StreamingContext.getOrCreate.
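
For what it's worth, my current understanding (untested sketch in Scala; the
checkpoint path and batch interval are placeholders) is that the streaming
context is still built on top of the session's SparkContext, and the creating
function is what getOrCreate replays from the checkpoint:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val spark = SparkSession.builder.appName("my-application").getOrCreate()

  def createContext(): StreamingContext = {
    // Reuse the SparkContext that backs the SparkSession.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")
    // ... define the DStream graph here ...
    ssc
  }

  val ssc = StreamingContext.getOrCreate("/tmp/checkpoint", createContext _)

Is that the intended pattern?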








On Thu, Apr 27, 2017 at 8:47 AM, kant kodali  wrote:

> Ahhh, thanks much! I miss my sparkConf.setJars function; instead I have to
> pass this hacky comma-separated list of jar names.
>
> On Thu, Apr 27, 2017 at 8:01 AM, Yanbo Liang  wrote:
>
>> Could you try the following way?
>>
>> val spark = 
>> SparkSession.builder.appName("my-application").config("spark.jars", "a.jar, 
>> b.jar").getOrCreate()
>>
>>
>> Thanks
>>
>> Yanbo
>>
>>
>> On Thu, Apr 27, 2017 at 9:21 AM, kant kodali  wrote:
>>
>>> I am using Spark 2.1 BTW.
>>>
>>> On Wed, Apr 26, 2017 at 3:22 PM, kant kodali  wrote:
>>>
 Hi All,

 I am wondering how to create a SparkSession using a SparkConf object.
 Although I can see that most of the key-value pairs we set in SparkConf can
 also be set on SparkSession or SparkSession.Builder, I don't see
 sparkConf.setJars, which is required, right? Because we want the driver jar
 to be distributed across the cluster whether we run it in client mode or
 cluster mode. So I am wondering how this is possible?

 Thanks!


>>>
>>
>


[Pyspark, Python 2.7] Executor hangup caused by Unicode error while logging uncaught exception in worker

2017-04-27 Thread Sebastian Nagel
Hi,

I've seen a hangup of a job (resp. one of the executors) if the message of an 
uncaught exception
contains bytes which cannot be properly decoded as Unicode characters. The last 
lines in the
executor logs were

PySpark worker failed with exception:
Traceback (most recent call last):
  File
"/data/1/yarn/local/usercache/ubuntu/appcache/application_1492496523387_0009/container_1492496523387_0009_01_06/pyspark.zip/pyspark/worker.py",
lin
e 178, in main
write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1386: 
ordinal not in range(128)

After that nothing happened for hours, no CPU used on the machine running the 
executor.
First seen with Spark on Yarn
 Spark 2.1.0, Scala 2.11.8
 Python 2.7.6
 Hadoop 2.6.0-cdh5.11.0

Reproduced with Spark 2.1.0 and Python 2.7.12 in local mode and traced down to 
this small script:
   https://gist.github.com/sebastian-nagel/310a5a5f39cc668fb71b6ace208706f7

Is this a known problem?

Of course, one may argue that the job would have failed anyway, but a hang-up
isn't that nice: on YARN it blocks resources (containers) until it is killed.


Thanks,
Sebastian


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to create SparkSession using SparkConf?

2017-04-27 Thread kant kodali
Ahhh, thanks much! I miss my sparkConf.setJars function; instead I have to
pass this hacky comma-separated list of jar names.

On Thu, Apr 27, 2017 at 8:01 AM, Yanbo Liang  wrote:

> Could you try the following way?
>
> val spark = 
> SparkSession.builder.appName("my-application").config("spark.jars", "a.jar, 
> b.jar").getOrCreate()
>
>
> Thanks
>
> Yanbo
>
>
> On Thu, Apr 27, 2017 at 9:21 AM, kant kodali  wrote:
>
>> I am using Spark 2.1 BTW.
>>
>> On Wed, Apr 26, 2017 at 3:22 PM, kant kodali  wrote:
>>
>>> Hi All,
>>>
>>> I am wondering how to create a SparkSession using a SparkConf object.
>>> Although I can see that most of the key-value pairs we set in SparkConf can
>>> also be set on SparkSession or SparkSession.Builder, I don't see
>>> sparkConf.setJars, which is required, right? Because we want the driver jar
>>> to be distributed across the cluster whether we run it in client mode or
>>> cluster mode. So I am wondering how this is possible?
>>>
>>> Thanks!
>>>
>>>
>>
>


Data Skew in Dataframe Groupby - Any suggestions?

2017-04-27 Thread KhajaAsmath Mohammed
Hi,

I am working on a requirement where I need to perform a groupBy on a set of
data and find the max value for each group.

GroupBy on the dataframe is resulting in skew, and the job runs for quite
a long time (actually more time than Hive and Impala take for one day's worth
of data).

Any suggestions on how to overcome this?

dataframe.groupBy(Constants.Datapoint.Vin,Constants.Datapoint.Utctime,Constants.Datapoint.ProviderDesc,Constants.Datapoint.Latitude,Constants.Datapoint.Longitude)
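
Spelled out with the aggregate, what I am trying to express is roughly the
following (the aggregated column name "value" is just illustrative, since it
is not shown above):

  import org.apache.spark.sql.functions.max

  val result = dataframe
    .groupBy(Constants.Datapoint.Vin, Constants.Datapoint.Utctime,
      Constants.Datapoint.ProviderDesc, Constants.Datapoint.Latitude,
      Constants.Datapoint.Longitude)
    .agg(max("value").as("maxValue"))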

*Note: *I have added coalesce and persisted data into memory and disk too;
still no improvement.

Thanks,
Asmath.


Re: help/suggestions to setup spark cluster

2017-04-27 Thread Cody Koeninger
You can just cap the cores used per job.

http://spark.apache.org/docs/latest/spark-standalone.html

http://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling
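
For example, something along these lines in each application's configuration
(standalone mode; the numbers are just placeholders):

  val conf = new org.apache.spark.SparkConf()
    .setAppName("hourly-batch")
    .set("spark.cores.max", "8")          // cap the total cores this app takes from the cluster
    .set("spark.executor.memory", "2g")   // per-executor memory

That way the streaming job, the hourly batch job and Zeppelin can share one
standalone cluster without any one of them grabbing every core.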

On Thu, Apr 27, 2017 at 1:07 AM, vincent gromakowski
 wrote:
> Spark standalone is not multi-tenant; you need one cluster per job. Maybe
> you can try fair scheduling and use one cluster, but I doubt it will be
> production ready...
>
> On 27 Apr 2017 at 5:28 AM, "anna stax"  wrote:
>>
>> Thanks Cody,
>>
>> As I already mentioned I am running spark streaming on EC2 cluster in
>> standalone mode. Now in addition to streaming, I want to be able to run
>> spark batch job hourly and adhoc queries using Zeppelin.
>>
>> Can you please confirm that a standalone cluster is OK for this. Please
>> provide me some links to help me get started.
>>
>> Thanks
>> -Anna
>>
>> On Wed, Apr 26, 2017 at 7:46 PM, Cody Koeninger 
>> wrote:
>>>
>>> The standalone cluster manager is fine for production.  Don't use Yarn
>>> or Mesos unless you already have another need for it.
>>>
>>> On Wed, Apr 26, 2017 at 4:53 PM, anna stax  wrote:
>>> > Hi Sam,
>>> >
>>> > Thank you for the reply.
>>> >
>>> > What do you mean by
>>> > I doubt people run spark in a single EC2 instance, certainly not in
>>> > production I don't think
>>> >
>>> > What is wrong in having a data pipeline on EC2 that reads data from
>>> > kafka,
>>> > processes using spark and outputs to cassandra? Please explain.
>>> >
>>> > Thanks
>>> > -Anna
>>> >
>>> > On Wed, Apr 26, 2017 at 2:22 PM, Sam Elamin 
>>> > wrote:
>>> >>
>>> >> Hi Anna
>>> >>
>>> >> There are a variety of options for launching spark clusters. I doubt
>>> >> people run spark in a single EC2 instance, certainly not in
>>> >> production I don't think.
>>> >>
>>> >> I don't have enough information of what you are trying to do but if
>>> >> you
>>> >> are just trying to set things up from scratch then I think you can
>>> >> just use
>>> >> EMR which will create a cluster for you and attach a zeppelin instance
>>> >> as
>>> >> well
>>> >>
>>> >>
>>> >> You can also use databricks for ease of use and very little management
>>> >> but
>>> >> you will pay a premium for that abstraction
>>> >>
>>> >>
>>> >> Regards
>>> >> Sam
>>> >> On Wed, 26 Apr 2017 at 22:02, anna stax  wrote:
>>> >>>
>>> >>> I need to setup a spark cluster for Spark streaming and scheduled
>>> >>> batch
>>> >>> jobs and adhoc queries.
>>> >>> Please give me some suggestions. Can this be done in standalone mode.
>>> >>>
>>> >>> Right now we have a spark cluster in standalone mode on AWS EC2
>>> >>> running
>>> >>> spark streaming application. Can we run spark batch jobs and zeppelin
>>> >>> on the
>>> >>> same. Do we need a better resource manager like Mesos?
>>> >>>
>>> >>> Are there any companies or individuals that can help in setting this
>>> >>> up?
>>> >>>
>>> >>> Thank you.
>>> >>>
>>> >>> -Anna
>>> >
>>> >
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to create SparkSession using SparkConf?

2017-04-27 Thread Yanbo Liang
Could you try the following way?

val spark = SparkSession.builder.appName("my-application").config("spark.jars",
"a.jar, b.jar").getOrCreate()


Thanks

Yanbo


On Thu, Apr 27, 2017 at 9:21 AM, kant kodali  wrote:

> I am using Spark 2.1 BTW.
>
> On Wed, Apr 26, 2017 at 3:22 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I am wondering how to create a SparkSession using a SparkConf object.
>> Although I can see that most of the key-value pairs we set in SparkConf can
>> also be set on SparkSession or SparkSession.Builder, I don't see
>> sparkConf.setJars, which is required, right? Because we want the driver jar
>> to be distributed across the cluster whether we run it in client mode or
>> cluster mode. So I am wondering how this is possible?
>>
>> Thanks!
>>
>>
>


Re: Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Yanbo Liang
What about JOIN your table with a map table?
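
For example, something along these lines (untested sketch in Scala; it assumes
the manufacturer values match the synonyms exactly and that "spark" is your
SparkSession; for brand names embedded inside longer sentences the join
condition would need a contains/regexp instead of equality):

  import org.apache.spark.sql.functions._
  import spark.implicits._

  // Small mapping table: synonym -> canonical name.
  val synonyms = Seq(
    ("Allen", "Apex Tool Group"),
    ("Armstrong", "Apex Tool Group"),
    ("DeWALT", "StanleyBlack")
  ).toDF("synonym", "canonicalName")

  // Broadcast join the mapping table and fall back to the original value
  // when there is no match.
  val replaced = dataFileContent
    .join(broadcast(synonyms), col("ManufacturerSource") === col("synonym"), "left_outer")
    .withColumn("ManufacturerSource", coalesce(col("canonicalName"), col("ManufacturerSource")))
    .drop("synonym", "canonicalName")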

On Thu, Apr 27, 2017 at 9:58 PM, Nishanth 
wrote:

> I am facing a major issue on replacement of Synonyms in my DataSet.
>
> I am trying to replace the synonym of the Brand names to its equivalent
> names.
>
> I have tried 2 methods to solve this issue.
>
> Method 1 (regexp_replace)
>
> Here i am using the regexp_replace method.
>
> Hashtable<String, String> manufacturerNames = new Hashtable<>();
>   Enumeration<String> names;
>   String str;
>   double bal;
>
>   manufacturerNames.put("Allen","Apex Tool Group");
>   manufacturerNames.put("Armstrong","Apex Tool Group");
>   manufacturerNames.put("Campbell","Apex Tool Group");
>   manufacturerNames.put("Lubriplate","Apex Tool Group");
>   manufacturerNames.put("Delta","Apex Tool Group");
>   manufacturerNames.put("Gearwrench","Apex Tool Group");
>   manufacturerNames.put("H.K. Porter","Apex Tool Group");
>   /*100 MORE*/
>   manufacturerNames.put("Stanco","Stanco Mfg");
>   manufacturerNames.put("Stanco","Stanco Mfg");
>   manufacturerNames.put("Standard Safety","Standard Safety
> Equipment Company");
>   manufacturerNames.put("Standard Safety","Standard Safety
> Equipment Company");
>
>
>
>   // Show all balances in hash table.
>   names = manufacturerNames.keys();
>   Dataset<Row> dataFileContent =
> sqlContext.load("com.databricks.spark.csv",
> options);
>
>
>   while(names.hasMoreElements()) {
>  str = (String) names.nextElement();
>  dataFileContent=dataFileContent.withColumn("ManufacturerSource",
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).
> toString()));
>   }
>   dataFileContent.show();
>
> I got to know that the amount of data is too huge for regexp_replace so
> got a solution to use UDF
> http://stackoverflow.com/questions/43413513/issue-in-
> regex-replace-in-apache-spark-java
>
>
> Method 2 (UDF)
>
> List<Row> data2 = Arrays.asList(
> RowFactory.create("Allen", "Apex Tool Group"),
> RowFactory.create("Armstrong","Apex Tool Group"),
> RowFactory.create("DeWALT","StanleyBlack")
> );
>
> StructType schema2 = new StructType(new StructField[] {
> new StructField("label2", DataTypes.StringType, false,
> Metadata.empty()),
> new StructField("sentence2", DataTypes.StringType, false,
> Metadata.empty())
> });
> Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2,
> schema2);
>
> UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
> private static final long serialVersionUID = -5239951370238629896L;
>
> @Override
> public Boolean call(String t1, String t2) throws Exception {
> return t1.contains(t2);
> }
> };
> spark.udf().register("contains", contains, DataTypes.BooleanType);
>
> UDF3<String, String, String, String> replaceWithTerm = new
> UDF3<String, String, String, String>() {
> private static final long serialVersionUID = -2882956931420910207L;
>
> @Override
> public String call(String t1, String t2, String t3) throws
> Exception {
> return t1.replaceAll(t2, t3);
> }
> };
> spark.udf().register("replaceWithTerm", replaceWithTerm,
> DataTypes.StringType);
>
> Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2,
> callUDF("contains", sentenceDataFrame.col("sentence"),
> sentenceDataFrame2.col("label2")))
> .withColumn("sentence_replaced",
> callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"),
> sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
> .select(col("sentence_replaced"));
>
> joined.show(false);
> }
>
>
> Input
>
> Allen Armstrong nishanth hemanth Allen
> shivu Armstrong nishanth
> shree shivu DeWALT
>
> Replacement of words
> The word in LHS has to replace with the words in RHS given in the input
> sentence
> Allen => Apex Tool Group
> Armstrong => Apex Tool Group
> DeWALT => StanleyBlack
>
>   Output
>
>   +-----+-----------------------------------------------------------------+
>   |label|sentence_replaced                                                |
>   +-----+-----------------------------------------------------------------+
>   |0    |Apex Tool Group Armstrong nishanth hemanth Apex Tool Group       |
>   |0    |Allen Apex Tool Group nishanth hemanth Allen                     |
>   |1    |shivu Apex Tool Group nishanth                                   |
>   |2    |shree shivu StanleyBlack                                         |
>   +-----+-----------------------------------------------------------------+
>
>   Expected Output
>
>   +-----+-----------------------------------------------------------------+
>   |label|sentence_replaced                                                |
>   +-----+-----------------------------------------------------------------+
>   |0    |Apex Tool Group Apex Tool Group nishanth hemanth Apex Tool Group |
>   |1    |shivu Apex Tool Group nishanth                                   |

Re: how to create List in pyspark

2017-04-27 Thread Yanbo Liang
You can try with UDF, like the following code snippet:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
df = spark.read.text("./README.md")
split_func = udf(lambda text: text.split(" "), ArrayType(StringType()))
df.withColumn("split_value", split_func("value")).show()

Thanks
Yanbo

On Tue, Apr 25, 2017 at 12:27 AM, Selvam Raman  wrote:

> documentDF = spark.createDataFrame([
>
> ("Hi I heard about Spark".split(" "), ),
>
> ("I wish Java could use case classes".split(" "), ),
>
> ("Logistic regression models are neat".split(" "), )
>
> ], ["text"])
>
>
> How can i achieve the same df while i am reading from source?
>
> doc = spark.read.text("/Users/rs/Desktop/nohup.out")
>
> how can i create array type with "sentences" column from
> doc(dataframe)
>
>
> The below one creates more than one column.
>
> rdd.map(lambda rdd: rdd[0]).map(lambda row:row.split(" "))
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


javaRDD to collectAsMap throws java.lang.NegativeArraySizeException

2017-04-27 Thread Manohar753


Hi All,
I am getting the below exception while converting my RDD to a Map. My data
size is hardly a 200 MB snappy file, and the code looks like this:

@SuppressWarnings("unchecked")
public Tuple2, String> getMatchData(String 
location,
String key) {
ParquetInputFormat.setReadSupportClass(this.getJob(),
(Class>) 
(Class)
AvroReadSupport.class);
JavaPairRDD avroRDD =
this.getSparkContext().newAPIHadoopFile(location,
(Class>) 
(Class)
ParquetInputFormat.class, Void.class,
GenericRecord.class, 
this.getJob().getConfiguration());
JavaPairRDD kv = avroRDD.mapToPair(new
MapAvroToKV(key, new SpelExpressionParser()));
Schema schema = kv.first()._2().getSchema();
JavaPairRDD bytesKV = kv.mapToPair(new
AvroToBytesFunction());
Map map = bytesKV.collectAsMap();
Map hashmap = new HashMap(map);
return new Tuple2<>(hashmap, schema.toString());
}


Please help me out if you have any thoughts. Thanks in advance.



04/27 03:13:18 INFO YarnClusterScheduler: Cancelling stage 11
04/27 03:13:18 INFO DAGScheduler: ResultStage 11 (collectAsMap at
DoubleClickSORJob.java:281) failed in 0.734 s due to Job aborted due to
stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 11.0 (TID 15, ip-172-31-50-58.ec2.internal, executor
1): java.io.IOException: java.lang.NegativeArraySizeException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:325)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:60)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:43)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:244)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$10.apply(TorrentBroadcast.scala:286)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
at
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:287)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:221)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
... 11 more

Driver stacktrace:
04/27 03:13:18 INFO DAGScheduler: Job 11 failed: collectAsMap at
DoubleClickSORJob.java:281, took 3.651447 s
04/27 03:13:18 ERROR ApplicationMaster: User class threw exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0
(TID 15, ip-172-31-50-58.ec2.internal, executor 1): java.io.IOException:
java.lang.NegativeArraySizeException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at 

Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Nishanth
I am facing a major issue on replacement of Synonyms in my DataSet.
I am trying to replace the synonym of the Brand names to its equivalent names.
I have tried 2 methods to solve this issue.
Method 1 (regexp_replace)
Here i am using the regexp_replace method.
  Hashtable<String, String> manufacturerNames = new Hashtable<>();
  Enumeration<String> names;
  String str;
  double bal;

  manufacturerNames.put("Allen","Apex Tool Group");
  manufacturerNames.put("Armstrong","Apex Tool Group");
  manufacturerNames.put("Campbell","Apex Tool Group");
  manufacturerNames.put("Lubriplate","Apex Tool Group");
  manufacturerNames.put("Delta","Apex Tool Group");
  manufacturerNames.put("Gearwrench","Apex Tool Group");
  manufacturerNames.put("H.K. Porter","Apex Tool Group");
  /*100 MORE*/
  manufacturerNames.put("Stanco","Stanco Mfg");
  manufacturerNames.put("Stanco","Stanco Mfg");
  manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
  manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");

  // Show all balances in hash table.
  names = manufacturerNames.keys();
  Dataset<Row> dataFileContent = sqlContext.load("com.databricks.spark.csv", options);

  while(names.hasMoreElements()) {
     str = (String) names.nextElement();
     dataFileContent = dataFileContent.withColumn("ManufacturerSource",
         regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));
  }
  dataFileContent.show();

I got to know that the amount of data is too huge for regexp_replace so got a
solution to use UDF:
http://stackoverflow.com/questions/43413513/issue-in-regex-replace-in-apache-spark-java

Method 2 (UDF)

List<Row> data2 = Arrays.asList(
    RowFactory.create("Allen", "Apex Tool Group"),
    RowFactory.create("Armstrong","Apex Tool Group"),
    RowFactory.create("DeWALT","StanleyBlack")
);

StructType schema2 = new StructType(new StructField[] {
    new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);

UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
    private static final long serialVersionUID = -5239951370238629896L;

    @Override
    public Boolean call(String t1, String t2) throws Exception {
        return t1.contains(t2);
    }
};
spark.udf().register("contains", contains, DataTypes.BooleanType);

UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
    private static final long serialVersionUID = -2882956931420910207L;

    @Override
    public String call(String t1, String t2, String t3) throws Exception {
        return t1.replaceAll(t2, t3);
    }
};
spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);

Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2,
        callUDF("contains", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2")))
    .withColumn("sentence_replaced",
        callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"),
            sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
    .select(col("sentence_replaced"));

joined.show(false);
}

Input

Allen Armstrong nishanth hemanth Allen
shivu Armstrong nishanth
shree shivu DeWALT

Replacement of words
The word in LHS has to replace with the words in RHS given in the input sentence
Allen => Apex Tool Group
Armstrong => Apex Tool Group
DeWALT => StanleyBlack

Output

+-----+-----------------------------------------------------------------+
|label|sentence_replaced                                                |
+-----+-----------------------------------------------------------------+
|0    |Apex Tool Group Armstrong nishanth hemanth Apex Tool Group       |
|0    |Allen Apex Tool Group nishanth hemanth Allen                     |
|1    |shivu Apex Tool Group nishanth                                   |
|2    |shree shivu StanleyBlack                                         |
+-----+-----------------------------------------------------------------+

Expected Output

+-----+-----------------------------------------------------------------+
|label|sentence_replaced                                                |
+-----+-----------------------------------------------------------------+
|0    |Apex Tool Group Apex Tool Group nishanth hemanth Apex Tool Group |
|1    |shivu Apex Tool Group nishanth                                   |

Re: Spark-SQL Query Optimization: overlapping ranges

2017-04-27 Thread Jacek Laskowski
Hi Shawn,

If you're asking me if Spark SQL should optimize such queries, I don't know.

If you're asking me if it's possible to convince Spark SQL to do so, I'd
say, sure, it is. Write your optimization rule and attach it to Optimizer
(using extraOptimizations extension point).
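
For example, a bare-bones skeleton (the range-merging logic itself is left as
a placeholder, and "spark" is your SparkSession):

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  object MergeOverlappingRanges extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
      // TODO: match Or(And(...), And(...)) predicates over the same column
      // and rewrite them with merged bounds.
      case expr => expr
    }
  }

  // Attach it to the session's optimizer.
  spark.experimental.extraOptimizations ++= Seq(MergeOverlappingRanges)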


Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Thu, Apr 27, 2017 at 3:22 PM, Lavelle, Shawn 
wrote:

> Hi Jacek,
>
>
>
>  I know that it is not currently doing so, but should it be?  The
> algorithm isn’t complicated and could be applied to both OR and AND logical
> operators with comparison operators as children.
>  My users write programs to generate queries that aren’t checked for
> this sort of thing. We’re probably going to write our own
> org.apache.spark.sql.catalyst.rules.Rule to handle it.
>
> ~ Shawn
>
>
>
> *From:* Jacek Laskowski [mailto:ja...@japila.pl]
> *Sent:* Wednesday, April 26, 2017 2:55 AM
> *To:* Lavelle, Shawn 
> *Cc:* user 
> *Subject:* Re: Spark-SQL Query Optimization: overlapping ranges
>
>
>
> explain it and you'll know what happens under the covers.
>
>
>
> i.e. Use explain on the Dataset.
>
>
>
> Jacek
>
>
>
> On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" 
> wrote:
>
> Hello Spark Users!
>
>Does the Spark Optimization engine reduce overlapping column ranges?
> If so, should it push this down to a Data Source?
>
>   Example,
>
> This:  Select * from table where col between 3 and 7 OR col between 5
> and 9
>
> Reduces to:  Select * from table where col between 3 and 9
>
>
>
>
>
>   Thanks for your insight!
>
>
> ~ Shawn M Lavelle
>
>
>
>
>
> *Shawn* *Lavelle*
> Software Development
>
> 4101 Arrowhead Drive
> Medina, Minnesota 55340-9457
> Phone: 763 551 0559
> Fax: 763 551 0750
> *Email:* *shawn.lave...@osii.com*
> *Website:* *www.osii.com*
>
>
>
>


RE: Spark-SQL Query Optimization: overlapping ranges

2017-04-27 Thread Lavelle, Shawn
Hi Jacek,

 I know that it is not currently doing so, but should it be?  The algorithm 
isn’t complicated and could be applied to both OR and AND logical operators 
with comparison operators as children.
 My users write programs to generate queries that aren’t checked for this 
sort of thing. We’re probably going to write our own 
org.apache.spark.sql.catalyst.rules.Rule to handle it.

~ Shawn

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Wednesday, April 26, 2017 2:55 AM
To: Lavelle, Shawn 
Cc: user 
Subject: Re: Spark-SQL Query Optimization: overlapping ranges

explain it and you'll know what happens under the covers.

i.e. Use explain on the Dataset.

Jacek

On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" 
> wrote:
Hello Spark Users!

   Does the Spark Optimization engine reduce overlapping column ranges?  If so, 
should it push this down to a Data Source?

  Example,
This:  Select * from table where col between 3 and 7 OR col between 5 and 9
Reduces to:  Select * from table where col between 3 and 9


  Thanks for your insight!

~ Shawn M Lavelle



Shawn Lavelle
Software Development

4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Fax: 763 551 0750
Email: shawn.lave...@osii.com
Website: www.osii.com




Re: Spark Testing Library Discussion

2017-04-27 Thread Sam Elamin
Hi

@Lucas I certainly would love to write an integration testing library for
workflows. I have a few ideas I would love to share with others; they are
focused around Airflow since that is what we use.


As promised here
 is
the first blog post in a series of posts I hope to write on how we build
data pipelines

Please feel free to retweet my original tweet
 and share because
the more ideas we have the better!

Feedback is always welcome!

Regards
Sam

On Tue, Apr 25, 2017 at 10:32 PM, lucas.g...@gmail.com  wrote:

> Hi all, whoever (Sam I think) was going to do some work on doing a
> template testing pipeline.  I'd love to be involved, I have a current task
> in my day job (data engineer) to flesh out our testing how-to / best
> practices for Spark jobs and I think I'll be doing something very similar
> for the next week or 2.
>
> I'll scrape out what i have now in the next day or so and put it up in a
> gist that I can share too.
>
> G
>
> On 25 April 2017 at 13:04, Holden Karau  wrote:
>
>> Urgh hangouts did something frustrating, updated link
>> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
>>
>> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau 
>> wrote:
>>
>>> The (tentative) link for those interested is https://hangouts.google.com
>>> /hangouts/_/oyjvcnffejcjhi6qazf3lysypue .
>>>
>>> On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau 
>>> wrote:
>>>
 So 14 people have said they are available on Tuesday the 25th at 1PM
 pacific so we will do this meeting then ( https://doodle.com/poll/69y6
 yab4pyf7u8bn ).

 Since hangouts tends to work ok on the Linux distro I'm running my
 default is to host this as a "hangouts-on-air" unless there are alternative
 ideas.

 I'll record the hangout and if it isn't terrible I'll post it for those
 who weren't able to make it (and for next time I'll include more European
 friendly time options - Doodle wouldn't let me update it once posted).

 On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau 
 wrote:

> Hi Spark Users (+ Some Spark Testing Devs on BCC),
>
> Awhile back on one of the many threads about testing in Spark there
> was some interest in having a chat about the state of Spark testing and
> what people want/need.
>
> So if you are interested in joining an online (with maybe an IRL
> component if enough people are SF based) chat about Spark testing please
> fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn
>
> I think reasonable topics of discussion could be:
>
> 1) What is the state of the different Spark testing libraries in the
> different core (Scala, Python, R, Java) and extended languages (C#,
> Javascript, etc.)?
> 2) How do we make these more easily discovered by users?
> 3) What are people looking for in their testing libraries that we are
> missing? (can be functionality, documentation, etc.)
> 4) Are there any examples of well tested open source Spark projects
> and where are they?
>
> If you have other topics that's awesome.
>
> To clarify this about libraries and best practices for people testing
> their Spark applications, and less about testing Spark's internals
> (although as illustrated by some of the libraries there is some strong
> overlap in what is required to make that work).
>
> Cheers,
>
> Holden :)
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


[Spark Core] Why SetAccumulator is buried in org.apache.spark.sql.execution.debug?

2017-04-27 Thread v . chesnokov
Currently SetAccumulator (which extends AccumulatorV2[T,java.util.Set[T]]) is a 
nested class of org.apache.spark.sql.execution.debug.DebugExec. I wonder why 
this quite useful class is buried there, in spark-sql, and not presented in 
org.apache.spark.util of spark-core.
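
For anyone who just needs the functionality in user code, a minimal equivalent
is easy to write against the public AccumulatorV2 API (untested sketch):

  import java.util.Collections
  import org.apache.spark.util.AccumulatorV2

  class SetAccumulator[T] extends AccumulatorV2[T, java.util.Set[T]] {
    private val _set = Collections.synchronizedSet(new java.util.HashSet[T]())

    override def isZero: Boolean = _set.isEmpty
    override def copy(): SetAccumulator[T] = {
      val newAcc = new SetAccumulator[T]()
      newAcc._set.addAll(_set)
      newAcc
    }
    override def reset(): Unit = _set.clear()
    override def add(v: T): Unit = _set.add(v)
    override def merge(other: AccumulatorV2[T, java.util.Set[T]]): Unit = _set.addAll(other.value)
    override def value: java.util.Set[T] = _set
  }

  // Register it like any other AccumulatorV2, e.g.:
  // sc.register(new SetAccumulator[String], "distinctNames")

Still, it would be nice to have it available out of the box.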


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: help/suggestions to setup spark cluster

2017-04-27 Thread vincent gromakowski
Spark standalone is not multi-tenant; you need one cluster per job. Maybe
you can try fair scheduling and use one cluster, but I doubt it will be
production ready...

On 27 Apr 2017 at 5:28 AM, "anna stax"  wrote:

> Thanks Cody,
>
> As I already mentioned I am running spark streaming on EC2 cluster in
> standalone mode. Now in addition to streaming, I want to be able to run
> spark batch job hourly and adhoc queries using Zeppelin.
>
> Can you please confirm that a standalone cluster is OK for this. Please
> provide me some links to help me get started.
>
> Thanks
> -Anna
>
> On Wed, Apr 26, 2017 at 7:46 PM, Cody Koeninger 
> wrote:
>
>> The standalone cluster manager is fine for production.  Don't use Yarn
>> or Mesos unless you already have another need for it.
>>
>> On Wed, Apr 26, 2017 at 4:53 PM, anna stax  wrote:
>> > Hi Sam,
>> >
>> > Thank you for the reply.
>> >
>> > What do you mean by
>> > I doubt people run spark in a single EC2 instance, certainly not in
>> > production I don't think
>> >
>> > What is wrong in having a data pipeline on EC2 that reads data from
>> kafka,
>> > processes using spark and outputs to cassandra? Please explain.
>> >
>> > Thanks
>> > -Anna
>> >
>> > On Wed, Apr 26, 2017 at 2:22 PM, Sam Elamin 
>> wrote:
>> >>
>> >> Hi Anna
>> >>
>> >> There are a variety of options for launching spark clusters. I doubt
>> >> people run spark in a single EC2 instance, certainly not in
>> >> production I don't think.
>> >>
>> >> I don't have enough information of what you are trying to do but if you
>> >> are just trying to set things up from scratch then I think you can
>> just use
>> >> EMR which will create a cluster for you and attach a zeppelin instance
>> as
>> >> well
>> >>
>> >>
>> >> You can also use databricks for ease of use and very little management
>> but
>> >> you will pay a premium for that abstraction
>> >>
>> >>
>> >> Regards
>> >> Sam
>> >> On Wed, 26 Apr 2017 at 22:02, anna stax  wrote:
>> >>>
>> >>> I need to setup a spark cluster for Spark streaming and scheduled
>> batch
>> >>> jobs and adhoc queries.
>> >>> Please give me some suggestions. Can this be done in standalone mode.
>> >>>
>> >>> Right now we have a spark cluster in standalone mode on AWS EC2
>> running
>> >>> spark streaming application. Can we run spark batch jobs and zeppelin
>> on the
>> >>> same. Do we need a better resource manager like Mesos?
>> >>>
>> >>> Are there any companies or individuals that can help in setting this
>> up?
>> >>>
>> >>> Thank you.
>> >>>
>> >>> -Anna
>> >
>> >
>>
>
>