Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
I think I figured it out. There is indeed “something deeper in Scala” :-)

abstract class A {
  def a: this.type
}

class AA(i: Int) extends A {
  def a = this
}
The above works OK. But if you return anything other than “this”, you will get 
a compile error.

abstract class A {
  def a: this.type
}

class AA(i: Int) extends A {
  def a = new AA(1)
}
Error:(33, 11) type mismatch;
 found   : com.dataorchard.datagears.AA
 required: AA.this.type
  def a = new AA(1)
  ^

So you have to do:

abstract class A[T <: A[T]]  {
  def a: T
}

class AA(i: Int) extends A[AA] {
  def a = new AA(1)
}
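
For completeness, a minimal sketch (not Spark's actual Model code; the class names below are made up) of why the F-bounded form is useful: methods declared in the base class can return the concrete subtype without casts.

abstract class Base[T <: Base[T]] {
  def copy(): T                      // subclasses get back their own concrete type
}

class LinearThing(val weights: Array[Double]) extends Base[LinearThing] {
  def copy(): LinearThing = new LinearThing(weights.clone())
}

object FBoundedDemo extends App {
  val thing: LinearThing = new LinearThing(Array(1.0, 2.0)).copy()  // no cast needed
  println(thing.weights.mkString(","))
}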


Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




> On Aug 30, 2016, at 9:51 PM, Mohit Jaggi  wrote:
> 
> thanks Sean. I am cross posting on dev to see why the code was written that 
> way. Perhaps, this.type doesn’t do what is needed.
> 
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com 
> 
> 
> 
> 
>> On Aug 30, 2016, at 2:08 PM, Sean Owen wrote:
>> 
>> I think it's imitating, for example, how Enum is declared in Java:
>> 
>> abstract class Enum<E extends Enum<E>>
>> 
>> this is done so that Enum can refer to the actual type of the derived
>> enum class when declaring things like public final int compareTo(E o)
>> to implement Comparable<E>. The type is redundant in a sense, because
>> you effectively have MyEnum extending Enum<MyEnum>.
>> 
>> Java allows this self-referential definition. However Scala has
>> "this.type" for this purpose and (unless I'm about to learn something
>> deeper about Scala) it would have been the better way to express this
>> so that Model methods can for example state that copy() returns a
>> Model of the same concrete type.
>> 
>> I don't know if it can be changed now without breaking compatibility
>> but you're welcome to give it a shot with MiMa to see. It does
>> compile, using this.type.
>> 
>> 
>> On Tue, Aug 30, 2016 at 9:47 PM, Mohit Jaggi wrote:
>>> Folks,
>>> I am having a bit of trouble understanding the following:
>>> 
>>> abstract class Model[M <: Model[M]]
>>> 
>>> Why is M <: Model[M]?
>>> 
>>> Cheers,
>>> Mohit.
>>> 
> 



Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
thanks Sean. I am cross posting on dev to see why the code was written that
way. Perhaps, this.type doesn’t do what is needed.

Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




On Aug 30, 2016, at 2:08 PM, Sean Owen  wrote:

I think it's imitating, for example, how Enum is declared in Java:

abstract class Enum<E extends Enum<E>>

this is done so that Enum can refer to the actual type of the derived
enum class when declaring things like public final int compareTo(E o)
to implement Comparable<E>. The type is redundant in a sense, because
you effectively have MyEnum extending Enum<MyEnum>.

Java allows this self-referential definition. However Scala has
"this.type" for this purpose and (unless I'm about to learn something
deeper about Scala) it would have been the better way to express this
so that Model methods can for example state that copy() returns a
Model of the same concrete type.

I don't know if it can be changed now without breaking compatibility
but you're welcome to give it a shot with MiMa to see. It does
compile, using this.type.


On Tue, Aug 30, 2016 at 9:47 PM, Mohit Jaggi  wrote:

Folks,
I am having a bit of trouble understanding the following:

abstract class Model[M <: Model[M]]

Why is M <: Model[M]?

Cheers,
Mohit.


Iterative mapWithState

2016-08-30 Thread Matt Smith
Is it possible to use mapWithState iteratively?

In other words, I would like to keep calling mapWithState with the output
from the last mapWithState until there is no output.  For a given minibatch
mapWithState could be called anywhere from 1..200ish times depending on the
input/current state.


Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi Bryan,
Thanks for the reply.
I am indexing 5 columns, then using these indexed columns to generate the
"feature" column through the vector assembler.
Which essentially means that I cannot use *fit()* directly on the
"completeDataset" dataframe, since it will have neither the "feature" column
nor the 5 indexed columns.
Of course there is a dirty way of doing this, but I am wondering if there is
some optimized/intelligent approach for this.

Thanks,
Baahu

On Wed, Aug 31, 2016 at 3:30 AM, Bryan Cutler  wrote:

> You need to first fit just the VectorIndexer which returns the model, then
> add the model to the pipeline where it will only transform.
>
> val featureVectorIndexer = new VectorIndexer()
> .setInputCol("feature")
> .setOutputCol("indexedfeature")
> .setMaxCategories(180)
> .fit(completeDataset)
>
> On Tue, Aug 30, 2016 at 9:57 AM, Bahubali Jain  wrote:
>
>> Hi,
>> I had run into similar exception " java.util.NoSuchElementException: key
>> not found: " .
>> After further investigation I realized it is happening due to
>> vectorindexer being executed on training dataset and not on entire dataset.
>>
>> In the dataframe I have 5 categories , each of these have to go thru
>> stringindexer and then these are put thru a vector indexer to generate
>> feature vector.
>> What is the right way to do this, so that vector indexer can be run on
>> the entire data and not just on training data?
>>
>> Below is the current approach, as evident  VectorIndexer is being
>> generated based on the training set.
>>
>> Please Note: fit() on Vectorindexer cannot be called on entireset
>> dataframe since it doesn't have the required column(*feature *column is
>> being generated dynamically in pipeline execution)
>> How can the vectorindexer be *fit()* on the entireset?
>>
>> val col1_indexer = new StringIndexer().setInputCol("col1").setOutputCol("indexed_col1")
>> val col2_indexer = new StringIndexer().setInputCol("col2").setOutputCol("indexed_col2")
>> val col3_indexer = new StringIndexer().setInputCol("col3").setOutputCol("indexed_col3")
>> val col4_indexer = new StringIndexer().setInputCol("col4").setOutputCol("indexed_col4")
>> val col5_indexer = new StringIndexer().setInputCol("col5").setOutputCol("indexed_col5")
>>
>> val featureArray = Array("indexed_col1","indexed_col2","indexed_col3","indexed_col4","indexed_col5")
>> val vectorAssembler = new VectorAssembler().setInputCols(featureArray).setOutputCol("*feature*")
>> val featureVectorIndexer = new VectorIndexer()
>>   .setInputCol("feature")
>>   .setOutputCol("indexedfeature")
>>   .setMaxCategories(180)
>>
>> val decisionTree = new DecisionTreeClassifier().setMaxBins(300).setMaxDepth(1).setImpurity("entropy").setLabelCol("indexed_user_action").setFeaturesCol("indexedfeature").setPredictionCol("prediction")
>>
>> val pipeline = new Pipeline().setStages(Array(col1_indexer, col2_indexer, col3_indexer, col4_indexer, col5_indexer, vectorAssembler, featureVectorIndexer, decisionTree))
>> val model = pipeline.*fit(trainingSet)*
>> val output = model.transform(cvSet)
>>
>>
>> Thanks,
>> Baahu
>>
>> On Fri, Jul 8, 2016 at 11:24 PM, Bryan Cutler  wrote:
>>
>>> Hi Rich,
>>>
>>> I looked at the notebook and it seems like you are fitting the
>>> StringIndexer and VectorIndexer to only the training data, and it should
>>> be the entire data set.  So if the training data does not include all of
>>> the labels and an unknown label appears in the test data during evaluation,
>>> then it will not know how to index it.  So your code should be like this,
>>> fit with 'digits' instead of 'training'
>>>
>>> val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(digits)
>>> // Automatically identify categorical features, and index them.
>>> // Set maxCategories so features with > 4 distinct values are treated as continuous.
>>> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(digits)
>>>
>>> Hope that helps!
>>>
>>> On Fri, Jul 1, 2016 at 9:24 AM, Rich Tarro  wrote:
>>>
 Hi Bryan.

 Thanks for your continued help.

 Here is the code shown in a Jupyter notebook. I figured this was easier
 than cutting and pasting the code into an email. If you would like me to
 send you the code in a different format, let me know. The necessary data is
 all downloaded within the notebook itself.

 https://console.ng.bluemix.net/data/notebooks/fe7e578a-401f-
 4744-a318-b1b6bcf6f5f8/view?access_token=2f6df7b1dfcb3c1c2d9
 4a794506bb282729dab8f05118fafe5f11886326e02fc

 A few additional pieces of information.

 1. The training dataset is cached before training the model. If you do
 not cache the training dataset, the model will not train. The code
 model.transform(test) fails 

Best way to share state in a streaming cluster

2016-08-30 Thread C. Josephson
We have a timestamped input stream and we need to share the latest
processed timestamp across spark master and slaves. This will be
monotonically increasing over time. What is the easiest way to share state
across spark machines?

An accumulator is very close to what we need, but since only the driver
program can read the accumulator’s value, it won't work. Any suggestions?

Thanks,
-C


Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
thanks Sean. I am cross posting on dev to see why the code was written that 
way. Perhaps, this.type doesn’t do what is needed.

Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com




> On Aug 30, 2016, at 2:08 PM, Sean Owen  wrote:
> 
> I think it's imitating, for example, how Enum is declared in Java:
> 
> abstract class Enum<E extends Enum<E>>
> 
> this is done so that Enum can refer to the actual type of the derived
> enum class when declaring things like public final int compareTo(E o)
> to implement Comparable<E>. The type is redundant in a sense, because
> you effectively have MyEnum extending Enum<MyEnum>.
> 
> Java allows this self-referential definition. However Scala has
> "this.type" for this purpose and (unless I'm about to learn something
> deeper about Scala) it would have been the better way to express this
> so that Model methods can for example state that copy() returns a
> Model of the same concrete type.
> 
> I don't know if it can be changed now without breaking compatibility
> but you're welcome to give it a shot with MiMa to see. It does
> compile, using this.type.
> 
> 
> On Tue, Aug 30, 2016 at 9:47 PM, Mohit Jaggi  wrote:
>> Folks,
>> I am having a bit of trouble understanding the following:
>> 
>> abstract class Model[M <: Model[M]]
>> 
>> Why is M <: Model[M]?
>> 
>> Cheers,
>> Mohit.
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to use custom class in DataSet

2016-08-30 Thread Jakob Odersky
Implementing custom encoders is unfortunately not well supported at
the moment (IIRC there are plans to eventually add an api for user
defined encoders).

That being said, there are a couple of encoders that can work with
generic, serializable data types: "javaSerialization" and "kryo",
found here 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Encoders$.
These encoders need to be specified explicitly, as in
"spark.createDataset(...)(Encoders.javaSerialization)"

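For illustration, a minimal sketch of that explicit-encoder route (Spark 2.0 SparkSession API assumed; the class and object names are made up):

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// A plain (non-case) class, made Serializable so the generic encoders can handle it.
class Thing(val id: Int, val name: String) extends Serializable

object KryoEncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("kryo-encoder").getOrCreate()

    // Pass the encoder explicitly; Encoders.javaSerialization[Thing] works the same way.
    implicit val thingEncoder: Encoder[Thing] = Encoders.kryo[Thing]
    val ds = spark.createDataset(Seq(new Thing(1, "a"), new Thing(2, "b")))

    // Rows are stored as opaque binary blobs, so columns aren't inspectable,
    // but Dataset operations still work.
    println(ds.count())
    spark.stop()
  }
}
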
In Spark 2.1 there will also be a special trait
"org.apache.spark.sql.catalyst.DefinedByConstructorParams" that can be
mixed into arbitrary classes and that has implicit encoders available.

If you don't control the source of the class in question and it is not
serializable, it may still be possible to define your own Encoder by
implementing your own "o.a.s.sql.catalyst.encoders.ExpressionEncoder".
However, that requires quite some knowledge on how Spark's SQL
optimizer (catalyst) works internally and I don't think there is much
documentation on that.

regards,
--Jakob

On Mon, Aug 29, 2016 at 10:39 PM, canan chen  wrote:
>
> e.g. I have a custom class A (not case class), and I'd like to use it as
> DataSet[A]. I guess I need to implement Encoder for this, but didn't find
> any example for that, is there any document for that ? Thanks
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Random Forest Classification

2016-08-30 Thread Bryan Cutler
You need to first fit just the VectorIndexer which returns the model, then
add the model to the pipeline where it will only transform.

val featureVectorIndexer = new VectorIndexer()
.setInputCol("feature")
.setOutputCol("indexedfeature")
.setMaxCategories(180)
.fit(completeDataset)
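
To spell that out a bit more concretely, here is a sketch only (reusing the column names and stages from the thread below; completeDataset, trainingSet and cvSet are assumed to be the existing DataFrames): fit the indexers on the complete data first, then put the already-fitted models into the pipeline so that pipeline.fit(trainingSet) only transforms with them.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}

// Fit the StringIndexers on the complete dataset (not just the training split).
val stringIndexerModels = (1 to 5).map { i =>
  new StringIndexer().setInputCol(s"col$i").setOutputCol(s"indexed_col$i").fit(completeDataset)
}

// Materialize the indexed columns once so the assembled "feature" column exists
// when the VectorIndexer is fitted.
val indexed = stringIndexerModels.foldLeft(completeDataset)((df, m) => m.transform(df))

val vectorAssembler = new VectorAssembler()
  .setInputCols((1 to 5).map(i => s"indexed_col$i").toArray)
  .setOutputCol("feature")

val featureVectorIndexerModel = new VectorIndexer()
  .setInputCol("feature")
  .setOutputCol("indexedfeature")
  .setMaxCategories(180)
  .fit(vectorAssembler.transform(indexed))

val decisionTree = new DecisionTreeClassifier()
  .setMaxBins(300).setMaxDepth(1).setImpurity("entropy")
  .setLabelCol("indexed_user_action").setFeaturesCol("indexedfeature").setPredictionCol("prediction")

// Already-fitted models only transform inside the pipeline; only the classifier is fitted here.
val stages: Array[PipelineStage] = stringIndexerModels.toArray[PipelineStage] ++
  Array[PipelineStage](vectorAssembler, featureVectorIndexerModel, decisionTree)
val model = new Pipeline().setStages(stages).fit(trainingSet)
val output = model.transform(cvSet)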

On Tue, Aug 30, 2016 at 9:57 AM, Bahubali Jain  wrote:

> Hi,
> I had run into similar exception " java.util.NoSuchElementException: key
> not found: " .
> After further investigation I realized it is happening due to
> vectorindexer being executed on training dataset and not on entire dataset.
>
> In the dataframe I have 5 categories , each of these have to go thru
> stringindexer and then these are put thru a vector indexer to generate
> feature vector.
> What is the right way to do this, so that vector indexer can be run on the
> entire data and not just on training data?
>
> Below is the current approach, as evident  VectorIndexer is being
> generated based on the training set.
>
> Please Note: fit() on Vectorindexer cannot be called on entireset
> dataframe since it doesn't have the required column(*feature *column is
> being generated dynamically in pipeline execution)
> How can the vectorindexer be *fit()* on the entireset?
>
> val col1_indexer = new StringIndexer().setInputCol("col1").setOutputCol("indexed_col1")
> val col2_indexer = new StringIndexer().setInputCol("col2").setOutputCol("indexed_col2")
> val col3_indexer = new StringIndexer().setInputCol("col3").setOutputCol("indexed_col3")
> val col4_indexer = new StringIndexer().setInputCol("col4").setOutputCol("indexed_col4")
> val col5_indexer = new StringIndexer().setInputCol("col5").setOutputCol("indexed_col5")
>
> val featureArray = Array("indexed_col1","indexed_col2","indexed_col3","indexed_col4","indexed_col5")
> val vectorAssembler = new VectorAssembler().setInputCols(featureArray).setOutputCol("*feature*")
> val featureVectorIndexer = new VectorIndexer()
>   .setInputCol("feature")
>   .setOutputCol("indexedfeature")
>   .setMaxCategories(180)
>
> val decisionTree = new DecisionTreeClassifier().setMaxBins(300).setMaxDepth(1).setImpurity("entropy").setLabelCol("indexed_user_action").setFeaturesCol("indexedfeature").setPredictionCol("prediction")
>
> val pipeline = new Pipeline().setStages(Array(col1_indexer, col2_indexer, col3_indexer, col4_indexer, col5_indexer, vectorAssembler, featureVectorIndexer, decisionTree))
> val model = pipeline.*fit(trainingSet)*
> val output = model.transform(cvSet)
>
>
> Thanks,
> Baahu
>
> On Fri, Jul 8, 2016 at 11:24 PM, Bryan Cutler  wrote:
>
>> Hi Rich,
>>
>> I looked at the notebook and it seems like you are fitting the
>> StringIndexer and VectorIndexer to only the training data, and it should
>> be the entire data set.  So if the training data does not include all of
>> the labels and an unknown label appears in the test data during evaluation,
>> then it will not know how to index it.  So your code should be like this,
>> fit with 'digits' instead of 'training'
>>
>> val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(digits)
>> // Automatically identify categorical features, and index them.
>> // Set maxCategories so features with > 4 distinct values are treated as continuous.
>> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(digits)
>>
>> Hope that helps!
>>
>> On Fri, Jul 1, 2016 at 9:24 AM, Rich Tarro  wrote:
>>
>>> Hi Bryan.
>>>
>>> Thanks for your continued help.
>>>
>>> Here is the code shown in a Jupyter notebook. I figured this was easier
>>> than cutting and pasting the code into an email. If you would like me to
>>> send you the code in a different format, let me know. The necessary data is
>>> all downloaded within the notebook itself.
>>>
>>> https://console.ng.bluemix.net/data/notebooks/fe7e578a-401f-
>>> 4744-a318-b1b6bcf6f5f8/view?access_token=2f6df7b1dfcb3c1c2
>>> d94a794506bb282729dab8f05118fafe5f11886326e02fc
>>>
>>> A few additional pieces of information.
>>>
>>> 1. The training dataset is cached before training the model. If you do
>>> not cache the training dataset, the model will not train. The code
>>> model.transform(test) fails with a similar error. No other changes besides
>>> caching or not caching. Again, with the training dataset cached, the model
>>> can be successfully trained as seen in the notebook.
>>>
>>> 2. I have another version of the notebook where I download the same data
>>> in libsvm format rather than csv. That notebook works fine. All the code is
>>> essentially the same accounting for the difference in file formats.
>>>
>>> 3. I tested this same code on another Spark cloud platform and it
>>> displays the same symptoms when run there.
>>>
>>> Thanks.
>>> Rich
>>>
>>>
>>> On Wed, Jun 29, 2016 at 12:59 AM, Bryan Cutler 
>>> 

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread ayan guha
Given that Record Service is yet to be added to the main distributions, I believe
the only available solution now is to use HDFS ACLs to restrict access for
Spark.
On 31 Aug 2016 03:07, "Mich Talebzadeh"  wrote:

> Have you checked using views in Hive to restrict user access to certain
> tables and columns only.
>
> Have a look at this link
> 
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 16:26, Deepak Sharma  wrote:
>
>> Is it possible to execute any query using SQLContext even if the DB is
>> secured using roles or tools such as Sentry?
>>
>> Thanks
>> Deepak
>>
>> On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan > > wrote:
>>
>>> Hi All,
>>>
>>> In our YARN cluster, we have setup spark 1.6.1 , we plan to give access
>>> to all the end users/developers/BI users, etc. But we learnt any valid user
>>> after getting their own user kerb TGT, can get hold of sqlContext (in
>>> program or in shell) and can run any query against any secure databases.
>>>
>>> This puts us in a critical condition as we do not want to give blanket
>>> permission to everyone.
>>>
>>>
>>>
>>> We are looking forward to:
>>>
>>> 1)  A *solution or a work around, by which we can give secure
>>> access only to the selected users to sensitive tables/database.*
>>>
>>> 2)  *Failing to do so, we would like to remove/disable the SparkSQL
>>> context/feature for everyone.  *
>>>
>>>
>>>
>>> Any pointers in this direction will be very valuable.
>>>
>>> Thank you,
>>>
>>> Arpan
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>


Re: reuse the Spark SQL internal metrics

2016-08-30 Thread Jacek Laskowski
Hi,

If the stats are in the web UI, they should be flying over the wire, so
you can catch the events by implementing SparkListener [1] -- a
developer API for custom Spark listeners. That's how the web UI and the
History Server get the data. I think the stats are sent as accumulator
updates in onExecutorMetricsUpdate.

[1] 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener
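
For instance, a bare-bones listener along these lines (a sketch assuming the Spark 2.0 listener API; the class name is made up, it reads the accumulables at stage completion rather than in onExecutorMetricsUpdate, and a real exporter would push to Graphite instead of printing):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// SQL metrics such as numOutputRows surface as named stage accumulables,
// so a listener can observe them as each stage completes.
class SqlMetricsListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    info.accumulables.values.foreach { acc =>
      println(s"stage=${info.stageId} name=${acc.name} value=${acc.value}")
    }
  }
}

// Register it on the SparkContext, e.g. spark.sparkContext.addSparkListener(new SqlMetricsListener)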

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 30, 2016 at 11:17 PM, Ai Deng  wrote:
> Hi there,
>
> I think the metrics inside of the different SparkPlan (like "numOutputRows"
> in FilterExec) are useful to build any Dev dashboard or business monitoring.
> Are there a easy way or exist solution to expose and persist these metrics
> out of Spark UI (ex: send to Graphite)? Currently they are all "private"
> inside Spark library.
>
> And the main benefit is you can get these metrics for "free" without change
> your Spark application.
>
> Thanks and regards,
>
> Ai
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/reuse-the-Spark-SQL-internal-metrics-tp27626.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



reuse the Spark SQL internal metrics

2016-08-30 Thread Ai Deng
Hi there,

I think the metrics inside of the different SparkPlan (like "numOutputRows"
in FilterExec) are useful to build any Dev dashboard or business monitoring.
Is there an easy way or an existing solution to expose and persist these metrics
outside of the Spark UI (e.g. send them to Graphite)? Currently they are all "private"
inside the Spark library.

And the main benefit is that you can get these metrics for "free" without changing
your Spark application.

Thanks and regards,

Ai



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/reuse-the-Spark-SQL-internal-metrics-tp27626.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Model abstract class in spark ml

2016-08-30 Thread Sean Owen
I think it's imitating, for example, how Enum is declared in Java:

abstract class Enum<E extends Enum<E>>

this is done so that Enum can refer to the actual type of the derived
enum class when declaring things like public final int compareTo(E o)
to implement Comparable<E>. The type is redundant in a sense, because
you effectively have MyEnum extending Enum<MyEnum>.

Java allows this self-referential definition. However Scala has
"this.type" for this purpose and (unless I'm about to learn something
deeper about Scala) it would have been the better way to express this
so that Model methods can for example state that copy() returns a
Model of the same concrete type.

I don't know if it can be changed now without breaking compatibility
but you're welcome to give it a shot with MiMa to see. It does
compile, using this.type.
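
As a small illustration of that this.type alternative (a sketch only, not Spark's code): it works as long as the method really returns `this`, which is the catch discussed in the other branch of this thread.

// this.type preserves the concrete subtype for fluent methods that return `this`.
abstract class BaseModel {
  def setName(name: String): this.type
}

class MyModel extends BaseModel {
  private var name: String = ""
  def setName(n: String): this.type = { name = n; this }
  override def toString: String = s"MyModel($name)"
}

object ThisTypeDemo extends App {
  val m: MyModel = new MyModel().setName("lm")  // concrete type is preserved, no cast
  println(m)
}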


On Tue, Aug 30, 2016 at 9:47 PM, Mohit Jaggi  wrote:
> Folks,
> I am having a bit of trouble understanding the following:
>
> abstract class Model[M <: Model[M]]
>
> Why is M <: Model[M]?
>
> Cheers,
> Mohit.
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
Folks,
I am having a bit of trouble understanding the following:

abstract class Model[M <: Model[M]]

Why is M <: Model[M]?

Cheers,
Mohit.


Re: Dynamic Allocation & Spark Streaming

2016-08-30 Thread Liren Ding
It's has been a while since last update on the thread. Now Spark 2.0 is
available, do you guys know if there's any progress on Dynamic Allocation &
Spark Streaming?

On Mon, Oct 19, 2015 at 1:13 PM, robert towne 
wrote:

> I have watched a few videos from Databricks/Andrew Or around the Spark 1.2
> release and it seemed that dynamic allocation was not yet available for
> Spark Streaming.
>
> I now see SPARK-10955  
> which
> is tied to 1.5.2 and allows disabling of Spark Streaming with dynamic
> allocation.
>
> I use Spark Streaming with a receiverless/direct Kafka connection.  When I
> start up an app reading from the beginning of the topic I would like to
> have more resources than once I have caught up.  Is it possible to use
> dynamic allocation for this use case?
>
> thanks,
> Robert
>


Spark build 1.6.2 error

2016-08-30 Thread Diwakar Dhanuskodi
Hi,

While building Spark 1.6.2 , getting below error in spark-sql. Much
appreciate for any help.

ERROR] missing or invalid dependency detected while loading class file
'WebUI.class'.
Could not access term eclipse in package org,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'WebUI.class' was compiled against an
incompatible version of org.
[ERROR] missing or invalid dependency detected while loading class file
'WebUI.class'.
Could not access term jetty in value org.eclipse,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'WebUI.class' was compiled against an
incompatible version of org.eclipse.
[WARNING] 17 warnings found
[ERROR] two errors found
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [4.399s]
[INFO] Spark Project Test Tags ... SUCCESS [3.443s]
[INFO] Spark Project Launcher  SUCCESS [10.131s]
[INFO] Spark Project Networking .. SUCCESS [11.849s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [6.641s]
[INFO] Spark Project Unsafe .. SUCCESS [19.765s]
[INFO] Spark Project Core  SUCCESS
[4:16.511s]
[INFO] Spark Project Bagel ... SUCCESS [13.401s]
[INFO] Spark Project GraphX .. SUCCESS
[1:08.824s]
[INFO] Spark Project Streaming ... SUCCESS
[2:18.844s]
[INFO] Spark Project Catalyst  SUCCESS
[2:43.695s]
[INFO] Spark Project SQL . FAILURE
[1:01.762s]
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project Docker Integration Tests  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Shuffle Service  SKIPPED
[INFO] Spark Project YARN  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External Flume Assembly . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External MQTT Assembly .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] Spark Project External Kafka Assembly . SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 12:40.525s
[INFO] Finished at: Wed Aug 31 01:56:50 IST 2016
[INFO] Final Memory: 71M/830M
[INFO]

[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
on project spark-sql_2.11: Execution scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
-> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn  -rf :spark-sql_2.11


Re: Issues with Spark On Hbase Connector and versions

2016-08-30 Thread Weiqing Yang
The PR will be reviewed soon.

Thanks,
Weiqing

From: Sachin Jain
Date: Sunday, August 28, 2016 at 11:12 PM
To: spats
Cc: user
Subject: Re: Issues with Spark On Hbase Connector and versions

There is a connection leak problem with the Hortonworks HBase connector if you use 
HBase 1.2.0. I tried to use Hortonworks' connector and ran into the same problem.

Have a look at this HBase issue, HBASE-16017 [0]. The fix for this was 
backported to 1.3.0, 1.4.0 and 2.0.0.
I have raised a ticket on their GitHub repo [1] and also opened a PR to get 
this fixed. Check this out [2].

But unfortunately no one has responded to it yet, so it is not merged. The 
fix works and I am currently using it without problems.
So if you want to use it, you can use this branch [3], which already contains the 
pull request changes.

Hope it helps!!

[0]: https://issues.apache.org/jira/browse/HBASE-16017
[1]: https://github.com/hortonworks-spark/shc/issues/19
[2]: https://github.com/hortonworks-spark/shc/pull/20
[3]: https://github.com/sachinjain024/shc/tree/Issue-19-Connection-Leak

PS: Cross posting my answer from hbase user mailing list because I think it may 
be helpful to other readers.

On Sat, Aug 27, 2016 at 5:17 PM, spats wrote:
Regarding hbase connector by hortonworks
https://github.com/hortonworks-spark/shc, it would be great if someone can
answer these

1. What versions of HBase & Spark are expected? I could not run the examples
provided using Spark 1.6.0 & HBase 1.2.0.
2. I get an error when I run the example provided here,
any pointers on what I am doing wrong?

It looks like Spark is not reading hbase-site.xml, even though I passed it in --files to
spark-shell,
e.g. --files
/etc/hbase/conf/hbase-site.xml,/etc/hbase/conf/hdfs-site.xml,/etc/hbase/conf/core-site.xml

error
16/08/27 12:35:00 WARN zookeeper.ClientCnxn: Session 0x0 for server null,
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-Spark-On-Hbase-Connector-and-versions-tp27610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org




newlines inside csv quoted values

2016-08-30 Thread Koert Kuipers
I noticed, much to my surprise, that Spark CSV supports newlines inside
quoted values. OK, that's cool.

But how does this work with splitting files when reading? I assume
splitting is still simply done on newlines or something similar. Wouldn't
that potentially split in the middle of a record?


Re: S3A + EMR failure when writing Parquet?

2016-08-30 Thread Steve Loughran

On 29 Aug 2016, at 18:18, Everett Anderson wrote:

Okay, I don't think it's really just an S3A issue anymore. I can run the job 
using fs.s3.impl/spark.hadoop.fs.s3.impl set to the S3A impl as a --conf param 
from the EMR console successfully as well.

The problem seems related to the fact that we're trying to spark-submit jobs to 
a YARN cluster from outside the cluster itself.

The docs suggest one 
must copy the Hadoop/YARN config XML outside of the cluster to do this, which 
feels gross, but it's what we did. We had changed fs.s3.impl to use S3A in that 
config, and that seems to result in the failure, though I still can't figure 
out why.

Interestingly, if I don't make that change to the XML, and leave it as the 
EMRFS implementation, it will work, as long as I use s3a:// URIs for the jar, 
otherwise spark-submit won't be able to ship them to the cluster since it won't 
have the EMRFS implementation locally.


I see: you are trying to use EMR's "special" S3 in-cluster, but spark-submit is 
trying to submit remotely.

1.  Trying to change the value of fs.s3.impl to S3a works for upload, but not 
runtime
2. use s3a for the upload, leave things alone and all works.

I would just go with S3a, this is just the JARs being discussed here right —not 
the actual data?

When the JARs are needed, they'll be copied on EMR using the amazon S3A 
implementation —whatever they've done there— to the local filesystem, where 
classloaders can pick them up and use. It might be that s3a:// URLs are slower 
on EMR than s3:// URLs, but there's no fundamental reason wny it isn't going to 
work.





On Sun, Aug 28, 2016 at 4:19 PM, Everett Anderson wrote:
(Sorry, typo -- I was using 
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 not 'hadooop', 
of course)

On Sun, Aug 28, 2016 at 12:51 PM, Everett Anderson wrote:
Hi,

I'm having some trouble figuring out a failure when using S3A when writing a 
DataFrame as Parquet on EMR 4.7.2 (which is Hadoop 2.7.2 and Spark 1.6.2). It 
works when using EMRFS (s3://), though.

I'm using these extra conf params, though I've also tried without everything 
but the encryption one with the same result:

--conf spark.hadooop.mapreduce.fileoutputcommitter.algorithm.version=2
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
--conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256
--conf 
spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter

It looks like it does actually write the parquet shards under

/_temporary/0/_temporary//

but then must hit that S3 exception when trying to copy/rename. I think the 
NullPointerException deep down in Parquet is due to it causing close() more 
than once so isn't the root cause, but I'm not sure.

given the stack trace has abortTask() in it, I'd suspect that's a follow-on 
failure.



One possibility here may be related to how EMR will handle your credentials 
(session credentials served up over IAM HTTP) and how Apache Hadoop 2.7's s3a 
auth works (IAM isn't supported until 2.8). That could trigger the problem. But 
I don't know.

I do know that I have dataframes writing back to s3a on Hadoop 2.7.3, *not on 
EMR*.



Anyone seen something like this?


16/08/28 19:46:28 ERROR InsertIntoHadoopFsRelation: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in 
stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 
54, 
ip-10-8-38-103.us-west-2.computk.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:266)
... 8 more
Suppressed: java.lang.NullPointerException
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$abortTask$1(WriterContainer.scala:290)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$2.apply$mcV$sp(WriterContainer.scala:266)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1286)
... 9 more


probable bug in parquet cleanup if it never started right...you 

Re: Spark 2.0.0 - What all access is needed to save model to S3?

2016-08-30 Thread Steve Loughran

On 30 Aug 2016, at 06:20, Aseem Bansal wrote:

So what access exactly is needed? Asking this as I need to ask someone to give me 
appropriate access, and I cannot just ask them to give me all access to the 
bucket.



Commented on the JIRA in general

Regarding S3A access, Spark supports reading read-only repositories, and reading 
and writing repositories to which the user has been given full read/write 
access. More complex S3 access controls (e.g. permissions on the root path, 
write-only, etc.) aren't there.


How to convert an ArrayType to DenseVector within DataFrame?

2016-08-30 Thread evanzamir
I have a DataFrame with a column containing a list of numeric features to be
used for a regression. When I run the regression, I get the following error:

*pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column
features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but
was actually ArrayType(DoubleType,true).'
*
It would be nice if Spark could automatically convert the type, but assuming
that isn't possible, what's the easiest way for me to do the conversion?
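
One common approach, sketched here in Scala (the Python UDF route is analogous; the column names below are made up), is a small UDF that wraps the array in an ml Vector:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object ArrayToVectorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("array-to-vector").getOrCreate()
    import spark.implicits._

    // Hypothetical frame with an array<double> column, as in the error above.
    val df = Seq((1.0, Seq(0.1, 0.2, 0.3)), (0.0, Seq(0.4, 0.5, 0.6))).toDF("label", "raw_features")

    // Convert array<double> -> org.apache.spark.ml.linalg.Vector so ML estimators accept it.
    val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
    val withFeatures = df.withColumn("features", toVector(col("raw_features")))

    withFeatures.printSchema()  // "features" now has the vector type the regression expects
    spark.stop()
  }
}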



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-an-ArrayType-to-DenseVector-within-DataFrame-tp27625.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Does Spark on YARN inherit or replace the Hadoop/YARN configs?

2016-08-30 Thread Everett Anderson
Hi,

I've had a bit of trouble getting Spark on YARN to work. When executing in
this mode and submitting from outside the cluster, one must set
HADOOP_CONF_DIR or YARN_CONF_DIR
, from which
spark-submit can find the params it needs to locate and talk to the YARN
application manager.

However, Spark also packages up all the Hadoop+YARN config files, ships
them to the cluster, and then uses them there.

Does it only override settings on the cluster using those shipped files? Or
does it use those entirely instead of the config the cluster already has?

My impression is that it currently replaces rather than overrides, which
means you can't construct a minimal client-side Hadoop/YARN config with
only the properties necessary to find the cluster. Is that right?


Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Mich Talebzadeh
Have you checked using views in Hive to restrict user access to certain
tables and columns only?

Have a look at this link


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 August 2016 at 16:26, Deepak Sharma  wrote:

> Is it possible to execute any query using SQLContext even if the DB is
> secured using roles or tools such as Sentry?
>
> Thanks
> Deepak
>
> On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan 
> wrote:
>
>> Hi All,
>>
>> In our YARN cluster, we have setup spark 1.6.1 , we plan to give access
>> to all the end users/developers/BI users, etc. But we learnt any valid user
>> after getting their own user kerb TGT, can get hold of sqlContext (in
>> program or in shell) and can run any query against any secure databases.
>>
>> This puts us in a critical condition as we do not want to give blanket
>> permission to everyone.
>>
>>
>>
>> We are looking forward to:
>>
>> 1)  A *solution or a work around, by which we can give secure access
>> only to the selected users to sensitive tables/database.*
>>
>> 2)  *Failing to do so, we would like to remove/disable the SparkSQL
>> context/feature for everyone.  *
>>
>>
>>
>> Any pointers in this direction will be very valuable.
>>
>> Thank you,
>>
>> Arpan
>>
>>
>>
>>
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi,
I had run into a similar exception, "java.util.NoSuchElementException: key
not found: ".
After further investigation I realized it is happening because the VectorIndexer
is fitted on the training dataset and not on the entire dataset.

In the dataframe I have 5 categorical columns; each of these has to go through a
StringIndexer, and then these are put through a VectorIndexer to generate the
feature vector.
What is the right way to do this, so that the VectorIndexer can be run on the
entire data and not just on the training data?

Below is the current approach; as is evident, the VectorIndexer is being generated
based on the training set.

Please note: fit() on the VectorIndexer cannot be called on the entire-set
dataframe since it doesn't have the required column (the *feature* column is
being generated dynamically during pipeline execution).
How can the VectorIndexer be *fit()* on the entire set?

 val col1_indexer = new
StringIndexer().setInputCol("col1").setOutputCol("indexed_col1")
val col2_indexer = new
StringIndexer().setInputCol("col2").setOutputCol("indexed_col2")
val col3_indexer = new
StringIndexer().setInputCol("col3").setOutputCol("indexed_col3")
val col4_indexer = new
StringIndexer().setInputCol("col4").setOutputCol("indexed_col4")
val col5_indexer = new
StringIndexer().setInputCol("col5").setOutputCol("indexed_col5")

val featureArray =
Array("indexed_col1","indexed_col2","indexed_col3","indexed_col4","indexed_col5")
val vectorAssembler = new
VectorAssembler().setInputCols(featureArray).setOutputCol("*feature*")
val featureVectorIndexer = new VectorIndexer()
.setInputCol("feature")
.setOutputCol("indexedfeature")
.setMaxCategories(180)

val decisionTree = new
DecisionTreeClassifier().setMaxBins(300).setMaxDepth(1).setImpurity("entropy").setLabelCol("indexed_user_action").setFeaturesCol("indexedfeature").setPredictionCol("prediction")


val pipeline = new Pipeline().setStages(Array(col1_indexer,col2_indexer,
col3_indexer,col4_indexer,col5_indexer,
vectorAssembler,featureVectorIndexer,decisionTree))
val model = pipeline.*fit(trainingSet)*
val output = model.transform(cvSet)


Thanks,
Baahu

On Fri, Jul 8, 2016 at 11:24 PM, Bryan Cutler  wrote:

> Hi Rich,
>
> I looked at the notebook and it seems like you are fitting the
> StringIndexer and VectorIndexer to only the training data, and it should
> be the entire data set.  So if the training data does not include all of
> the labels and an unknown label appears in the test data during evaluation,
> then it will not know how to index it.  So your code should be like this,
> fit with 'digits' instead of 'training'
>
> val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(digits)
> // Automatically identify categorical features, and index them.
> // Set maxCategories so features with > 4 distinct values are treated as continuous.
> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(digits)
>
> Hope that helps!
>
> On Fri, Jul 1, 2016 at 9:24 AM, Rich Tarro  wrote:
>
>> Hi Bryan.
>>
>> Thanks for your continued help.
>>
>> Here is the code shown in a Jupyter notebook. I figured this was easier
>> than cutting and pasting the code into an email. If you would like me to
>> send you the code in a different format, let me know. The necessary data is
>> all downloaded within the notebook itself.
>>
>> https://console.ng.bluemix.net/data/notebooks/fe7e578a-
>> 401f-4744-a318-b1b6bcf6f5f8/view?access_token=
>> 2f6df7b1dfcb3c1c2d94a794506bb282729dab8f05118fafe5f11886326e02fc
>>
>> A few additional pieces of information.
>>
>> 1. The training dataset is cached before training the model. If you do
>> not cache the training dataset, the model will not train. The code
>> model.transform(test) fails with a similar error. No other changes besides
>> caching or not caching. Again, with the training dataset cached, the model
>> can be successfully trained as seen in the notebook.
>>
>> 2. I have another version of the notebook where I download the same data
>> in libsvm format rather than csv. That notebook works fine. All the code is
>> essentially the same accounting for the difference in file formats.
>>
>> 3. I tested this same code on another Spark cloud platform and it
>> displays the same symptoms when run there.
>>
>> Thanks.
>> Rich
>>
>>
>> On Wed, Jun 29, 2016 at 12:59 AM, Bryan Cutler  wrote:
>>
>>> Are you fitting the VectorIndexer to the entire data set and not just
>>> training or test data?  If you are able to post your code and some data to
>>> reproduce, that would help in troubleshooting.
>>>
>>> On Tue, Jun 28, 2016 at 4:40 PM, Rich Tarro  wrote:
>>>
 Thanks for the response, but in my case I reversed the meaning of
 "prediction" and "predictedLabel". It seemed to make more sense to me that
 way, but in retrospect, it probably only causes confusion to anyone else
 looking at 

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Deepak Sharma
Is it possible to execute any query using SQLContext even if the DB is
secured using roles or tools such as Sentry?

Thanks
Deepak

On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan 
wrote:

> Hi All,
>
> In our YARN cluster, we have setup spark 1.6.1 , we plan to give access to
> all the end users/developers/BI users, etc. But we learnt any valid user
> after getting their own user kerb TGT, can get hold of sqlContext (in
> program or in shell) and can run any query against any secure databases.
>
> This puts us in a critical condition as we do not want to give blanket
> permission to everyone.
>
>
>
> We are looking forward to:
>
> 1)  A *solution or a work around, by which we can give secure access
> only to the selected users to sensitive tables/database.*
>
> 2)  *Failing to do so, we would like to remove/disable the SparkSQL
> context/feature for everyone.  *
>
>
>
> Any pointers in this direction will be very valuable.
>
> Thank you,
>
> Arpan
>
>
>
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


ApacheCon Seville CFP closes September 9th

2016-08-30 Thread Rich Bowen
It's traditional. We wait for the last minute to get our talk proposals
in for conferences.

Well, the last minute has arrived. The CFP for ApacheCon Seville closes
on September 9th, which is less than 2 weeks away. It's time to get your
talks in, so that we can make this the best ApacheCon yet.

It's also time to discuss with your developer and user community whether
there's a track of talks that you might want to propose, so that you
have more complete coverage of your project than a talk or two.

For Apache Big Data, the relevant URLs are:
Event details:
http://events.linuxfoundation.org/events/apache-big-data-europe
CFP:
http://events.linuxfoundation.org/events/apache-big-data-europe/program/cfp

For ApacheCon Europe, the relevant URLs are:
Event details: http://events.linuxfoundation.org/events/apachecon-europe
CFP: http://events.linuxfoundation.org/events/apachecon-europe/program/cfp

This year, we'll be reviewing papers "blind" - that is, looking at the
abstracts without knowing who the speaker is. This has been shown to
eliminate the "me and my buddies" nature of many tech conferences,
producing more diversity, and more new speakers. So make sure your
abstracts clearly explain what you'll be talking about.

For further updates about ApacheCon, follow us on Twitter, @ApacheCon,
or drop by our IRC channel, #apachecon on the Freenode IRC network.

-- 
Rich Bowen
WWW: http://apachecon.com/
Twitter: @ApacheCon

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does it has a way to config limit in query on STS by default?

2016-08-30 Thread Chen Song
I tried both of the following with STS but neither works for me.

Starting STS with --hiveconf hive.limit.optimize.fetch.max=50

and

Setting common.max_count in Zeppelin

Without setting such limits, a query that outputs lots of rows could cause
the driver to OOM and make the STS unusable. Any workarounds or thoughts?


On Tue, Aug 2, 2016 at 7:29 AM Mich Talebzadeh 
wrote:

> I don't think it really works and it is vague. Is it rows, blocks, network?
>
>
>
> [image: Inline images 1]
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 12:09, Chanh Le  wrote:
>
>> Hi Ayan,
>> You mean
>> common.max_count = 1000
>> Max number of SQL results to *display to prevent browser overload*.
>> This is a common property for all connections.
>>
>>
>>
>>
>> It is already set by default in Zeppelin, but I think it doesn’t work with Hive.
>>
>>
>> DOC: http://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/interpreter/jdbc.html
>>
>>
>> On Aug 2, 2016, at 6:03 PM, ayan guha  wrote:
>>
>> Zeppelin already has a param for jdbc
>> On 2 Aug 2016 19:50, "Mich Talebzadeh"  wrote:
>>
>>> Ok I have already set up mine
>>>
>>> <property>
>>>   <name>hive.limit.optimize.fetch.max</name>
>>>   <value>5</value>
>>>   <description>
>>>     Maximum number of rows allowed for a smaller subset of data for
>>>     simple LIMIT, if it is a fetch query.
>>>     Insert queries are not restricted by this limit.
>>>   </description>
>>> </property>
>>>
>>> I am surprised that yours was missing. What did you set it up to?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 August 2016 at 10:18, Chanh Le  wrote:
>>>
 I tried and it works perfectly.

 Regards,
 Chanh


 On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh 
 wrote:

 OK

 Try that

 Another tedious way is to create views in Hive based on tables and use
 limit on those views.

 But try that parameter first if it does anything.

 HTH


 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com

 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 2 August 2016 at 09:13, Chanh Le  wrote:

> Hi Mich,
> I use Spark Thrift Server basically it acts like Hive.
>
> I see that there is property in Hive.
>
> hive.limit.optimize.fetch.max
>
>- Default Value: 5
>- Added In: Hive 0.8.0
>
> Maximum number of rows allowed for a smaller subset of data for simple
> LIMIT, if it is a fetch query. Insert queries are not restricted by this
> limit.
>
>
> Is that related to the problem?
>
>
>
>
> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh 
> wrote:
>
> This is a classic problem on any RDBMS
>
> Set the limit on the number of rows returned like maximum of 50K rows
> through JDBC
>
> What is your JDBC connection going to? Meaning which RDBMS if any?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 

Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Rajani, Arpan
Hi All,
In our YARN cluster, we have set up Spark 1.6.1, and we plan to give access to all 
the end users/developers/BI users, etc. But we learnt that any valid user, after 
getting their own Kerberos TGT, can get hold of a sqlContext (in a program or in the 
shell) and can run any query against any secure database.
This puts us in a critical condition as we do not want to give blanket 
permission to everyone.

We are looking forward to:

1)  A solution or a workaround by which we can give secure access to sensitive 
tables/databases only to the selected users.

2)  Failing to do so, we would like to remove/disable the SparkSQL 
context/feature for everyone.

Any pointers in this direction will be very valuable.
Thank you,
Arpan


Re: broadcast fails on join

2016-08-30 Thread Takeshi Yamamuro
Hi,

How about making the value of `spark.sql.broadcastTimeout` bigger?
The value is 300 by default.
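
For example, a minimal sketch of raising it in Spark 2.0 (the 1200 seconds below
is only an illustration; the equivalent spark.conf.set call also exists in
PySpark):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    // timeout for broadcast exchanges, in seconds (default 300)
    spark.conf.set("spark.sql.broadcastTimeout", "1200")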

// maropu


On Tue, Aug 30, 2016 at 9:09 PM, AssafMendelson 
wrote:

> Hi,
>
> I am seeing a broadcast failure when doing a join as follows:
>
> Assume I have a dataframe df with ~80 million records
>
> I do:
>
> df2 = df.filter(cond) # reduces to ~50 million records
>
> grouped = broadcast(df.groupby(df2.colA).count())
>
> total = df2.join(grouped, df2.colA == grouped.colA, “inner”)
>
> total.filter(total[“count”] > 10).show()
>
>
>
> This fails with an exception:
>
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>
> at org.apache.spark.util.ThreadUtils$.awaitResult(
> ThreadUtils.scala:194)
>
> at org.apache.spark.sql.execution.exchange.
> BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>
> at org.apache.spark.sql.execution.InputAdapter.
> doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> executeBroadcast$1.apply(SparkPlan.scala:125)
>
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> executeBroadcast$1.apply(SparkPlan.scala:125)
>
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> executeQuery$1.apply(SparkPlan.scala:136)
>
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
>
> at org.apache.spark.sql.execution.SparkPlan.
> executeQuery(SparkPlan.scala:133)
>
> at org.apache.spark.sql.execution.SparkPlan.
> executeBroadcast(SparkPlan.scala:124)
>
> at org.apache.spark.sql.execution.joins.
> BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>
> at org.apache.spark.sql.execution.joins.
> BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>
> at org.apache.spark.sql.execution.joins.
> BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>
> at org.apache.spark.sql.execution.CodegenSupport$
> class.consume(WholeStageCodegenExec.scala:153)
>
> at org.apache.spark.sql.execution.ProjectExec.consume(
> basicPhysicalOperators.scala:30)
>
> at org.apache.spark.sql.execution.ProjectExec.doConsume(
> basicPhysicalOperators.scala:62)
>
> at org.apache.spark.sql.execution.CodegenSupport$
> class.consume(WholeStageCodegenExec.scala:153)
>
> at org.apache.spark.sql.execution.FilterExec.consume(
> basicPhysicalOperators.scala:79)
>
>
>
> However, if I do:
>
> grouped.cache()
>
> grouped.count()
>
>
>
> before the join everything is fine (btw the grouped dataframe is 1.5MB
> when cached in memory and I have more than 4GB per executor with 8
> executors, the full dataframe is ~8GB)
>
>
>
> Thanks,
>
> Assaf.
>
>
>
> --
> View this message in context: broadcast fails on join
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>



-- 
---
Takeshi Yamamuro


Re: ApplicationMaster + Fair Scheduler + Dynamic resource allocation

2016-08-30 Thread 梅西0247


1) Is that what you want?
   spark.yarn.am.memory   when running in yarn-client mode
   spark.driver.memory    when running in yarn-cluster mode

2) I think you need to set these configs in spark-defaults.conf:
   spark.dynamicAllocation.minExecutors
   spark.dynamicAllocation.maxExecutors

3) It's not about the Fair Scheduler. Instead of using a MapReduce conf, you need
to set an env variable like this: export SPARK_EXECUTOR_CORES=6

--
From: Cleosson José Pirani de Souza
Sent: Tuesday, 30 August 2016, 19:30
To: user
Subject: ApplicationMaster + Fair Scheduler + Dynamic resource allocation
Hi

 I am using Spark 1.6.2 and Hadoop 2.7.2 in a single node cluster
(Pseudo-Distributed Operation settings for testing purposes). For every spark
application that I submit I get:
  - ApplicationMaster with 1024 MB of RAM and 1 vcore
  - And one container with 1024 MB of RAM and 1 vcore
 I have three questions using dynamic allocation and Fair Scheduler:
  1) How do I set ApplicationMaster max memory to 512m ?
  2) How do I get more than one container running per application ? (Using
dynamic allocation I cannot set the spark.executor.instances)
  3) I noticed that YARN ignores yarn.app.mapreduce.am.resource.mb,
yarn.app.mapreduce.am.resource.cpu-vcores and yarn.app.mapreduce.am.command-opts
when the scheduler is Fair, am I right ?

 My settings:

 Spark
    # spark-defaults.conf
    spark.driver.memory                512m
    spark.yarn.am.memory               512m
    spark.executor.memory              512m
    spark.executor.cores               2
    spark.dynamicAllocation.enabled    true
    spark.shuffle.service.enabled      true
 YARN
    # yarn-site.xml
    yarn.scheduler.maximum-allocation-vcores    32
    yarn.scheduler.minimum-allocation-vcores    1
    yarn.scheduler.maximum-allocation-mb        16384
    yarn.scheduler.minimum-allocation-mb        64
    yarn.scheduler.fair.preemption              true
    yarn.resourcemanager.scheduler.class
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
    yarn.nodemanager.aux-services               spark_shuffle
    # mapred-site.xml
    yarn.app.mapreduce.am.resource.mb           512
    yarn.app.mapreduce.am.resource.cpu-vcores   1
    yarn.app.mapreduce.am.command-opts          -Xmx384
    mapreduce.map.memory.mb                     1024
    mapreduce.map.java.opts                     -Xmx768m
    mapreduce.reduce.memory.mb                  1024
    mapreduce.reduce.java.opts                  -Xmx768m

Thanks in advance,
Cleosson
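
Putting the reply above together, a minimal sketch of the relevant keys in Scala
(values are placeholders, not recommendations; note that the AM/driver memory
settings only take effect if supplied before the JVM starts, e.g. in
spark-defaults.conf or via spark-submit flags, so the SparkConf below mainly
documents the key names):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.am.memory", "512m")              // used in yarn-client mode
      .set("spark.driver.memory", "512m")               // used in yarn-cluster mode
      .set("spark.executor.cores", "2")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1") // placeholder bounds
      .set("spark.dynamicAllocation.maxExecutors", "8")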


Hi, guys, does anyone use Spark in finance market?

2016-08-30 Thread Taotao.Li
Hi, guys,

 I'm a quant engineer in China, and I believe Spark is very promising for the
financial market. But I didn't find cases which combine Spark and finance.

So here I wanna do a small survey:


   - do you guys use Spark in financial market related project?
   - if yes, how large data was fed in your spark application?


 thanks a lot.

*___*
A little ad: I attended the IBM Spark Hackathon, which is here :
http://apachespark.devpost.com/ , and I submitted a small application,
which will be used in my strategies. I hope you guys can give me a vote and
some suggestions on how to use Spark in the financial market to discover
trade opportunities.

here is my small app:
http://devpost.com/software/spark-in-finance-quantitative-investing

thanks a lot.​


-- 
*___*
Quant | Engineer | Boy
*___*
*blog*:http://litaotao.github.io

*github*: www.github.com/litaotao


Re: Spark metrics when running with YARN?

2016-08-30 Thread Vijay Kiran
Hi Otis,

Did you check the REST API as documented in  
http://spark.apache.org/docs/latest/monitoring.html 

Regards,
Vijay
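
For a programmatic route on YARN, the REST API from that page is served on the
driver UI port (4040 by default) and, after the application finishes, by the
history server; Spark can also expose its metrics over JMX by enabling
org.apache.spark.metrics.sink.JmxSink in conf/metrics.properties. A small
sketch, where the host, port and application id are placeholders:

    import scala.io.Source

    val base = "http://driver-host:4040/api/v1"
    // applications known to this UI
    val apps = Source.fromURL(s"$base/applications").mkString
    // executor-level metrics for one (hypothetical) application id
    val executors =
      Source.fromURL(s"$base/applications/app-20160830120000-0001/executors").mkString
    println(executors)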

> On 30 Aug 2016, at 14:43, Otis Gospodnetić  wrote:
> 
> Hi Mich and Vijay,
> 
> Thanks!  I forgot to include an important bit - I'm looking for a 
> programmatic way to get Spark metrics when running Spark under YARN - so JMX 
> or API of some kind.
> 
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> On Tue, Aug 30, 2016 at 6:59 AM, Mich Talebzadeh  
> wrote:
> Spark UI regardless of deployment mode Standalone, yarn etc runs on port 4040 
> by default that can be accessed directly
> 
> Otherwise one can specify a specific port with --conf "spark.ui.port=5" 
> for example 5
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 30 August 2016 at 11:48, Vijay Kiran  wrote:
> 
> From Yarm RM UI, find the spark application Id, and in the application 
> details, you can click on the “Tracking URL” which should give you the Spark 
> UI.
> 
> ./Vijay
> 
> > On 30 Aug 2016, at 07:53, Otis Gospodnetić  
> > wrote:
> >
> > Hi,
> >
> > When Spark is run on top of YARN, where/how can one get Spark metrics?
> >
> > Thanks,
> > Otis
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark metrics when running with YARN?

2016-08-30 Thread Otis Gospodnetić
Hi Mich and Vijay,

Thanks!  I forgot to include an important bit - I'm looking for a
*programmatic* way to get Spark metrics when running Spark under YARN - so
JMX or API of some kind.

Thanks,
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Tue, Aug 30, 2016 at 6:59 AM, Mich Talebzadeh 
wrote:

> Spark UI regardless of deployment mode Standalone, yarn etc runs on port
> 4040 by default that can be accessed directly
>
> Otherwise one can specify a specific port with --conf "spark.ui.port=5"
> for example 5
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 11:48, Vijay Kiran  wrote:
>
>>
>> From Yarm RM UI, find the spark application Id, and in the application
>> details, you can click on the “Tracking URL” which should give you the
>> Spark UI.
>>
>> ./Vijay
>>
>> > On 30 Aug 2016, at 07:53, Otis Gospodnetić 
>> wrote:
>> >
>> > Hi,
>> >
>> > When Spark is run on top of YARN, where/how can one get Spark metrics?
>> >
>> > Thanks,
>> > Otis
>> > --
>> > Monitoring - Log Management - Alerting - Anomaly Detection
>> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Design patterns involving Spark

2016-08-30 Thread Todd Nist
Have not tried this, but looks quite useful if one is using Druid:

https://github.com/implydata/pivot  - An interactive data exploration UI
for Druid

On Tue, Aug 30, 2016 at 4:10 AM, Alonso Isidoro Roman 
wrote:

> Thanks Mitch, i will check it.
>
> Cheers
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2016-08-30 9:52 GMT+02:00 Mich Talebzadeh :
>
>> You can use Hbase for building real time dashboards
>>
>> Check this link
>> 
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 30 August 2016 at 08:33, Alonso Isidoro Roman 
>> wrote:
>>
>>> HBase for real time queries? HBase was designed with the batch in mind.
>>> Impala should be a best choice, but i do not know what Druid can do
>>>
>>>
>>> Cheers
>>>
>>> Alonso Isidoro Roman
>>> [image: https://]about.me/alonso.isidoro.roman
>>>
>>> 
>>>
>>> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh :
>>>
 Hi Chanh,

 Druid sounds like a good choice.

 But again the point being is that what else Druid brings on top of
 Hbase.

 Unless one decides to use Druid for both historical data and real time
 data in place of Hbase!

 It is easier to write API against Druid that Hbase? You still want a UI
 dashboard?

 Cheers

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 30 August 2016 at 03:19, Chanh Le  wrote:

> Hi everyone,
>
> Seems a lot people using Druid for realtime Dashboard.
> I’m just wondering of using Druid for main storage engine because
> Druid can store the raw data and can integrate with Spark also
> (theoretical).
> In that case do we need to store 2 separate storage Druid (store
> segment in HDFS) and HDFS?.
> BTW did anyone try this one https://github.com/Sparkli
> neData/spark-druid-olap?
>
>
> Regards,
> Chanh
>
>
> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Thanks Bhaarat and everyone.
>
> This is an updated version of the same diagram
>
> 
> ​​​
> The frequency of Recent data is defined by the Windows length in Spark
> Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we 
> can
> move any Spark granularity below 0.5 seconds in anger. For some
> applications like Credit card transactions and fraud detection. Data is
> stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
> well. The same Spark Streaming will write asynchronously to HDFS Hive
> tables.
> One school of thought is never write to Hive from Spark, write
>  straight to Hbase and then read Hbase tables into Hive periodically?
>
> Now the third component in this layer is Serving Layer that can
> combine data from the current (Hbase) and the historical (Hive tables) to
> give the user visual analytics. Now that visual analytics can be Real time
> dashboard on top of Serving Layer. That Serving layer could be an 
> in-memory
> NoSQL offering or Data from Hbase (Red Box) combined with Hive tables.
>
> I am not aware of any industrial strength Real time Dashboard.  The
> idea is that one uses such dashboard in real time. Dashboard in this sense
> meaning a general purpose API to data store of some type like on 

broadcast fails on join

2016-08-30 Thread AssafMendelson
Hi,
I am seeing a broadcast failure when doing a join as follows:
Assume I have a dataframe df with ~80 million records
I do:
df2 = df.filter(cond) # reduces to ~50 million records
grouped = broadcast(df.groupby(df2.colA).count())
total = df2.join(grouped, df2.colA == grouped.colA, "inner")
total.filter(total["count"] > 10).show()

This fails with an exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30)
at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)

However, if I do:
grouped.cache()
grouped.count()

before the join everything is fine (btw the grouped dataframe is 1.5MB when 
cached in memory and I have more than 4GB per executor with 8 executors, the 
full dataframe is ~8GB)

Thanks,
Assaf.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-fails-on-join-tp27623.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Writing to Hbase table from Spark

2016-08-30 Thread Todd Nist
Have you looked at spark-packages.org? There are several different HBase
connectors there; not sure if any meet your need or not.

https://spark-packages.org/?q=hbase

HTH,

-Todd

On Tue, Aug 30, 2016 at 5:23 AM, ayan guha  wrote:

> You can use rdd level new hadoop format api and pass on appropriate
> classes.
> On 30 Aug 2016 19:13, "Mich Talebzadeh"  wrote:
>
>> Hi,
>>
>> Is there an existing interface to read from and write to Hbase table in
>> Spark.
>>
>> Similar to below for Parquet
>>
>> val s = spark.read.parquet("oraclehadoop.sales2")
>> s.write.mode("overwrite").parquet("oraclehadoop.sales4")
>>
>> Or need too write Hive table which is already defined over Hbase?
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


ApplicationMaster + Fair Scheduler + Dynamic resource allocation

2016-08-30 Thread Cleosson José Pirani de Souza
Hi

 I am using Spark 1.6.2 and Hadoop 2.7.2 in a single node cluster 
(Pseudo-Distributed Operation settings for testing purposes).
 For every spark application that I submit I get:
  - ApplicationMaster with 1024 MB of RAM and 1 vcore
  - And one container with 1024 MB of RAM and 1 vcore
 I have three questions using dynamic allocation and Fair Scheduler:
  1) How do I set ApplicationMaster max memory to 512m ?
  2) How do I get more than one container running per application ? (Using 
dynamic allocation I cannot set the spark.executor.instances)
  3) I noticed that YARN ignores yarn.app.mapreduce.am.resource.mb, 
yarn.app.mapreduce.am.resource.cpu-vcores and 
yarn.app.mapreduce.am.command-opts when the scheduler is Fair, am I right ?

 My settings:

 Spark
    # spark-defaults.conf
    spark.driver.memory                512m
    spark.yarn.am.memory               512m
    spark.executor.memory              512m
    spark.executor.cores               2
    spark.dynamicAllocation.enabled    true
    spark.shuffle.service.enabled      true
 YARN
    # yarn-site.xml
    yarn.scheduler.maximum-allocation-vcores    32
    yarn.scheduler.minimum-allocation-vcores    1
    yarn.scheduler.maximum-allocation-mb        16384
    yarn.scheduler.minimum-allocation-mb        64
    yarn.scheduler.fair.preemption              true
    yarn.resourcemanager.scheduler.class
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
    yarn.nodemanager.aux-services               spark_shuffle
    # mapred-site.xml
    yarn.app.mapreduce.am.resource.mb           512
    yarn.app.mapreduce.am.resource.cpu-vcores   1
    yarn.app.mapreduce.am.command-opts          -Xmx384
    mapreduce.map.memory.mb                     1024
    mapreduce.map.java.opts                     -Xmx768m
    mapreduce.reduce.memory.mb                  1024
    mapreduce.reduce.java.opts                  -Xmx768m

Thanks in advance,
Cleosson


Re: Spark metrics when running with YARN?

2016-08-30 Thread Mich Talebzadeh
Spark UI regardless of deployment mode Standalone, yarn etc runs on port
4040 by default that can be accessed directly

Otherwise one can specify a specific port with --conf "spark.ui.port=5" for
example 5

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 August 2016 at 11:48, Vijay Kiran  wrote:

>
> From Yarm RM UI, find the spark application Id, and in the application
> details, you can click on the “Tracking URL” which should give you the
> Spark UI.
>
> ./Vijay
>
> > On 30 Aug 2016, at 07:53, Otis Gospodnetić 
> wrote:
> >
> > Hi,
> >
> > When Spark is run on top of YARN, where/how can one get Spark metrics?
> >
> > Thanks,
> > Otis
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark metrics when running with YARN?

2016-08-30 Thread Vijay Kiran

From the YARN RM UI, find the Spark application Id, and in the application details, 
you can click on the “Tracking URL” which should give you the Spark UI.

./Vijay

> On 30 Aug 2016, at 07:53, Otis Gospodnetić  wrote:
> 
> Hi,
> 
> When Spark is run on top of YARN, where/how can one get Spark metrics?
> 
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0 - Join statement compile error

2016-08-30 Thread shengshanzhang
Hi, try this way:

val df = sales_demand.join(product_master,sales_demand("INVENTORY_ITEM_ID") === 
product_master("INVENTORY_ITEM_ID"),"inner")

> On 30 Aug 2016, at 5:52 PM, Jacek Laskowski  wrote:
> 
> Hi Mich,
> 
> This is the first time I've been told about $ for string interpolation (as 
> the function not the placeholder). Thanks for letting me know about it!
> 
> What is often used is s"whatever you want to reference inside the string 
> $-prefix unless it is a complex expression" i.e.
> 
> scala> s"I'm using $spark in ${spark.version}"
> res0: String = I'm using org.apache.spark.sql.SparkSession@1fc1c7e in 
> 2.1.0-SNAPSHOT
> 
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/ 
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark 
> 
> Follow me at https://twitter.com/jaceklaskowski 
> 
> On Tue, Aug 30, 2016 at 10:27 AM, Mich Talebzadeh  > wrote:
> Actually I doubled checked this ‘s’ String Interpolator
> 
> In Scala
> 
> scala> val chars =  "This is Scala"
> chars: String = This is Scala
> scala> println($"$chars")
> This is Scala
> 
> OK so far fine. In shell (ksh) can do
> 
> chars="This is Scala"
> print "$chars"
> This is Scala
> 
> In Shell
> 
> print "$charsand it is interesting"
>  it is interesting
> 
> Have a problem with interpretation!
> 
> This will work by using {} brackers
> 
> print "${chars} and it is interesting"
> This is Scala and it is interesting
> 
> 
> In Scala
> 
> println($"$charsand it is interesting")
> :24: error: not found: value charsand
>println($"$charsand it is interesting")
> 
> Likewise shell this will work by deploying the brackets {}
> println($"${chars} and it is interesting")
> This is Scala and it is interesting
> 
> So I presume it is best practice to use ${} in Scala like shell? Although in 
> most cases $VALUE should work.
> 
> 
> Cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 28 August 2016 at 09:25, Mich Talebzadeh  > wrote:
> Yes I realised that. Actually I thought it was s not $. it has been around in 
> shell for years say for actual values --> ${LOG_FILE}, for position 's/ etc
> 
> 
> cat ${LOG_FILE} | egrep -v 'rows affected|return status|&&&' | sed -e 's/^[   
>  ]*//g' -e 's/^//g' -e '/^$/d' > temp.out
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 28 August 2016 at 08:47, Jacek Laskowski  > wrote:
> Hi Mich,
> 
> This is Scala's string interpolation which allow for replacing $-prefixed 
> expressions with their values.
> 
> It's what cool kids use in Scala to do templating and concatenation 
> 
> Jacek
> 
> 
> On 23 Aug 2016 9:21 a.m., "Mich Talebzadeh"  > wrote:
> What is   --> s below before the text of sql?
> 
> var sales_order_sql_stmt = s"""SELECT ORDER_NUMBER , INVENTORY_ITEM_ID, 
> ORGANIZATION_ID,
> 
>   from_unixtime(unix_timestamp(SCHEDULE_SHIP_DATE,'-MM-dd'), 
> '-MM-dd') AS schedule_date
> 
>   FROM sales_order_demand
> 
>   WHERE unix_timestamp(SCHEDULE_SHIP_DATE,'-MM-dd') >= 
> $planning_start_date  limit 10"""
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 

Re: Spark 2.0 - Join statement compile error

2016-08-30 Thread Jacek Laskowski
Hi Mich,

This is the first time I've been told about $ for string interpolation (as
the function not the placeholder). Thanks for letting me know about it!

What is often used is s"whatever you want to reference inside the string
$-prefix unless it is a complex expression" i.e.

scala> s"I'm using $spark in ${spark.version}"
res0: String = I'm using org.apache.spark.sql.SparkSession@1fc1c7e in
2.1.0-SNAPSHOT

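
A tiny follow-up sketch on when the braces are required: only when the
identifier would otherwise run into the text that follows it.

    val chars = "This is Scala"
    println(s"$chars and it is interesting")   // fine: the space ends the identifier
    println(s"${chars}and it is interesting")  // braces needed: "charsand" is not a value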

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Aug 30, 2016 at 10:27 AM, Mich Talebzadeh  wrote:

> Actually I doubled checked this ‘s’ String Interpolator
>
> In Scala
>
> scala> val chars =  "This is Scala"
> chars: String = This is Scala
>
> scala> println($"$chars")
> This is Scala
> OK so far fine. In shell (ksh) can do
>
> chars="This is Scala"
> print "$chars"
> This is Scala
>
> In Shell
>
> print "$charsand it is interesting"
>  it is interesting
>
> Have a problem with interpretation!
>
> This will work by using {} brackers
>
> print "${chars} and it is interesting"
> This is Scala and it is interesting
>
>
> In Scala
>
> println($"$charsand it is interesting")
> :24: error: not found: value charsand
>println($"$charsand it is interesting")
>
> Likewise shell this will work by deploying the brackets {}
>
> println($"${chars} and it is interesting")
> This is Scala and it is interesting
> So I presume it is best practice to use ${} in Scala like shell? Although
> in most cases $VALUE should work.
>
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 August 2016 at 09:25, Mich Talebzadeh 
> wrote:
>
>> Yes I realised that. Actually I thought it was s not $. it has been
>> around in shell for years say for actual values --> ${LOG_FILE}, for
>> position 's/ etc
>>
>>
>> cat ${LOG_FILE} | egrep -v 'rows affected|return status|&&&' | sed -e
>> 's/^[]*//g' -e 's/^//g' -e '/^$/d' > temp.out
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 August 2016 at 08:47, Jacek Laskowski  wrote:
>>
>>> Hi Mich,
>>>
>>> This is Scala's string interpolation which allow for replacing
>>> $-prefixed expressions with their values.
>>>
>>> It's what cool kids use in Scala to do templating and concatenation 
>>>
>>> Jacek
>>>
>>> On 23 Aug 2016 9:21 a.m., "Mich Talebzadeh" 
>>> wrote:
>>>
 What is   --> s below before the text of sql?

 *var* sales_order_sql_stmt =* s*"""SELECT ORDER_NUMBER ,
 INVENTORY_ITEM_ID, ORGANIZATION_ID,

   from_unixtime(unix_timestamp(SCHEDULE_SHIP_DATE,'-MM-dd'),
 '-MM-dd') AS schedule_date

   FROM sales_order_demand

   WHERE unix_timestamp(SCHEDULE_SHIP_DATE,'-MM-dd') >= $
 planning_start_date  limit 10"""

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 23 August 2016 at 07:31, Deepak Sharma 
 wrote:

>
> On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma  > wrote:
>
>> 

Re: Writing to Hbase table from Spark

2016-08-30 Thread ayan guha
You can use the RDD-level new Hadoop format API and pass in the appropriate
classes.
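
A minimal sketch of that route for writing, using the HBase TableOutputFormat
(the table name, column family and RDD contents are made up for illustration;
the addColumn call assumes an HBase 1.x client, and a Spark 2.0 session named
spark is assumed):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    val hConf = HBaseConfiguration.create()
    hConf.set(TableOutputFormat.OUTPUT_TABLE, "sales_hbase")   // hypothetical table
    val job = Job.getInstance(hConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // toy key/value pairs standing in for real rows
    val rdd = spark.sparkContext.parallelize(Seq(("row1", "10"), ("row2", "20")))
    val puts = rdd.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes(value))
      (new ImmutableBytesWritable, put)
    }
    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)

For reading, the matching piece is sc.newAPIHadoopRDD with TableInputFormat.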
On 30 Aug 2016 19:13, "Mich Talebzadeh"  wrote:

> Hi,
>
> Is there an existing interface to read from and write to Hbase table in
> Spark.
>
> Similar to below for Parquet
>
> val s = spark.read.parquet("oraclehadoop.sales2")
> s.write.mode("overwrite").parquet("oraclehadoop.sales4")
>
> Or need too write Hive table which is already defined over Hbase?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Writing to Hbase table from Spark

2016-08-30 Thread Mich Talebzadeh
Hi,

Is there an existing interface to read from and write to an HBase table in
Spark?

Similar to below for Parquet

val s = spark.read.parquet("oraclehadoop.sales2")
s.write.mode("overwrite").parquet("oraclehadoop.sales4")

Or do we need to write to a Hive table which is already defined over HBase?


Thanks





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: How to convert List into json object / json Array

2016-08-30 Thread Sivakumaran S
Look at scala.util.parsing.json or the Jackson library for json manipulation. 
Also read 
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 


Regards,

Sivakumaran S
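
Another option, if the goal is simply JSON text per row: Dataset.toJSON produces
JSON strings directly, so no hand conversion to JSONObject/JSONArray is needed.
A sketch in Scala (the same toJSON method exists on the Java Dataset API; the
path is the one from the original mail):

    val salaries = spark.read.json("/Users/Macbook/Downloads/rows_salaries.json")
    // one JSON object per row, as plain strings
    val jsonStrings = salaries.toJSON.collect()
    jsonStrings.take(5).foreach(println)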

How to convert List into json object / json Array

2016-08-30 Thread Sree Eedupuganti
Here is the snippet of my code :

Dataset<Row> rows_salaries = 
spark.read().json("/Users/Macbook/Downloads/rows_salaries.json");
rows_salaries.createOrReplaceTempView("salaries");
List<Row> df = spark.sql("select * from salaries").collectAsList();

I need to read the JSON data from 'List<Row> df = spark.sql("select * from 
salaries").collectAsList();' but I am unable to convert 'df' to either a 
JSONObject or a JSONArray. Is the way I am going about it right, or is there 
another way to fetch the JSON data? Any suggestions please...

Thanks...




Re: Spark 2.0 - Join statement compile error

2016-08-30 Thread Mich Talebzadeh
Actually I double checked this ‘s’ String Interpolator

In Scala

scala> val chars =  "This is Scala"
chars: String = This is Scala

scala> println($"$chars")
This is Scala
OK so far fine. In shell (ksh) can do

chars="This is Scala"
print "$chars"
This is Scala

In Shell

print "$charsand it is interesting"
 it is interesting

Have a problem with interpretation!

This will work by using {} brackets

print "${chars} and it is interesting"
This is Scala and it is interesting


In Scala

println($"$charsand it is interesting")
:24: error: not found: value charsand
   println($"$charsand it is interesting")

Likewise to shell, this will work by deploying the brackets {}

println($"${chars} and it is interesting")
This is Scala and it is interesting
So I presume it is best practice to use ${} in Scala like shell? Although
in most cases $VALUE should work.


Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 August 2016 at 09:25, Mich Talebzadeh 
wrote:

> Yes I realised that. Actually I thought it was s not $. it has been around
> in shell for years say for actual values --> ${LOG_FILE}, for position 's/
> etc
>
>
> cat ${LOG_FILE} | egrep -v 'rows affected|return status|&&&' | sed -e
> 's/^[]*//g' -e 's/^//g' -e '/^$/d' > temp.out
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 August 2016 at 08:47, Jacek Laskowski  wrote:
>
>> Hi Mich,
>>
>> This is Scala's string interpolation which allow for replacing $-prefixed
>> expressions with their values.
>>
>> It's what cool kids use in Scala to do templating and concatenation 
>>
>> Jacek
>>
>> On 23 Aug 2016 9:21 a.m., "Mich Talebzadeh" 
>> wrote:
>>
>>> What is   --> s below before the text of sql?
>>>
>>> *var* sales_order_sql_stmt =* s*"""SELECT ORDER_NUMBER ,
>>> INVENTORY_ITEM_ID, ORGANIZATION_ID,
>>>
>>>   from_unixtime(unix_timestamp(SCHEDULE_SHIP_DATE,'-MM-dd'),
>>> '-MM-dd') AS schedule_date
>>>
>>>   FROM sales_order_demand
>>>
>>>   WHERE unix_timestamp(SCHEDULE_SHIP_DATE,'-MM-dd') >= $
>>> planning_start_date  limit 10"""
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 23 August 2016 at 07:31, Deepak Sharma  wrote:
>>>

 On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma 
 wrote:

> *val* *df** = **sales_demand**.**join**(**product_master**,*
> *sales_demand**.$"INVENTORY_ITEM_ID" =**== **product_master*
> *.$"INVENTORY_ITEM_ID",**"inner"**)*


 Ignore the last statement.
 It should look something like this:
 *val* *df** = **sales_demand**.**join**(**product_master**,$"*
 *sales_demand**.INVENTORY_ITEM_ID" =**== $"**product_master*
 *.INVENTORY_ITEM_ID",**"inner"**)*


 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>>
>


Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
Thanks Mich, I will check it.

Cheers


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2016-08-30 9:52 GMT+02:00 Mich Talebzadeh :

> You can use Hbase for building real time dashboards
>
> Check this link
> 
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 08:33, Alonso Isidoro Roman 
> wrote:
>
>> HBase for real time queries? HBase was designed with the batch in mind.
>> Impala should be a best choice, but i do not know what Druid can do
>>
>>
>> Cheers
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> 
>>
>> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh :
>>
>>> Hi Chanh,
>>>
>>> Druid sounds like a good choice.
>>>
>>> But again the point being is that what else Druid brings on top of
>>> Hbase.
>>>
>>> Unless one decides to use Druid for both historical data and real time
>>> data in place of Hbase!
>>>
>>> It is easier to write API against Druid that Hbase? You still want a UI
>>> dashboard?
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 30 August 2016 at 03:19, Chanh Le  wrote:
>>>
 Hi everyone,

 Seems a lot people using Druid for realtime Dashboard.
 I’m just wondering of using Druid for main storage engine because Druid
 can store the raw data and can integrate with Spark also (theoretical).
 In that case do we need to store 2 separate storage Druid (store
 segment in HDFS) and HDFS?.
 BTW did anyone try this one https://github.com/Sparkli
 neData/spark-druid-olap?


 Regards,
 Chanh


 On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh 
 wrote:

 Thanks Bhaarat and everyone.

 This is an updated version of the same diagram

 
 ​​​
 The frequency of Recent data is defined by the Windows length in Spark
 Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we can
 move any Spark granularity below 0.5 seconds in anger. For some
 applications like Credit card transactions and fraud detection. Data is
 stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
 well. The same Spark Streaming will write asynchronously to HDFS Hive
 tables.
 One school of thought is never write to Hive from Spark, write
  straight to Hbase and then read Hbase tables into Hive periodically?

 Now the third component in this layer is Serving Layer that can combine
 data from the current (Hbase) and the historical (Hive tables) to give the
 user visual analytics. Now that visual analytics can be Real time dashboard
 on top of Serving Layer. That Serving layer could be an in-memory NoSQL
 offering or Data from Hbase (Red Box) combined with Hive tables.

 I am not aware of any industrial strength Real time Dashboard.  The
 idea is that one uses such dashboard in real time. Dashboard in this sense
 meaning a general purpose API to data store of some type like on Serving
 layer to provide visual analytics real time on demand, combining real time
 data and aggregate views. As usual the devil in the detail.



 Let me know your thoughts. Anyway this is first cut pattern.

 ​​

 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 

Re: Design patterns involving Spark

2016-08-30 Thread Mich Talebzadeh
You can use Hbase for building real time dashboards

Check this link


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 August 2016 at 08:33, Alonso Isidoro Roman  wrote:

> HBase for real time queries? HBase was designed with the batch in mind.
> Impala should be a best choice, but i do not know what Druid can do
>
>
> Cheers
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh :
>
>> Hi Chanh,
>>
>> Druid sounds like a good choice.
>>
>> But again the point being is that what else Druid brings on top of Hbase.
>>
>> Unless one decides to use Druid for both historical data and real time
>> data in place of Hbase!
>>
>> It is easier to write API against Druid that Hbase? You still want a UI
>> dashboard?
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 30 August 2016 at 03:19, Chanh Le  wrote:
>>
>>> Hi everyone,
>>>
>>> Seems a lot people using Druid for realtime Dashboard.
>>> I’m just wondering of using Druid for main storage engine because Druid
>>> can store the raw data and can integrate with Spark also (theoretical).
>>> In that case do we need to store 2 separate storage Druid (store segment
>>> in HDFS) and HDFS?.
>>> BTW did anyone try this one https://github.com/Sparkli
>>> neData/spark-druid-olap?
>>>
>>>
>>> Regards,
>>> Chanh
>>>
>>>
>>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Thanks Bhaarat and everyone.
>>>
>>> This is an updated version of the same diagram
>>>
>>> 
>>> ​​​
>>> The frequency of Recent data is defined by the Windows length in Spark
>>> Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we can
>>> move any Spark granularity below 0.5 seconds in anger. For some
>>> applications like Credit card transactions and fraud detection. Data is
>>> stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
>>> well. The same Spark Streaming will write asynchronously to HDFS Hive
>>> tables.
>>> One school of thought is never write to Hive from Spark, write  straight
>>> to Hbase and then read Hbase tables into Hive periodically?
>>>
>>> Now the third component in this layer is Serving Layer that can combine
>>> data from the current (Hbase) and the historical (Hive tables) to give the
>>> user visual analytics. Now that visual analytics can be Real time dashboard
>>> on top of Serving Layer. That Serving layer could be an in-memory NoSQL
>>> offering or Data from Hbase (Red Box) combined with Hive tables.
>>>
>>> I am not aware of any industrial strength Real time Dashboard.  The idea
>>> is that one uses such dashboard in real time. Dashboard in this sense
>>> meaning a general purpose API to data store of some type like on Serving
>>> layer to provide visual analytics real time on demand, combining real time
>>> data and aggregate views. As usual the devil in the detail.
>>>
>>>
>>>
>>> Let me know your thoughts. Anyway this is first cut pattern.
>>>
>>> ​​
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, 

Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
HBase for real-time queries? HBase was designed with batch in mind.
Impala should be a better choice, but I do not know what Druid can do


Cheers

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2016-08-30 8:56 GMT+02:00 Mich Talebzadeh :

> Hi Chanh,
>
> Druid sounds like a good choice.
>
> But again the point being is that what else Druid brings on top of Hbase.
>
> Unless one decides to use Druid for both historical data and real time
> data in place of Hbase!
>
> It is easier to write API against Druid that Hbase? You still want a UI
> dashboard?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 03:19, Chanh Le  wrote:
>
>> Hi everyone,
>>
>> Seems a lot people using Druid for realtime Dashboard.
>> I’m just wondering of using Druid for main storage engine because Druid
>> can store the raw data and can integrate with Spark also (theoretical).
>> In that case do we need to store 2 separate storage Druid (store segment
>> in HDFS) and HDFS?.
>> BTW did anyone try this one https://github.com/Sparkli
>> neData/spark-druid-olap?
>>
>>
>> Regards,
>> Chanh
>>
>>
>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh 
>> wrote:
>>
>> Thanks Bhaarat and everyone.
>>
>> This is an updated version of the same diagram
>>
>> 
>> ​​​
>> The frequency of Recent data is defined by the Windows length in Spark
>> Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we can
>> move any Spark granularity below 0.5 seconds in anger. For some
>> applications like Credit card transactions and fraud detection. Data is
>> stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
>> well. The same Spark Streaming will write asynchronously to HDFS Hive
>> tables.
>> One school of thought is never write to Hive from Spark, write  straight
>> to Hbase and then read Hbase tables into Hive periodically?
>>
>> Now the third component in this layer is Serving Layer that can combine
>> data from the current (Hbase) and the historical (Hive tables) to give the
>> user visual analytics. Now that visual analytics can be Real time dashboard
>> on top of Serving Layer. That Serving layer could be an in-memory NoSQL
>> offering or Data from Hbase (Red Box) combined with Hive tables.
>>
>> I am not aware of any industrial strength Real time Dashboard.  The idea
>> is that one uses such dashboard in real time. Dashboard in this sense
>> meaning a general purpose API to data store of some type like on Serving
>> layer to provide visual analytics real time on demand, combining real time
>> data and aggregate views. As usual the devil in the detail.
>>
>>
>>
>> Let me know your thoughts. Anyway this is first cut pattern.
>>
>> ​​
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 August 2016 at 18:53, Bhaarat Sharma  wrote:
>>
>>> Hi Mich
>>>
>>> This is really helpful. I'm trying to wrap my head around the last
>>> diagram you shared (the one with kafka). In this diagram spark streaming is
>>> pushing data to HDFS and NoSql. However, I'm confused by the "Real Time
>>> Queries, Dashboards" annotation. Based on this diagram, will real time
>>> queries be running on Spark or HBase?
>>>
>>> PS: My intention was not to steer the conversation away from what Ashok
>>> asked but I found the diagrams shared by Mich very insightful.
>>>
>>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 In terms of positioning, Spark is really the first Big Data platform to
 integrate batch, streaming and interactive computations in a unified
 framework. What this boils down to is the fact that 

How to convert List into json object / json Array

2016-08-30 Thread Sree Eedupuganti
Any suggestions please.


Re: Design patterns involving Spark

2016-08-30 Thread Mich Talebzadeh
Hi Chanh,

Druid sounds like a good choice.

But again the point is: what else does Druid bring on top of HBase?

Unless one decides to use Druid for both historical data and real time data
in place of HBase!

Is it easier to write an API against Druid than against HBase? Do you still
want a UI dashboard?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 August 2016 at 03:19, Chanh Le  wrote:

> Hi everyone,
>
> Seems a lot people using Druid for realtime Dashboard.
> I’m just wondering of using Druid for main storage engine because Druid
> can store the raw data and can integrate with Spark also (theoretical).
> In that case do we need to store 2 separate storage Druid (store segment
> in HDFS) and HDFS?.
> BTW did anyone try this one https://github.com/
> SparklineData/spark-druid-olap?
>
>
> Regards,
> Chanh
>
>
> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh 
> wrote:
>
> Thanks Bhaarat and everyone.
>
> This is an updated version of the same diagram
>
> 
> ​​​
> The frequency of Recent data is defined by the Windows length in Spark
> Streaming. It can vary between 0.5 seconds to an hour. ( Don't think we can
> move any Spark granularity below 0.5 seconds in anger. For some
> applications like Credit card transactions and fraud detection. Data is
> stored real time by Spark in Hbase tables. Hbase tables will be on HDFS as
> well. The same Spark Streaming will write asynchronously to HDFS Hive
> tables.
> One school of thought is never write to Hive from Spark, write  straight
> to Hbase and then read Hbase tables into Hive periodically?
>
> Now the third component in this layer is Serving Layer that can combine
> data from the current (Hbase) and the historical (Hive tables) to give the
> user visual analytics. Now that visual analytics can be Real time dashboard
> on top of Serving Layer. That Serving layer could be an in-memory NoSQL
> offering or Data from Hbase (Red Box) combined with Hive tables.
>
> I am not aware of any industrial strength Real time Dashboard.  The idea
> is that one uses such dashboard in real time. Dashboard in this sense
> meaning a general purpose API to data store of some type like on Serving
> layer to provide visual analytics real time on demand, combining real time
> data and aggregate views. As usual the devil in the detail.
>
>
>
> Let me know your thoughts. Anyway this is first cut pattern.
>
> ​​
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 August 2016 at 18:53, Bhaarat Sharma  wrote:
>
>> Hi Mich
>>
>> This is really helpful. I'm trying to wrap my head around the last
>> diagram you shared (the one with kafka). In this diagram spark streaming is
>> pushing data to HDFS and NoSql. However, I'm confused by the "Real Time
>> Queries, Dashboards" annotation. Based on this diagram, will real time
>> queries be running on Spark or HBase?
>>
>> PS: My intention was not to steer the conversation away from what Ashok
>> asked but I found the diagrams shared by Mich very insightful.
>>
>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In terms of positioning, Spark is really the first Big Data platform to
>>> integrate batch, streaming and interactive computations in a unified
>>> framework. What this boils down to is the fact that whichever way one look
>>> at it there is somewhere that Spark can make a contribution to. In general,
>>> there are few design patterns common to Big Data
>>>
>>>
>>>
>>>- *ETL & Batch*
>>>
>>> The first one is the most common one with Established tools like Sqoop,
>>> Talend for ETL and HDFS for storage of some kind. Spark can be used as the
>>> execution engine for Hive at the storage level which  actually makes it
>>> a true vendor independent (BTW, Impala and Tez and LLAP) are offered by
>>> vendors) processing engine. Personally I use Spark 

Re: Spark metrics when running with YARN?

2016-08-30 Thread Mich Talebzadeh
Have you checked the Spark UI, which runs on HOST:4040 by default?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 August 2016 at 06:53, Otis Gospodnetić 
wrote:

> Hi,
>
> When Spark is run on top of YARN, where/how can one get Spark metrics?
>
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>