Re: Deep learning libraries for scala

2016-09-30 Thread Suresh Thalamati
TensorFrames:

https://spark-packages.org/package/databricks/tensorframes 
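If you want to try it quickly, spark-packages artifacts are normally pulled in
with the --packages flag; for example (the version here is only an illustration,
check the package page for the current one):

  spark-shell --packages databricks:tensorframes:0.2.9-s_2.11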


Hope that helps
-suresh

> On Sep 30, 2016, at 8:00 PM, janardhan shetty  wrote:
> 
> Looking for scala dataframes in particular ?
> 
> On Fri, Sep 30, 2016 at 7:46 PM, Gavin Yue  > wrote:
> Skymind you could try. It is java
> 
> I never test though.
> 
> > On Sep 30, 2016, at 7:30 PM, janardhan shetty  > > wrote:
> >
> > Hi,
> >
> > Are there any good libraries which can be used for scala deep learning 
> > models ?
> > How can we integrate tensorflow with scala ML ?
> 



Re: Deep learning libraries for scala

2016-09-30 Thread janardhan shetty
Looking for Scala dataframes in particular?

On Fri, Sep 30, 2016 at 7:46 PM, Gavin Yue  wrote:

> Skymind you could try. It is java
>
> I never test though.
>
> > On Sep 30, 2016, at 7:30 PM, janardhan shetty 
> wrote:
> >
> > Hi,
> >
> > Are there any good libraries which can be used for scala deep learning
> models ?
> > How can we integrate tensorflow with scala ML ?
>


Re: Deep learning libraries for scala

2016-09-30 Thread Gavin Yue
You could try Skymind (DL4J). It is Java.

I never tested it though.

> On Sep 30, 2016, at 7:30 PM, janardhan shetty  wrote:
> 
> Hi,
> 
> Are there any good libraries which can be used for scala deep learning models 
> ?
> How can we integrate tensorflow with scala ML ?

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread janardhan shetty
It would be good to know which paper inspired the implementation of the version
we use in Spark 2.0 decision trees.

On Fri, Sep 30, 2016 at 4:44 PM, Peter Figliozzi 
wrote:

> It's a good question.  People have been publishing papers on decision
> trees and various methods of constructing and pruning them for over 30
> years.  I think it's rather a question for a historian at this point.
>
> On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty 
> wrote:
>
>> Read this explanation but wondering if this algorithm has the base from a
>> research paper for detail understanding.
>>
>> On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott > > wrote:
>>
>>> The documentation details the algorithm being used at
>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
 Hi,

 Any help here is appreciated ..

 On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <
 janardhan...@gmail.com> wrote:

> Is there a reference to the research paper which is implemented in
> spark 2.0 ?
>
> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <
> janardhan...@gmail.com> wrote:
>
>> Which algorithm is used under the covers while doing decision trees
>> FOR SPARK ?
>> for example: scikit-learn (python) uses an optimised version of the
>> CART algorithm.
>>
>
>

>>>
>>
>


Deep learning libraries for scala

2016-09-30 Thread janardhan shetty
Hi,

Are there any good libraries which can be used for Scala deep learning
models?
How can we integrate TensorFlow with Scala ML?


Re: get different results when debugging and running scala program

2016-09-30 Thread Jakob Odersky
There is no image attached, I'm not sure how the apache mailing lists
handle them. Can you provide the output as text?

best,
--Jakob

On Fri, Sep 30, 2016 at 8:25 AM, chen yong  wrote:
> Hello All,
>
>
>
> I am using IDEA 15.0.4 to debug a scala program. It is strange to me that
> the results were different when I debug or run the program. The differences
> can be seen in the attached filed run.jpg and debug.jpg. The code lines of
> the scala program are shown below.
>
>
> Thank you all
>
>
> ---
>
> import scala.collection.mutable.ArrayBuffer
>
> object TestCase1{
> def func(test:Iterator[(Int,Long)]): Iterator[(Int,Long)]={
> println("in")
> val test1=test.flatmap{
> case(item,count)=>
> val newPrefix=item
> println(count)
> val a=Iterator.single((newPrefix,count))
> func(a)
> val c = a
> c
> }
> test1
> }
> def main(args: Array[String]){
> val freqItems = ArrayBuffer((2,3L),(3,2L),(4,1L))
> val test = freqItems.toIterator
> val result = func(test)
> val reer = result.toArray
> }
> }
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark on yarn environment var

2016-09-30 Thread Saurabh Malviya (samalviy)
Hi,

I am running spark on yarn using oozie.

When I submit through the command line using spark-submit, Spark is able to read the
env variable. But when I submit through Oozie, it is not able to get the env variable,
and I don't see the driver log.

Is there any way to specify an env variable in the Oozie spark action?

Saurabh


Re: Design considerations for batch and speed layers

2016-09-30 Thread Rodrick Brown
We process millions of records using Kafka, Elastic Search, Accumulo,
Mesos, Spark & Vertica.

There is a pattern for this type of pipeline today called SMACK; more about it
here --
http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka


On Fri, Sep 30, 2016 at 4:55 PM, Ashok Kumar 
wrote:

> Can one design a fast pipeline with Kafka, Spark streaming and Hbase  or
> something similar?
>
>
>
>
>
> On Friday, 30 September 2016, 17:17, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> I have designed this prototype for a risk business. Here I would like to
> discuss issues with batch layer. *Apologies about being long winded.*
>
> *Business objective*
>
> Reduce risk in the credit business while making better credit and trading
> decisions. Specifically, to identify risk trends within certain years of
> trading data. For example, measure the risk exposure in a give portfolio by
> industry, region, credit rating and other parameters. At the macroscopic
> level, analyze data across market sectors, over a given time horizon to
> asses risk changes
>
> *Deliverable*
> Enable real time and batch analysis of risk data
>
> *Batch technology stack used*
> Kafka -> zookeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query
> tool, Zeppelin
>
> *Test volumes for POC*
> 1 message queue (csv format), 100 stock prices streaming in very 2
> seconds, 180K prices per hour, 4 million + per day
>
>
>1. prices to Kafka -> Zookeeper -> Flume -> HDFS
>2. HDFS daily partition for that day's data
>3. Hive external table looking at HDFS partitioned location
>4. Hive managed table populated every 15 minutes via cron from Hive
>external table (table type ORC partitioned by date). This is purely Hive
>job. Hive table is populated using insert/overwrite for that day to
>avoid boundary value/missing data etc.
>5. Typical batch ingestion time (Hive table populated from HDFS files)
>~ 2 minutes
>6. Data in Hive table has 15 minutes latency
>7. Zeppelin to be used as UI with Spark
>
>
> Zeppelin will use Spark SQL (on Spark Thrift Server) and Spark shell.
> Within Spark shell, users can access batch tables in Hive *or *they have
> a choice of accessing raw data on HDFS files which gives them* real time
> access * (not to be confused with speed layer).  Using typical query with
> Spark, to see the last 15 minutes of real time data (T-15 -Now) takes 1
> min. Running the same query (my typical query not user query) on Hive
> tables this time using Spark takes 6 seconds.
>
> However, there are some  design concerns:
>
>
>1. Zeppelin starts slowing down by the end of day. Sometimes it throws
>broken pipe message. I resolve this by restarting Zeppelin daemon.
>Potential show stopper
>2. As the volume of data increases throughout the day, performance
>becomes an issue
>3. Every 15 minutes when the cron starts, Hive insert/overwrites can
>potentially get in conflict with users throwing queries from
>Zeppelin/Spark. I am sure that with exclusive writes, Hive will block all
>users from accessing these tables (at partition level) until insert
>overwrite is done. This can be improved by better partitioning of Hive
>tables or relaxing ingestion time to half hour or one hour at a cost of
>more lagging. I tried Parquet tables in Hive but really no difference in
>performance gain. I have thought of replacing Hive with Hbase etc. but that
>brings new complications in as well without necessarily solving the issue.
>4. I am not convinced this design can scale up easily with 5 times
>more volume of data.
>5. We will also get real time data from RDBMS tables (Oracle, Sybase,
>MSSQL)using replication technologies such as Sap Replication Server. These
>currently deliver changed log data to Hive tables. So there is some
>compatibility issue here.
>
>
> So I am sure some members can add useful ideas :)
>
> Thanks
>
> Mich
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


-- 


Rodrick Brown / DevOPs

9174456839 / rodr...@orchardplatform.com

Orchard Platform
101 5th Avenue, 4th Floor, New York, NY


Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Peter Figliozzi
It's a good question.  People have been publishing papers on decision trees
and various methods of constructing and pruning them for over 30 years.  I
think it's rather a question for a historian at this point.

On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty 
wrote:

> Read this explanation but wondering if this algorithm has the base from a
> research paper for detail understanding.
>
> On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott 
> wrote:
>
>> The documentation details the algorithm being used at
>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>
>> Thanks,
>> Kevin
>>
>> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty > > wrote:
>>
>>> Hi,
>>>
>>> Any help here is appreciated ..
>>>
>>> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
 Is there a reference to the research paper which is implemented in
 spark 2.0 ?

 On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <
 janardhan...@gmail.com> wrote:

> Which algorithm is used under the covers while doing decision trees
> FOR SPARK ?
> for example: scikit-learn (python) uses an optimised version of the
> CART algorithm.
>


>>>
>>
>


Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread janardhan shetty
I read this explanation, but I am wondering whether this algorithm is based on a
research paper, for a more detailed understanding.

On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott 
wrote:

> The documentation details the algorithm being used at
> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>
> Thanks,
> Kevin
>
> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty 
> wrote:
>
>> Hi,
>>
>> Any help here is appreciated ..
>>
>> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <
>> janardhan...@gmail.com> wrote:
>>
>>> Is there a reference to the research paper which is implemented in spark
>>> 2.0 ?
>>>
>>> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
 Which algorithm is used under the covers while doing decision trees FOR
 SPARK ?
 for example: scikit-learn (python) uses an optimised version of the
 CART algorithm.

>>>
>>>
>>
>


Re: Issues in compiling spark 2.0.0 code using scala-maven-plugin

2016-09-30 Thread satyajit vegesna
>
>
> i am trying to compile code using maven ,which was working with spark
> 1.6.2, but when i try for spark 2.0.0 then i get below error,
>
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on
> project NginxLoads-repartition: wrap: 
> org.apache.commons.exec.ExecuteException:
> Process exited with an error: 1 (Exit value: 1)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(
> MojoExecutor.java:212)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(
> MojoExecutor.java:153)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(
> MojoExecutor.java:145)
> at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.
> buildProject(LifecycleModuleBuilder.java:116)
> at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.
> buildProject(LifecycleModuleBuilder.java:80)
> at org.apache.maven.lifecycle.internal.builder.singlethreaded.
> SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.
> execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.codehaus.plexus.classworlds.launcher.Launcher.
> launchEnhanced(Launcher.java:289)
> at org.codehaus.plexus.classworlds.launcher.Launcher.
> launch(Launcher.java:229)
> at org.codehaus.plexus.classworlds.launcher.Launcher.
> mainWithExitCode(Launcher.java:415)
> at org.codehaus.plexus.classworlds.launcher.Launcher.
> main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.MojoExecutionException: wrap:
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1
> (Exit value: 1)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(
> DefaultBuildPluginManager.java:134)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute(
> MojoExecutor.java:207)
> ... 20 more
> Caused by: org.apache.commons.exec.ExecuteException: Process exited with
> an error: 1 (Exit value: 1)
> at org.apache.commons.exec.DefaultExecutor.executeInternal(
> DefaultExecutor.java:377)
> at org.apache.commons.exec.DefaultExecutor.execute(
> DefaultExecutor.java:160)
> at org.apache.commons.exec.DefaultExecutor.execute(
> DefaultExecutor.java:147)
> at scala_maven_executions.JavaMainCallerByFork.run(
> JavaMainCallerByFork.java:100)
> at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:161)
> at scala_maven.ScalaCompilerSupport.doExecute(
> ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> ... 22 more
>
>
> PFB pom.xml that i am using, any help would be appreciated.
>
> 
> http://maven.apache.org/POM/4.0.0;
>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/xsd/maven-4.0.0.xsd;>
> 4.0.0
>
> NginxLoads-repartition
> NginxLoads-repartition
> 1.1-SNAPSHOT
> ${project.artifactId}
> This is a boilerplate maven project to start using Spark in 
> Scala
> 2010
>
> 
> 1.6
> 1.6
> UTF-8
> 2.11
> 2.11
> 
> 2.11.8
> 
>
> 
> 
> 
> cloudera-repo-releases
> https://repository.cloudera.com/artifactory/repo/
> 
> 
>
> 
> src/main/scala
> src/test/scala
> 
> 
> 
> maven-assembly-plugin
> 
> 
> package
> 
> single
> 
> 
> 
> 
> 
> jar-with-dependencies
> 
> 
> 
> 
> org.apache.maven.plugins
> maven-compiler-plugin
> 3.5.1
> 
> 1.7
> 1.7
> 
> 
> 
> 
> net.alchim31.maven
> scala-maven-plugin
> 3.2.2
> 
> 
> 
> 
>   

Re: Design considerations for batch and speed layers

2016-09-30 Thread Ashok Kumar
Can one design a fast pipeline with Kafka, Spark Streaming and HBase, or
something similar?


 

On Friday, 30 September 2016, 17:17, Mich Talebzadeh 
 wrote:
 

 I have designed this prototype for a risk business. Here I would like to
discuss issues with batch layer. Apologies about being long winded.

Business objective
Reduce risk in the credit business while making better credit and trading
decisions. Specifically, to identify risk trends within certain years of
trading data. For example, measure the risk exposure in a given portfolio by
industry, region, credit rating and other parameters. At the macroscopic level,
analyze data across market sectors, over a given time horizon to assess risk
changes.

Deliverable
Enable real time and batch analysis of risk data

Batch technology stack used
Kafka -> zookeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query
tool, Zeppelin

Test volumes for POC
1 message queue (csv format), 100 stock prices streaming in every 2 seconds,
180K prices per hour, 4 million + per day

   - prices to Kafka -> Zookeeper -> Flume -> HDFS
   - HDFS daily partition for that day's data
   - Hive external table looking at HDFS partitioned location
   - Hive managed table populated every 15 minutes via cron from Hive external
     table (table type ORC partitioned by date). This is purely a Hive job. Hive
     table is populated using insert/overwrite for that day to avoid boundary
     value/missing data etc.
   - Typical batch ingestion time (Hive table populated from HDFS files) ~ 2
     minutes
   - Data in Hive table has 15 minutes latency
   - Zeppelin to be used as UI with Spark

Zeppelin will use Spark SQL (on Spark Thrift Server) and Spark shell. Within
Spark shell, users can access batch tables in Hive or they have a choice of
accessing raw data on HDFS files which gives them real time access (not to be
confused with speed layer). Using a typical query with Spark, to see the last 15
minutes of real time data (T-15 to now) takes 1 min. Running the same query (my
typical query, not a user query) on Hive tables, this time using Spark, takes 6
seconds.

However, there are some design concerns:

   - Zeppelin starts slowing down by the end of the day. Sometimes it throws a
     broken pipe message. I resolve this by restarting the Zeppelin daemon.
     Potential show stopper.
   - As the volume of data increases throughout the day, performance becomes an
     issue.
   - Every 15 minutes when the cron starts, Hive insert/overwrites can
     potentially get in conflict with users throwing queries from Zeppelin/Spark.
     I am sure that with exclusive writes, Hive will block all users from
     accessing these tables (at partition level) until the insert overwrite is
     done. This can be improved by better partitioning of Hive tables or relaxing
     ingestion time to half hour or one hour at a cost of more lagging. I tried
     Parquet tables in Hive but really no difference in performance gain. I have
     thought of replacing Hive with HBase etc. but that brings new complications
     in as well without necessarily solving the issue.
   - I am not convinced this design can scale up easily with 5 times more volume
     of data.
   - We will also get real time data from RDBMS tables (Oracle, Sybase, MSSQL)
     using replication technologies such as SAP Replication Server. These
     currently deliver changed log data to Hive tables. So there is some
     compatibility issue here.

So I am sure some members can add useful ideas :)

Thanks

Mich

LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from relying
on this email's technical content is explicitly disclaimed. The author will in
no case be liable for any monetary damages arising from such loss, damage or
destruction.

   

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Kevin Mellott
The documentation details the algorithm being used at
http://spark.apache.org/docs/latest/mllib-decision-tree.html

Thanks,
Kevin

On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty 
wrote:

> Hi,
>
> Any help here is appreciated ..
>
> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty  > wrote:
>
>> Is there a reference to the research paper which is implemented in spark
>> 2.0 ?
>>
>> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty > > wrote:
>>
>>> Which algorithm is used under the covers while doing decision trees FOR
>>> SPARK ?
>>> for example: scikit-learn (python) uses an optimised version of the
>>> CART algorithm.
>>>
>>
>>
>


Re: Dataframe Grouping - Sorting - Mapping

2016-09-30 Thread Kevin Mellott
When you perform a .groupBy, you need to perform an aggregate immediately
afterwards.

For example:

val df1 = df.groupBy("colA").agg(sum(df("colB")))
df1.show()
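
For the original groupBy -> sort -> mapPartitions question, one possible pattern
(only a sketch; the column name and row function are hypothetical, and it assumes
Spark 1.6+) is to repartition by the key and sort within partitions rather than
using GroupedData directly:

import org.apache.spark.sql.functions.col

val processed = df
  .repartition(col("timestamp"))            // co-locate rows with the same key in one partition
  .sortWithinPartitions(col("timestamp"))   // order rows by timestamp inside each partition
  .rdd
  .mapPartitions(rows => rows.map(myRowFunction))  // myRowFunction is a placeholder for your logic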

More information and examples can be found in the documentation below.

http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrame

Thanks,
Kevin

On Fri, Sep 30, 2016 at 5:46 AM, AJT  wrote:

> I'm looking to do the following with my Spark dataframe
> (1) val df1 = df.groupBy()
> (2) val df2 = df1.sort()
> (3) val df3 = df2.mapPartitions()
>
> I can already groupBy the column (in this case a long timestamp) - but have
> no idea how then to ensure the returned GroupedData is then sorted by the
> same timeStamp and the mapped to my set of functions
>
> Appreciate any help
> Thanks
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Dataframe-Grouping-Sorting-Mapping-tp27821.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Having parallelized job inside getPartitions method causes job hanging

2016-09-30 Thread Zhang, Yanyan


Hi there,

My team created a class extending RDD, and in its getPartitions method we have a
parallelized job. We noticed Spark hangs if we do shuffling on our RDD instance.

I'm just wondering if this is a valid use case and if the Spark team could provide
us with some suggestions.

We are using Spark 1.6.0 and here’s our code snippet:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class testRDD(@transient sc: SparkContext)
  extends RDD[(String, Int)](sc, Nil)
with Serializable{

  override def getPartitions: Array[Partition] = {
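    // note: collect() below submits a nested Spark job from inside getPartitions,
    // i.e. while Spark is still resolving this RDD's partitions (the pattern being asked about)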
sc.parallelize(Seq(("a",1),("b",2))).reduceByKey(_+_).collect()

val result = new Array[Partition](4)
for (i <- 0 until 4) {
  result(i) = new Partition {
override def index: Int = 0
  }
}
result
  }

  override def compute(split: Partition, context: TaskContext):
Iterator[(String,Int)] = Seq(("a",3),("b",4)).iterator
}

val y = new testRDD(sc)
y.map(r => r).reduceByKey(_+_).count()

Regards,
Yanyan Zhang






Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-09-30 Thread Marco Mistroni
Hi all
 this problem is still bothering me.
Here's my setup
- Ubuntu 16.06
- Java 8
- Spark 2.0
- I have launched the following command: ./build/mvn -X -Pyarn -Phadoop-2.7
-DskipTests clean package
and I am getting this exception:

org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
(scala-test-compile-first) on project spark-mllib_2.10: Execution
scala-test-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
scala-test-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
Caused by: Compile failed via zinc server
at
sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at
sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at
scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
at
scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
at
scala_maven.ScalaTestCompileMojo.execute(ScalaTestCompileMojo.java:48)
at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)

anyone found a similar error?

kr




On Sat, Sep 3, 2016 at 2:54 PM, Marco Mistroni  wrote:

> hi all
>
>  i am getting failures when building spark 2.0 on Ubuntu 16.06
> Here's details of what i have installed on the ubuntu host
> -  java 8
> - scala 2.11
> - git
>
> When i launch the command
>
> ./build/mvn  -Pyarn -Phadoop-2.7  -DskipTests clean package
>
> everything compiles sort of fine and at the end i get this exception
>
> INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 01:05 h
> [INFO] Finished at: 2016-09-03T13:25:27+00:00
> [INFO] Final Memory: 57M/208M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) on project spark-streaming_2.11: Execution
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> failed. CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/
> PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build 

Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Vadim Semenov
Run more smaller executors: change `spark.executor.memory` to 32g and
`spark.executor.cores` to 2-4, for example.

Changing driver's memory won't help because it doesn't participate in
execution.
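
A minimal sketch of those settings on a SparkConf (the numbers are examples only,
not a recommendation for this particular workload):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "8")   // more executors...
  .set("spark.executor.memory", "32g")    // ...each with less memory
  .set("spark.executor.cores", "4")       // ...and fewer cores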

On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour 
wrote:

> Thank you for your replies.
>
> @Mich, using LIMIT 100 in the query prevents the exception but given the
> fact that there's enough memory, I don't think this should happen even
> without LIMIT.
>
> @Vadim, here's the full stack trace:
>
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
> with more than 17179869176 bytes
> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskM
> emoryManager.java:241)
> at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryCo
> nsumer.java:121)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
> orter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
> orter.insertRecord(UnsafeExternalSorter.java:396)
> at org.apache.spark.sql.execution.UnsafeExternalRowSorter.inser
> tRow(UnsafeExternalRowSorter.java:94)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
> eratedIterator.sort_addToSorter$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
> eratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gen
> eratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(
> BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfu
> n$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.w
> rite(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
> Task.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMap
> Task.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.
> scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
> Executor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
> lExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I'm running spark in local mode so there is only one executor, the driver
> and spark.driver.memory is set to 64g. Changing the driver's memory doesn't
> help.
>
> *Babak Alipour ,*
> *University of Florida*
>
> On Fri, Sep 30, 2016 at 2:05 PM, Vadim Semenov <
> vadim.seme...@datadoghq.com> wrote:
>
>> Can you post the whole exception stack trace?
>> What are your executor memory settings?
>>
>> Right now I assume that it happens in UnsafeExternalRowSorter ->
>> UnsafeExternalSorter:insertRecord
>>
>> Running more executors with lower `spark.executor.memory` should help.
>>
>>
>> On Fri, Sep 30, 2016 at 12:57 PM, Babak Alipour 
>> wrote:
>>
>>> Greetings everyone,
>>>
>>> I'm trying to read a single field of a Hive table stored as Parquet in
>>> Spark (~140GB for the entire table, this single field should be just a few
>>> GB) and look at the sorted output using the following:
>>>
>>> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
>>>
>>> ​But this simple line of code gives:
>>>
>>> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
>>> with more than 17179869176 bytes
>>>
>>> Same error for:
>>>
>>> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
>>>
>>> and:
>>>
>>> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
>>>
>>>
>>> I'm running this on a machine with more than 200GB of RAM, running in
>>> local mode with spark.driver.memory set to 64g.
>>>
>>> I do not know why it cannot allocate a big enough page, and why is it
>>> trying to allocate such a big page in the first place?
>>>
>>> I hope someone with more knowledge of Spark can shed some light on this.
>>> Thank you!
>>>
>>>
>>> *​Best regards,​*
>>> *Babak Alipour ,*
>>> *University of Florida*
>>>
>>
>>
>


Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Babak Alipour
Thank you for your replies.

@Mich, using LIMIT 100 in the query prevents the exception but given the
fact that there's enough memory, I don't think this should happen even
without LIMIT.

@Vadim, here's the full stack trace:

Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
more than 17179869176 bytes
at org.apache.spark.memory.TaskMemoryManager.allocatePage(
TaskMemoryManager.java:241)
at org.apache.spark.memory.MemoryConsumer.allocatePage(
MemoryConsumer.java:121)
at org.apache.spark.util.collection.unsafe.sort.
UnsafeExternalSorter.acquireNewPageIfNecessary(
UnsafeExternalSorter.java:374)
at org.apache.spark.util.collection.unsafe.sort.
UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(
UnsafeExternalRowSorter.java:94)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$
GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$
GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$
GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.
hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$
anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(
BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(
Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I'm running spark in local mode so there is only one executor, the driver
and spark.driver.memory is set to 64g. Changing the driver's memory doesn't
help.

*Babak Alipour ,*
*University of Florida*

On Fri, Sep 30, 2016 at 2:05 PM, Vadim Semenov 
wrote:

> Can you post the whole exception stack trace?
> What are your executor memory settings?
>
> Right now I assume that it happens in UnsafeExternalRowSorter ->
> UnsafeExternalSorter:insertRecord
>
> Running more executors with lower `spark.executor.memory` should help.
>
>
> On Fri, Sep 30, 2016 at 12:57 PM, Babak Alipour 
> wrote:
>
>> Greetings everyone,
>>
>> I'm trying to read a single field of a Hive table stored as Parquet in
>> Spark (~140GB for the entire table, this single field should be just a few
>> GB) and look at the sorted output using the following:
>>
>> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
>>
>> ​But this simple line of code gives:
>>
>> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
>> with more than 17179869176 bytes
>>
>> Same error for:
>>
>> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
>>
>> and:
>>
>> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
>>
>>
>> I'm running this on a machine with more than 200GB of RAM, running in
>> local mode with spark.driver.memory set to 64g.
>>
>> I do not know why it cannot allocate a big enough page, and why is it
>> trying to allocate such a big page in the first place?
>>
>> I hope someone with more knowledge of Spark can shed some light on this.
>> Thank you!
>>
>>
>> *​Best regards,​*
>> *Babak Alipour ,*
>> *University of Florida*
>>
>
>


Re: Restful WS for Spark

2016-09-30 Thread Mahendra Kutare
Try Cloudera Livy https://github.com/cloudera/livy

It may be helpful for your requirement.


Cheers,

Mahendra
about.me/mahendrakutare

~~
Only those who will risk going too far can possibly find out how far one
can go.

On Fri, Sep 30, 2016 at 11:07 AM, ABHISHEK  wrote:

> Hello all,
> Have you tried accessing Spark application using Restful  web-services?
>
> I have requirement where remote user submit the  request with some data,
> it should be sent to Spark and job should run in Hadoop cluster mode.
> Output should be sent back to user.
>
> Please share your  expertise.
> Thanks,
> Abhishek
>


Re: Restful WS for Spark

2016-09-30 Thread gobi s
Hi All,

Here is a sample Spark project which uses REST:
http://techgobi.blogspot.in/2016/09/bigdata-sample-project.html


On Fri, Sep 30, 2016 at 11:39 PM, Vadim Semenov  wrote:

> There're two REST job servers that work with spark:
>
> https://github.com/spark-jobserver/spark-jobserver
>
> https://github.com/cloudera/livy
>
>
> On Fri, Sep 30, 2016 at 2:07 PM, ABHISHEK  wrote:
>
>> Hello all,
>> Have you tried accessing Spark application using Restful  web-services?
>>
>> I have requirement where remote user submit the  request with some data,
>> it should be sent to Spark and job should run in Hadoop cluster mode.
>> Output should be sent back to user.
>>
>> Please share your  expertise.
>> Thanks,
>> Abhishek
>>
>
>


-- 

Gobi.S


Re: Restful WS for Spark

2016-09-30 Thread Vadim Semenov
There're two REST job servers that work with spark:

https://github.com/spark-jobserver/spark-jobserver

https://github.com/cloudera/livy


On Fri, Sep 30, 2016 at 2:07 PM, ABHISHEK  wrote:

> Hello all,
> Have you tried accessing Spark application using Restful  web-services?
>
> I have requirement where remote user submit the  request with some data,
> it should be sent to Spark and job should run in Hadoop cluster mode.
> Output should be sent back to user.
>
> Please share your  expertise.
> Thanks,
> Abhishek
>


Restful WS for Spark

2016-09-30 Thread ABHISHEK
Hello all,
Have you tried accessing a Spark application using RESTful web services?

I have a requirement where a remote user submits a request with some data; it
should be sent to Spark and the job should run in Hadoop cluster mode. The output
should be sent back to the user.

Please share your  expertise.
Thanks,
Abhishek


Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Vadim Semenov
Can you post the whole exception stack trace?
What are your executor memory settings?

Right now I assume that it happens in UnsafeExternalRowSorter ->
UnsafeExternalSorter:insertRecord

Running more executors with lower `spark.executor.memory` should help.


On Fri, Sep 30, 2016 at 12:57 PM, Babak Alipour 
wrote:

> Greetings everyone,
>
> I'm trying to read a single field of a Hive table stored as Parquet in
> Spark (~140GB for the entire table, this single field should be just a few
> GB) and look at the sorted output using the following:
>
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
>
> ​But this simple line of code gives:
>
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
> with more than 17179869176 bytes
>
> Same error for:
>
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
>
> and:
>
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
>
>
> I'm running this on a machine with more than 200GB of RAM, running in
> local mode with spark.driver.memory set to 64g.
>
> I do not know why it cannot allocate a big enough page, and why is it
> trying to allocate such a big page in the first place?
>
> I hope someone with more knowledge of Spark can shed some light on this.
> Thank you!
>
>
> *​Best regards,​*
> *Babak Alipour ,*
> *University of Florida*
>


Re: DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Mich Talebzadeh
What will happen if you LIMIT the result set to 100 rows only, i.e. select the
field from the table, order by the field, LIMIT 100. Will that work?

How about running the whole query WITHOUT order by?
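
In terms of the snippet from the original post, that would be something like
(sketch only):

sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC LIMIT 100")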

HTH



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 September 2016 at 17:57, Babak Alipour 
wrote:

> Greetings everyone,
>
> I'm trying to read a single field of a Hive table stored as Parquet in
> Spark (~140GB for the entire table, this single field should be just a few
> GB) and look at the sorted output using the following:
>
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
>
> ​But this simple line of code gives:
>
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page
> with more than 17179869176 bytes
>
> Same error for:
>
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
>
> and:
>
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
>
>
> I'm running this on a machine with more than 200GB of RAM, running in
> local mode with spark.driver.memory set to 64g.
>
> I do not know why it cannot allocate a big enough page, and why is it
> trying to allocate such a big page in the first place?
>
> I hope someone with more knowledge of Spark can shed some light on this.
> Thank you!
>
>
> *​Best regards,​*
> *Babak Alipour ,*
> *University of Florida*
>


DataFrame Sort gives Cannot allocate a page with more than 17179869176 bytes

2016-09-30 Thread Babak Alipour
Greetings everyone,

I'm trying to read a single field of a Hive table stored as Parquet in
Spark (~140GB for the entire table, this single field should be just a few
GB) and look at the sorted output using the following:

sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")

​But this simple line of code gives:

Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
more than 17179869176 bytes

Same error for:

sql("SELECT " + field + " FROM MY_TABLE).sort(field)

and:

sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)


I'm running this on a machine with more than 200GB of RAM, running in local
mode with spark.driver.memory set to 64g.

I do not know why it cannot allocate a big enough page, and why is it
trying to allocate such a big page in the first place?

I hope someone with more knowledge of Spark can shed some light on this.
Thank you!


*​Best regards,​*
*Babak Alipour ,*
*University of Florida*


Design considerations for batch and speed layers

2016-09-30 Thread Mich Talebzadeh
I have designed this prototype for a risk business. Here I would like to
discuss issues with batch layer. *Apologies about being long winded.*

*Business objective*

Reduce risk in the credit business while making better credit and trading
decisions. Specifically, to identify risk trends within certain years of
trading data. For example, measure the risk exposure in a given portfolio by
industry, region, credit rating and other parameters. At the macroscopic
level, analyze data across market sectors, over a given time horizon to
assess risk changes


*Deliverable*
Enable real time and batch analysis of risk data

*Batch technology stack used*
Kafka -> zookeeper, Flume, HDFS (raw data), Hive, cron, Spark as the query
tool, Zeppelin

*Test volumes for POC*
1 message queue (csv format), 100 stock prices streaming in every 2 seconds,
180K prices per hour, 4 million + per day



   1. prices to Kafka -> Zookeeper -> Flume -> HDFS
   2. HDFS daily partition for that day's data
   3. Hive external table looking at HDFS partitioned location
   4. Hive managed table populated every 15 minutes via cron from Hive
   external table (table type ORC partitioned by date). This is purely a Hive
   job. Hive table is populated using insert/overwrite for that day to
   avoid boundary value/missing data etc.
   5. Typical batch ingestion time (Hive table populated from HDFS files) ~
   2 minutes
   6. Data in Hive table has 15 minutes latency
   7. Zeppelin to be used as UI with Spark


Zeppelin will use Spark SQL (on Spark Thrift Server) and Spark shell.
Within Spark shell, users can access batch tables in Hive *or* they have a
choice of accessing raw data on HDFS files which gives them *real time
access* (not to be confused with speed layer). Using a typical query with
Spark, to see the last 15 minutes of real time data (T-15 to now) takes 1
min. Running the same query (my typical query, not a user query) on Hive
tables, this time using Spark, takes 6 seconds.
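
As a rough illustration of the two access paths (table and path names below are
hypothetical, and the snippet assumes Spark 2.0 with its built-in CSV reader):

// Batch layer: query the ORC-backed Hive table (roughly 15 minutes behind)
val batch = spark.sql("SELECT * FROM risk.prices WHERE trade_date = current_date()")

// "Real time" path: read today's raw CSV files landed by Flume on HDFS
val raw = spark.read.option("inferSchema", "true").csv("/data/prices/2016-09-30/*")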

However, there are some  design concerns:


   1. Zeppelin starts slowing down by the end of the day. Sometimes it throws
   a broken pipe message. I resolve this by restarting the Zeppelin daemon.
   Potential show stopper.
   2. As the volume of data increases throughout the day, performance
   becomes an issue
   3. Every 15 minutes when the cron starts, Hive insert/overwrites can
   potentially get in conflict with users throwing queries from
   Zeppelin/Spark. I am sure that with exclusive writes, Hive will block all
   users from accessing these tables (at partition level) until insert
   overwrite is done. This can be improved by better partitioning of Hive
   tables or relaxing ingestion time to half hour or one hour at a cost of
   more lagging. I tried Parquet tables in Hive but really no difference in
   performance gain. I have thought of replacing Hive with HBase etc. but that
   brings new complications in as well without necessarily solving the issue.
   4. I am not convinced this design can scale up easily with 5 times more
   volume of data.
   5. We will also get real time data from RDBMS tables (Oracle, Sybase,
   MSSQL) using replication technologies such as SAP Replication Server. These
   currently deliver changed log data to Hive tables. So there is some
   compatibility issue here.


So I am sure some members can add useful ideas :)

Thanks

Mich




LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


get different results when debugging and running scala program

2016-09-30 Thread chen yong
Hello All,



I am using IDEA 15.0.4 to debug a Scala program. It is strange to me that the
results are different when I debug and when I run the program. The differences can
be seen in the attached files run.jpg and debug.jpg. The code lines of the Scala
program are shown below.


Thank you all


---

import scala.collection.mutable.ArrayBuffer

object TestCase1 {
  def func(test: Iterator[(Int, Long)]): Iterator[(Int, Long)] = {
    println("in")
    val test1 = test.flatMap {
      case (item, count) =>
        val newPrefix = item
        println(count)
        val a = Iterator.single((newPrefix, count))
        func(a)
        val c = a
        c
    }
    test1
  }
  def main(args: Array[String]) {
    val freqItems = ArrayBuffer((2, 3L), (3, 2L), (4, 1L))
    val test = freqItems.toIterator
    val result = func(test)
    val reer = result.toArray
  }
}





Replying same post with proper formatting. - sorry for extra mail

2016-09-30 Thread vatsal
In my Spark Streaming application I am reading data from a certain Kafka topic.
While reading from the topic, whenever I encounter a certain message (for example:
"poison") I want to stop the streaming. Currently I am achieving this using the
following code; jsc is an instance of JavaStreamingContext and directStream is an
instance of JavaPairInputDStream.

LongAccumulator poisonNotifier = sc.longAccumulator("poisonNotifier");

directStream.foreachRDD(rdd -> {
    RDD rows = rdd.values().map(value -> {
        if (value.equals("poison")) {
            poisonNotifier.add(1);
        } else {
            ...
        }
        return row;
    }).rdd();
});

jsc.start();
ExecutorService poisonMonitor = Executors.newSingleThreadExecutor();
poisonMonitor.execute(() -> {
    while (true) {
        if (poisonNotifier.value() > 0) {
            jsc.stop(false, true);
            break;
        }
    }
});
try {
    jsc.awaitTermination();
} catch (InterruptedException e) {
    e.printStackTrace();
}
poisonMonitor.shutdown();

Although this approach is working, it doesn't sound like the right approach
to me. Is there any better (cleaner) way to achieve the same?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Stopping-spark-steaming-context-on-encountering-certain-type-of-message-on-Kafka-tp27822p27823.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Stopping spark steaming context on encountering certain type of message on Kafka

2016-09-30 Thread vatsal
In my Spark Streaming application I am reading data from a certain Kafka topic.
While reading from the topic, whenever I encounter a certain message (for example:
"poison") I want to stop the streaming. Currently I am achieving this using the
following code; jsc is an instance of JavaStreamingContext and directStream is an
instance of JavaPairInputDStream.

LongAccumulator poisonNotifier = sc.longAccumulator("poisonNotifier");

directStream.foreachRDD(rdd -> {
    RDD rows = rdd.values().map(value -> {
        if (value.equals("poison")) {
            poisonNotifier.add(1);
        } else {
            ...
        }
        return row;
    }).rdd();
});

jsc.start();
ExecutorService poisonMonitor = Executors.newSingleThreadExecutor();
poisonMonitor.execute(() -> {
    while (true) {
        if (poisonNotifier.value() > 0) {
            jsc.stop(false, true);
            break;
        }
    }
});
try {
    jsc.awaitTermination();
} catch (InterruptedException e) {
    e.printStackTrace();
}
poisonMonitor.shutdown();

Although this approach is working, it doesn't sound like the right approach to me.
Is there any better (cleaner) way to achieve the same?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Stopping-spark-steaming-context-on-encountering-certain-type-of-message-on-Kafka-tp27822.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Grouped windows in spark streaming

2016-09-30 Thread Adrienne Kole
Hi all,

I am using Spark Streaming for my use case.
I want to
- partition or group the stream by key
- window the tuples in partitions
 and - find max/min element in windows (in every partition)

My code is like:

val keyedStream = socketDataSource.map(s => (s.key, s.value))   // define key and value
val aggregatedStream = keyedStream
  .groupByKeyAndWindow(Milliseconds(8000), Milliseconds(4000))  // partition stream by key and window
  .map(window => minMaxTuples(window))                          // reduce the window to find the max/min element



In the minMaxTuples function I use window._2.toArray.maxBy/minBy to find the
max/min element.

Maybe it is not the right way to do this (if not, please correct me), but what I
realize inside the minMaxTuples function is that we are not reusing previously
computed results.

So, if in minibatch-n we have a keyed window {key-1, [a,b,c,d,e]} and
we iterate over all elements ([a,b,c,d,e]) to find the result, in the next
minibatch (minibatch-n+1) we may have {key-1, [c,d,e,f,g]}, in which
[c,d,e] are overlapping.
So, especially for large windows, this can be a significant performance issue,
I think.
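
One way to avoid materialising each window as an array, assuming only the max (or
min) per key is needed and the values are numeric, is reduceByKeyAndWindow, which
combines values pairwise; for invertible aggregations (e.g. sums) its variant with
an inverse reduce function can also reuse the previous window's result. A minimal
sketch (Int values assumed):

val maxPerKey = keyedStream.reduceByKeyAndWindow(
  (a: Int, b: Int) => math.max(a, b),
  Milliseconds(8000), Milliseconds(4000))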

Any solution for this?


Thanks
Adrienne


Re: SPARK CREATING EXTERNAL TABLE

2016-09-30 Thread Mich Talebzadeh
This should work

Spark 2.0.0,  Hive 2.0.1

//create external table in a Hive database with CTAS

scala> spark.sql(""" CREATE EXTERNAL TABLE test.extPrices LOCATION
"/tmp/extPrices" AS SELECT * FROM test.prices LIMIT 5""")
res4: org.apache.spark.sql.DataFrame = []


Now if I go to Hive and look at that table, I see

hive> describe formatted test.extPrices;
OK
# col_name  data_type   comment
timeinserted    string
index   int
Location:   hdfs://rhes564:9000/tmp/extPrices
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNALTRUE
numFiles1
totalSize   292
transient_lastDdlTime   1475243290
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.TextInputFormat
OutputFormat:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets:-1
Bucket Columns: []
Sort Columns:   []
Storage Desc Params:


Defined as external.

HTH





Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 September 2016 at 12:40, Trinadh Kaja  wrote:

> Hi All,
>
> I am facing different problem using spark,
>
> i am using spark-sql.
>
> below are the details,
>
> sqlcontext.sql("""create external table  location '/' as select *
> from XXX""" )
>
> this is my query
> table success fully done but in hive command
>
> describe formatted 
>
> showing MANAGETABLE, totally confused,
>
> data loaded in hdfs path successfully,
>
> why hive showing manage table,
>
> i am missing any thing,plz tell me
>
> --
> Thanks
> K.Trinadh
> Ph-7348826118
>


Re: YARN - Pyspark

2016-09-30 Thread ayan guha
I understand, thank you for the explanation. However, I ran in yarn-client
mode, submitted using nohup, and I could see the logs getting into the log file
throughout the life of the job. Everything worked well on the Spark side;
YARN just reported success long before the job actually completed. I would love
to understand if I am missing anything here.

On Fri, Sep 30, 2016 at 8:32 PM, Timur Shenkao  wrote:

> It's not weird behavior. Did you run the job in cluster mode?
> I suspect your driver died / finished / stopped after 12 hours but your
> job continued. It's possible as you didn't output anything to console on
> driver node.
>
> Quite long time ago, when I just tried Spark Streaming, I launched PySpark
> Streaming jobs in PyCharm & pyspark console and "killed" them via Ctrl+Z
> Drivers were gone but YARN containers (where computations on slaves were
> performed) remained.
> Nevertheless, I believe that final result in "some table" is corrupted
>
> On Fri, Sep 30, 2016 at 9:33 AM, ayan guha  wrote:
>
>> Hi
>>
>> I just observed a little weird behavior:
>>
>> I ran a pyspark job, very simple one.
>>
>> conf = SparkConf()
>> conf.setAppName("Historical Meter Load")
>> conf.set("spark.yarn.queue","root.Applications")
>> conf.set("spark.executor.instances","50")
>> conf.set("spark.executor.memory","10g")
>> conf.set("spark.yarn.executor.memoryOverhead","2048")
>> conf.set("spark.sql.shuffle.partitions",1000)
>> conf.set("spark.executor.cores","4")
>> sc = SparkContext(conf = conf)
>> sqlContext = HiveContext(sc)
>>
>> df = sqlContext.sql("some sql")
>>
>> c = df.count()
>>
>> df.filter(df["RNK"] == 1).saveAsTable("some table").mode("overwrite")
>>
>> sc.stop()
>>
>> running is on CDH 5.7 cluster, Spark 1.6.0.
>>
>> Behavior observed: After few hours of running (definitely over 12H, but
>> not sure exacly when), Yarn reported job as Completed, finished
>> successfully, whereas the job kept running (I can see from Application
>> master link) for 22H. Timing of the job is expected. Behavior of YARN is
>> not.
>>
>> Is it a known issue? Is it a pyspark specific issue or same with scala as
>> well?
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha


SPARK CREATING EXTERNAL TABLE

2016-09-30 Thread Trinadh Kaja
Hi All,

I am facing a different problem using Spark.

I am using spark-sql.

Below are the details:

sqlcontext.sql("""create external table  location '/' as select *
from XXX""" )

This is my query. The table was created successfully, but in Hive the command

describe formatted 

shows MANAGED_TABLE. I am totally confused.

The data was loaded into the HDFS path successfully.

Why is Hive showing a managed table? Am I missing anything? Please tell me.

-- 
Thanks
K.Trinadh
Ph-7348826118


Dataframe Grouping - Sorting - Mapping

2016-09-30 Thread AJT
I'm looking to do the following with my Spark dataframe
(1) val df1 = df.groupBy()
(2) val df2 = df1.sort()
(3) val df3 = df2.mapPartitions()

I can already groupBy the column (in this case a long timestamp), but have
no idea how to ensure the returned GroupedData is then sorted by the
same timestamp and then mapped to my set of functions.

Appreciate any help
Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-Grouping-Sorting-Mapping-tp27821.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: YARN - Pyspark

2016-09-30 Thread Timur Shenkao
It's not weird behavior. Did you run the job in cluster mode?
I suspect your driver died / finished / stopped after 12 hours but your job
continued. That is possible because you didn't output anything to the console
on the driver node.

Quite a long time ago, when I first tried Spark Streaming, I launched PySpark
Streaming jobs from PyCharm and the pyspark console and "killed" them via
Ctrl+Z. The drivers were gone, but the YARN containers (where the computations
on the slaves were performed) remained.
Nevertheless, I believe the final result in "some table" is corrupted.

On Fri, Sep 30, 2016 at 9:33 AM, ayan guha  wrote:

> Hi
>
> I just observed a little weird behavior:
>
> I ran a pyspark job, a very simple one.
>
> conf = SparkConf()
> conf.setAppName("Historical Meter Load")
> conf.set("spark.yarn.queue","root.Applications")
> conf.set("spark.executor.instances","50")
> conf.set("spark.executor.memory","10g")
> conf.set("spark.yarn.executor.memoryOverhead","2048")
> conf.set("spark.sql.shuffle.partitions",1000)
> conf.set("spark.executor.cores","4")
> sc = SparkContext(conf = conf)
> sqlContext = HiveContext(sc)
>
> df = sqlContext.sql("some sql")
>
> c = df.count()
>
> df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")
>
> sc.stop()
>
> This is running on a CDH 5.7 cluster, Spark 1.6.0.
>
> Behavior observed: After a few hours of running (definitely over 12H, but
> not sure exactly when), YARN reported the job as Completed and finished
> successfully, whereas the job kept running (I can see it from the
> Application Master link) for 22H. The timing of the job is expected; the
> behavior of YARN is not.
>
> Is it a known issue? Is it a PySpark-specific issue, or is it the same
> with Scala as well?
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-30 Thread Steve Loughran

On 29 Sep 2016, at 10:37, Olivier Girardot wrote:

I know that the code itself would not be the same, but it would be useful to at
least have the pom/build.sbt transitive dependencies differ when fetching the
artifact with a specific classifier, don't you think?
For now I've overridden them myself, using the dependency versions defined in
the pom.xml of Spark.
So it's not a blocker issue; it may be useful to document it, but a blog post
would be sufficient, I think.



The problem here is that it's not something the Maven repository is directly set up
to deal with. What could be done is to publish multiple pom-only artifacts, e.g. a
spark-scala-2.11-hadoop-2.6.pom, each declaring the transitive dependencies
appropriately for the right Hadoop version. You wouldn't need to actually rebuild
everything, just declare a dependency on the Spark 2.x artifacts, excluding all of
Hadoop 2.2 and pulling in 2.6.
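
As a hedged illustration of that manual override, a build.sbt sketch (the Spark
and Hadoop versions below are only examples, not a published artifact):

libraryDependencies ++= Seq(
  // Depend on Spark but strip its transitive Hadoop client...
  ("org.apache.spark" %% "spark-core" % "2.0.0")
    .exclude("org.apache.hadoop", "hadoop-client"),
  ("org.apache.spark" %% "spark-sql" % "2.0.0")
    .exclude("org.apache.hadoop", "hadoop-client"),
  // ...and pin the Hadoop client you actually run against.
  "org.apache.hadoop" % "hadoop-client" % "2.6.0"
)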

This wouldn't even need to be an org.apache.spark artifact, just something anyone
could build and publish under their own name.

Volunteers?


RE: udf of aggregation in pyspark dataframe ?

2016-09-30 Thread Mendelson, Assaf
I may be missing something here, but it seems to me you can do it like this:

df.groupBy('a') \
  .agg(F.collect_list('c').alias('c_list'),
       F.collect_list('d').alias('d_list')) \
  .withColumn('named_list', my_zip(F.col('c_list'), F.col('d_list')))

without needing to write a new aggregation function.

-Original Message-
From: peng yu [mailto:yupb...@gmail.com] 
Sent: Thursday, September 29, 2016 8:35 PM
To: user@spark.apache.org
Subject: Re: udf of aggregation in pyspark dataframe ?

df:  
-
a|b|c
---
1|m|n
1|x|j
2|m|x
...


import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType

def _my_zip(c, d):
    return dict(zip(c, d))

my_zip = F.udf(_my_zip, MapType(StringType(), StringType()))

df.groupBy('a').agg(my_zip(F.collect_list('c'),
                           F.collect_list('d')).alias('named_list'))



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/udf-of-aggregation-in-pyspark-dataframe-tp27811p27814.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Lots of spark-assembly jars localized to /usercache/username/filecache directory

2016-09-30 Thread Lantao Jin
Hi,
Our Spark is deployed on YARN, and I found lots of spark-assembly jars in the
filecache directory of a heavy Spark user (i.e.
/usercache/username/filecache). As you know, the assembly jar is bigger than
100 MB before Spark v2, so all of these copies take up 26 GB (1/4 of the
reserved space) on most of the DataNodes. And that's only one heavy Spark user;
we have lots of them.


BTW, I didn't put the assembly jar on HDFS. It is in $SPARK_HOME/lib and is
submitted every time.

Best regards
Alan


Re: compatibility issue with Jersey2

2016-09-30 Thread SimonL
Hi, I'm a new subscriber. Has there been any solution to the issue below?

Many thanks,
Simon



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/compatibility-issue-with-Jersey2-tp24951p27820.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



YARN - Pyspark

2016-09-30 Thread ayan guha
Hi

I just observed a little weird behavior:

I ran a pyspark job, a very simple one.

conf = SparkConf()
conf.setAppName("Historical Meter Load")
conf.set("spark.yarn.queue","root.Applications")
conf.set("spark.executor.instances","50")
conf.set("spark.executor.memory","10g")
conf.set("spark.yarn.executor.memoryOverhead","2048")
conf.set("spark.sql.shuffle.partitions",1000)
conf.set("spark.executor.cores","4")
sc = SparkContext(conf = conf)
sqlContext = HiveContext(sc)

df = sqlContext.sql("some sql")

c = df.count()

df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")

sc.stop()

This is running on a CDH 5.7 cluster, Spark 1.6.0.

Behavior observed: After a few hours of running (definitely over 12H, but not
sure exactly when), YARN reported the job as Completed and finished
successfully, whereas the job kept running (I can see it from the Application
Master link) for 22H. The timing of the job is expected; the behavior of YARN
is not.

Is it a known issue? Is it a PySpark-specific issue, or is it the same with
Scala as well?


-- 
Best Regards,
Ayan Guha


Re: spark listener do not get fail status

2016-09-30 Thread Aseem Bansal
Hi

In case my previous email was lacking in details, here are some more:

- using Spark 2.0.0
- launching the job using
org.apache.spark.launcher.SparkLauncher.startApplication(myListener)
- checking the state in the listener's stateChanged method (a hedged sketch of
this wiring follows the list)
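
For reference, a minimal sketch of that wiring (not the original code: the jar
path, main class, and master settings are placeholders), showing where a FAILED
state would surface through SparkAppHandle:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    // Listener that simply logs every state transition reported by the launcher.
    val listener = new SparkAppHandle.Listener {
      override def stateChanged(handle: SparkAppHandle): Unit = {
        val state = handle.getState
        println(s"state changed: $state (final = ${state.isFinal})")
        if (state == SparkAppHandle.State.FAILED) {
          println("application reported as FAILED")
        }
      }
      override def infoChanged(handle: SparkAppHandle): Unit = ()
    }

    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // placeholder
      .setMainClass("com.example.MyApp")    // placeholder
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication(listener)

    // Block until the launcher reports a terminal state (FINISHED, FAILED, KILLED, LOST).
    while (!handle.getState.isFinal) {
      Thread.sleep(1000)
    }
    println(s"final state: ${handle.getState}")
  }
}

Note that the launcher reports the state of the driver process it monitors, so
whether an uncaught exception or System.exit(-1) in the application surfaces as
FAILED can depend on the deploy mode.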


On Thu, Sep 29, 2016 at 5:24 PM, Aseem Bansal  wrote:

> Hi
>
> Submitting the job via the Spark API, but I never get a fail status even when
> the job throws an exception or exits via System.exit(-1).
>
> How do I indicate via the SparkListener API that my job failed?
>


Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread janardhan shetty
Hi,

Any help here is appreciated.

On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty wrote:

> Is there a reference to the research paper that is implemented in Spark 2.0?
>
> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty wrote:
>
>> Which algorithm is used under the covers for decision trees in Spark?
>> For example, scikit-learn (Python) uses an optimised version of the CART
>> algorithm.
>>
>
>