Locality aware tree reduction

2016-05-04 Thread aymkhalil
Hello,

Is there a way to instruct treeReduce() to reduce RDD partitions on the same
node locally?

In my case, I'm using treeReduce() to reduce map results in parallel. My
reduce function is just arithmetically adding map values (i.e. no notion of
aggregation by key). As far as I understand, a shuffle will happen at each
treeReduce() stage using a hash partitioner with the RDD partition index as
input. I would like to enforce RDD partitions on the same node to be reduced
locally (i.e. no shuffling) and only shuffle when each node has one RDD
partition left and before the results are sent to the driver. 

I have a few nodes and lots of partitions, so I think this will give better
performance.
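
There is no built-in locality hint for treeReduce(), but here is a minimal sketch of one workaround, assuming an RDD[Double] and a known node count (both names below are made up; coalesce() only prefers, and does not guarantee, co-located merging):

import org.apache.spark.rdd.RDD

// Sketch: collapse each partition to a single partial sum without shuffling,
// merge partitions down to roughly one per node (coalesce with shuffle = false
// avoids a shuffle and prefers co-located partitions), then send about one
// value per node to the driver for the final combine.
def locallyCombinedSum(rdd: RDD[Double], numNodes: Int): Double = {
  val perPartition = rdd.mapPartitions(it => Iterator(it.sum)) // one value per partition
  perPartition
    .coalesce(numNodes, shuffle = false)                       // merge without shuffling
    .reduce(_ + _)                                             // final combine on the driver
}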

Thank you,
Ayman



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Locality-aware-tree-reduction-tp26885.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: DeepSpark: where to start

2016-05-04 Thread Derek Chan

The blog post is an April Fools' joke. Read the last line in the post:

https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html



On Thursday, May 05, 2016 10:42 AM, Joice Joy wrote:
I am trying to find info on DeepSpark. I read the article on the
Databricks blog, which doesn't mention a git repo but does say it's open
source.

Help me find the git repo for this. I found two and I am not sure which one is
the Databricks DeepSpark:
https://github.com/deepspark/deepspark
https://github.com/nearbydelta/deepspark






ArrayIndexOutOfBoundsException in model selection via cross-validation sample with spark 1.6.1

2016-05-04 Thread Terry Hoo
All,

I hit an ArrayIndexOutOfBoundsException when running the model selection via
cross-validation sample with Spark 1.6.1. Did anyone else meet this before? How
can it be resolved?

Call stack:
java.lang.ArrayIndexOutOfBoundsException: 1
at
org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:343)
at
org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at
org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:144)
at
org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:140)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at
scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:140)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:91)
at org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
at
org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
at
org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.ml.Estimator.fit(Estimator.scala:78)
at
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:104)
at
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
at $iwC$$iwC$$iwC.<init>(<console>:69)
at $iwC$$iwC.<init>(<console>:71)
at $iwC.<init>(<console>:73)
at <init>(<console>:75)
at .<init>(<console>:79)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2.apply(SparkILoop.scala:680)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2.apply(SparkILoop.scala:677)
at
scala.reflect.io.Streamable$Chars$class.applyReader(Streamable.scala:104)
at scala.reflect.io.File.applyReader(File.scala:82)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$interpretAllFrom$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SparkILoop.scala:677)
at

Re: DeepSpark: where to start

2016-05-04 Thread Ted Yu
Did you notice the date of the blog :-) ?

On Wed, May 4, 2016 at 7:42 PM, Joice Joy  wrote:

> I am trying to find info on DeepSpark. I read the article on the Databricks
> blog, which doesn't mention a git repo but does say it's open source.
> Help me find the git repo for this. I found two and I am not sure which one is
> the Databricks DeepSpark:
> https://github.com/deepspark/deepspark
> https://github.com/nearbydelta/deepspark
>


DeepSpark: where to start

2016-05-04 Thread Joice Joy
I am trying to find info on DeepSpark. I read the article on the Databricks
blog, which doesn't mention a git repo but does say it's open source.
Help me find the git repo for this. I found two and I am not sure which one is
the Databricks DeepSpark:
https://github.com/deepspark/deepspark
https://github.com/nearbydelta/deepspark


Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-04 Thread Divya Gehlot
Hi,

My Javac version

C:\Users\Divya>javac -version
javac 1.7.0_79

C:\Users\Divya>java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

Do I need to use a higher version?


Thanks,
Divya

On 4 May 2016 at 21:31, sunday2000 <2314476...@qq.com> wrote:

> Check your javac version, and update it.
>
>
> -- Original message --
> *From:* "Divya Gehlot";;
> *Sent:* Wednesday, May 4, 2016, 11:25 AM
> *To:* "sunday2000"<2314476...@qq.com>;
> *Cc:* "user"; "user";
> *Subject:* Re: spark 1.6.1 build failure of : scala-maven-plugin
>
> Hi,
> I am getting a similar error:
> Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> when I tried to build the Phoenix project using Maven.
> Maven version: 3.3
> Java version: 1.7_67
> Phoenix: downloaded latest master from GitHub
> If anybody finds the resolution, please share.
>
>
> Thanks,
> Divya
>
> On 3 May 2016 at 10:18, sunday2000 <2314476...@qq.com> wrote:
>
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 14.765 s
>> [INFO] Finished at: 2016-05-03T10:08:46+08:00
>> [INFO] Final Memory: 35M/191M
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
>> on project spark-test-tags_2.10: Execution scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
>> -> [Help 1]
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
>> (scala-compile-first) on project spark-test-tags_2.10: Execution
>> scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
>> at
>> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
>> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
>> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
>> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
>> at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
>> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
>> scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>> at
>> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
>> ... 20 more
>> Caused by: Compile failed via zinc server
>> at
>> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
>> at
>> sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
>> at
>> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
>> at
>> scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
>> at
>> scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
>> at 

Re: Do I need to install Cassandra node on Spark Master node to work with Cassandra?

2016-05-04 Thread Yogesh Mahajan
You can have a Spark master where Cassandra is not running locally; I have
tried this before.
The Spark cluster and the Cassandra cluster can be on two different hosts, but to
colocate, you can have both an executor and a Cassandra node on the same host.
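
For reference, a minimal sketch of that setup, assuming the DataStax spark-cassandra-connector is packaged with the application (the host, keyspace and table names below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // spark-cassandra-connector

// Sketch: the driver only needs to reach a Cassandra contact point; reads run
// on the executors, and are data-local only where an executor happens to be
// co-located with a Cassandra node.
val conf = new SparkConf()
  .setAppName("cassandra-read-sketch")
  .set("spark.cassandra.connection.host", "10.0.0.5") // placeholder contact point

val sc = new SparkContext(conf)
val rows = sc.cassandraTable("my_keyspace", "my_table") // placeholder keyspace/table
println(rows.count())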


Thanks,
http://www.snappydata.io/blog 

On Thu, May 5, 2016 at 6:06 AM, Vinayak Agrawal 
wrote:

> Hi All,
> I am working with a Cassandra cluster and moving towards installing Spark.
> However, I came across this Stackoverflow question which has confused me.
>
> http://stackoverflow.com/questions/33897586/apache-spark-driver-instead-of-just-the-executors-tries-to-connect-to-cassand
>
> Question:
> Do I need to install a Cassandra node on my Spark master node so that Spark
> can connect with Cassandra,
> or
> does Cassandra only need to be on the Spark worker nodes? That seems logical
> considering data locality.
>
> Thanks
>
> --
> Vinayak Agrawal
>
>
> "To Strive, To Seek, To Find and Not to Yield!"
> ~Lord Alfred Tennyson
>


Do I need to install Cassandra node on Spark Master node to work with Cassandra?

2016-05-04 Thread Vinayak Agrawal
Hi All,
I am working with a Cassandra cluster and moving towards installing Spark.
However, I came across this Stackoverflow question which has confused me.
http://stackoverflow.com/questions/33897586/apache-spark-driver-instead-of-just-the-executors-tries-to-connect-to-cassand

Question:
Do I need to install a Cassandra node on my Spark master node so that Spark
can connect with Cassandra,
or
does Cassandra only need to be on the Spark worker nodes? That seems logical
considering data locality.

Thanks

-- 
Vinayak Agrawal


"To Strive, To Seek, To Find and Not to Yield!"
~Lord Alfred Tennyson


Re: yarn-cluster

2016-05-04 Thread nsalian
Hi,

This is a good spot to start for Spark on YARN:
https://spark.apache.org/docs/1.5.0/running-on-yarn.html

You can switch between pages for the specific version you are on.



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-cluster-tp26846p26882.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark standalone workers, executors and JVMs

2016-05-04 Thread Mich Talebzadeh
Hi,

More cores without getting the memory-per-core ratio right can result in more
queuing and hence more contention, as was evident from the earlier published
results.

I had a discussion with one of the Spark experts who claimed one should have
one executor per server and then get parallelism via the number of cores, but I
am still not convinced. I would still go for multiple containers on the
master/primary. The crucial factor here is memory per core. As I understand
from tests, the rule of thumb is to add more CPUs/cores only once the existing
cores are above 80% utilization. Unless one is reaching those numbers after
optimizing parallelism, adding more and more cores is redundant, so the ratio
of memory per core becomes very relevant. Of course, if one could switch to
faster CPUs and keep all other factors the same, I expect one would see better
performance immediately, though again that comes at a cost. If the entire box
is busy, adding enough cores to keep YARN in its own little world, without
having to fight other OS processes for cores, should help. Having said that, in
my opinion more CPUs/cores is *not* better unless you have serious CPU
contention in the first place. Faster is better.

Back to your points: if you run 6 workers with 32GB each and assume 24 cores,
we are talking about each worker being allocated 24 cores, and a
memory-per-core ratio of 32/24 does not sound that great. An alternative
approach would be to reduce the number of workers to 4 to give a better ratio
of 48/24.

Your mileage will vary depending on what the application will be doing. I
would just test it to get the best fit.


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 4 May 2016 at 15:39, Simone Franzini  wrote:

> Hi Mohammed,
>
> Thanks for your reply. I agree with you, however a single application can
> use multiple executors as well, so I am still not clear which option is
> best. Let me make an example to be a little more concrete.
>
> Let's say I am only running a single application. Let's assume again that
> I have 192GB of memory and 24 cores on each node. Which one of the
> following two options is best and why:
> 1. Running 6 workers with 32GB each and 1 executor/worker (i.e. set
> SPARK_WORKER_INSTANCES=6, leave spark.executor.cores to its default, which
> is to assign all available cores to an executor in standalone mode).
> 2. Running 1 worker with 192GB memory and 6 executors/worker (i.e.
> SPARK_WORKER_INSTANCES=1 and spark.executor.cores=5,
> spark.executor.memory=32GB).
>
> Also one more question. I understand that workers and executors are
> different processes. How many resources is the worker process actually
> using and how do I set those? As far as I understand the worker does not
> need many resources, as it is only spawning up executors. Is that correct?
>
> Thanks,
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
> On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller 
> wrote:
>
>> The workers and executors run as separate JVM processes in the standalone
>> mode.
>>
>>
>>
>> The use of multiple workers on a single machine depends on how you will
>> be using the clusters. If you run multiple Spark applications
>> simultaneously, each application gets its own executor. So, for
>> example, if you allocate 8GB to each application, you can run 192/8 Spark
>> applications simultaneously (assuming you also have large number of cores).
>> Each executor has only 8GB heap, so GC should not be issue. Alternatively,
>> if you know that you will have few applications running simultaneously on
>> that cluster, running multiple workers on each machine will allow you to
>> avoid GC issues associated with allocating large heap to a single JVM
>> process. This option allows you to run multiple executors for an
>> application on a single machine and each executor can be configured with
>> optimal memory.
>>
>>
>>
>>
>>
>> Mohammed
>>
>> Author: Big Data Analytics with Spark
>> 
>>
>>
>>
>> *From:* Simone Franzini [mailto:captainfr...@gmail.com]
>> *Sent:* Monday, May 2, 2016 9:27 AM
>> *To:* user
>> *Subject:* Fwd: Spark standalone workers, executors and JVMs
>>
>>
>>
>> I am still a little bit confused about workers, executors and JVMs in
>> standalone mode.
>>
>> Are worker processes and executors independent JVMs or do executors run
>> within the worker JVM?
>>
>> I have some memory-rich nodes (192GB) and I would like to avoid deploying
>> massive JVMs due to well known performance issues (GC and such).
>>
>> As of Spark 1.4 it is possible to either deploy multiple workers
>> (SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
>> worker 

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Thanks for the suggestions and links. The problem arises when I use the
DataFrame API to write, but it works fine when doing an insert overwrite into
the Hive table.

# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select *
from temp_table".format(table_name))

# Doesn't work, throws java.lang.OutOfMemoryError: Requested array size exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt', 'c_dt').parquet("/path/to/file/")

Thanks,
Bijay

On Wed, May 4, 2016 at 3:02 PM, Prajwal Tuladhar  wrote:

> If you are running on 64-bit JVM with less than 32G heap, you might want
> to enable -XX:+UseCompressedOops[1]. And if your dataframe is somehow
> generating more than 2^31-1 number of arrays, you might have to rethink
> your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak 
> wrote:
>
>> Hi,
>>
>> I am reading the parquet file around 50+ G which has 4013 partitions with
>> 240 columns. Below is my configuration
>>
>> driver : 20G memory with 4 cores
>> executors: 45 executors with 15G memory and 4 cores.
>>
>> I tried to read the data using both Dataframe read and using hive context
>> to read the data using hive SQL but for the both cases, it throws me below
>> error with no  further description on error.
>>
>> hive_context.sql("select * from test.base_table where
>> date='{0}'".format(part_dt))
>> sqlcontext.read.parquet("/path/to/partion/")
>>
>> #
>> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> #   Executing /bin/sh -c "kill -9 16953"...
>>
>>
>> What could be wrong over here since I think increasing memory only will
>> not help in this case since it reached the array size limit.
>>
>> Thanks,
>> Bijay
>>
>
>
>
> --
> --
> Cheers,
> Praj
>


RE: Spark standalone workers, executors and JVMs

2016-05-04 Thread Mohammed Guller
Spark allows you to configure the resources for the worker process. If I remember 
correctly, you can use SPARK_DAEMON_MEMORY to control the memory allocated to 
the worker process.

#1 below is more appropriate if you will be running just one application at a 
time. 32GB heap size is still too high. Depending on the garbage collector, you 
may see long pauses.

Mohammed
Author: Big Data Analytics with 
Spark

From: Simone Franzini [mailto:captainfr...@gmail.com]
Sent: Wednesday, May 4, 2016 7:40 AM
To: user
Subject: Re: Spark standalone workers, executors and JVMs

Hi Mohammed,

Thanks for your reply. I agree with you, however a single application can use 
multiple executors as well, so I am still not clear which option is best. Let 
me make an example to be a little more concrete.

Let's say I am only running a single application. Let's assume again that I 
have 192GB of memory and 24 cores on each node. Which one of the following two 
options is best and why:
1. Running 6 workers with 32GB each and 1 executor/worker (i.e. set 
SPARK_WORKER_INSTANCES=6, leave spark.executor.cores to its default, which is 
to assign all available cores to an executor in standalone mode).
2. Running 1 worker with 192GB memory and 6 executors/worker (i.e. 
SPARK_WORKER_INSTANCES=1 and spark.executor.cores=5, 
spark.executor.memory=32GB).

Also one more question. I understand that workers and executors are different 
processes. How many resources is the worker process actually using and how do I 
set those? As far as I understand the worker does not need many resources, as 
it is only spawning up executors. Is that correct?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller 
> wrote:
The workers and executors run as separate JVM processes in the standalone mode.

The use of multiple workers on a single machine depends on how you will be 
using the clusters. If you run multiple Spark applications simultaneously, each 
application gets its own executor. So, for example, if you allocate 8GB to 
each application, you can run 192/8 Spark applications simultaneously (assuming 
you also have large number of cores). Each executor has only 8GB heap, so GC 
should not be issue. Alternatively, if you know that you will have few 
applications running simultaneously on that cluster, running multiple workers 
on each machine will allow you to avoid GC issues associated with allocating 
large heap to a single JVM process. This option allows you to run multiple 
executors for an application on a single machine and each executor can be 
configured with optimal memory.


Mohammed
Author: Big Data Analytics with 
Spark

From: Simone Franzini 
[mailto:captainfr...@gmail.com]
Sent: Monday, May 2, 2016 9:27 AM
To: user
Subject: Fwd: Spark standalone workers, executors and JVMs

I am still a little bit confused about workers, executors and JVMs in 
standalone mode.
Are worker processes and executors independent JVMs or do executors run within 
the worker JVM?
I have some memory-rich nodes (192GB) and I would like to avoid deploying 
massive JVMs due to well known performance issues (GC and such).
As of Spark 1.4 it is possible to either deploy multiple workers 
(SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per worker 
(--executor-cores). Which option is preferable and why?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini




Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Prajwal Tuladhar
If you are running on a 64-bit JVM with less than a 32 GB heap, you might want to
enable -XX:+UseCompressedOops [1]. And if your DataFrame is somehow
producing an array with more than 2^31-1 elements, you might have to rethink
your options.

[1] https://spark.apache.org/docs/latest/tuning.html
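
If it helps, a sketch of passing that flag through the Spark configuration (on heaps under 32 GB most JVMs already enable compressed oops by default, so this just makes it explicit):

import org.apache.spark.SparkConf

// Sketch only: executor JVM options can be set in code before the context is
// created; the driver's own options usually have to be supplied at launch time
// (spark-defaults.conf or --driver-java-options) rather than programmatically.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
  .set("spark.driver.extraJavaOptions",   "-XX:+UseCompressedOops")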

On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak  wrote:

> Hi,
>
> I am reading the parquet file around 50+ G which has 4013 partitions with
> 240 columns. Below is my configuration
>
> driver : 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data using both Dataframe read and using hive context
> to read the data using hive SQL but for the both cases, it throws me below
> error with no  further description on error.
>
> hive_context.sql("select * from test.base_table where
> date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong over here since I think increasing memory only will
> not help in this case since it reached the array size limit.
>
> Thanks,
> Bijay
>



-- 
--
Cheers,
Praj


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtyXr2N13hf9O=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit

On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak  wrote:

> Hi,
>
> I am reading the parquet file around 50+ G which has 4013 partitions with
> 240 columns. Below is my configuration
>
> driver : 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data using both Dataframe read and using hive context
> to read the data using hive SQL but for the both cases, it throws me below
> error with no  further description on error.
>
> hive_context.sql("select * from test.base_table where
> date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong over here since I think increasing memory only will
> not help in this case since it reached the array size limit.
>
> Thanks,
> Bijay
>


SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi,

I am reading a Parquet file of around 50+ GB which has 4013 partitions and
240 columns. Below is my configuration:

driver: 20 GB memory with 4 cores
executors: 45 executors with 15 GB memory and 4 cores.

I tried to read the data using both the DataFrame reader and the Hive context
(Hive SQL), but in both cases it throws the error below with no further
description.

hive_context.sql("select * from test.base_table where
date='{0}'".format(part_dt))
sqlcontext.read.parquet("/path/to/partion/")

#
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 16953"...


What could be wrong here? I think increasing memory alone will not
help in this case since it has hit the array size limit.

Thanks,
Bijay


DAG Pipelines?

2016-05-04 Thread Cesar Flores
I read the ml-guide page (
http://spark.apache.org/docs/latest/ml-guide.html#details). It mentions that
it is possible to construct DAG Pipelines. Unfortunately, there is no
example explaining in which use cases this may be useful.

*Can someone give me an example or use case where this functionality may be
useful?*
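
One common case is a pipeline whose stage graph is not a straight line, e.g. two feature branches that merge before the estimator. A minimal sketch (the column names are made up; the only requirement is that setStages lists the stages in topological order):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, VectorAssembler}

// Sketch: "title" and "body" text columns are featurized independently and a
// VectorAssembler merges both branches before the classifier, so the stage
// graph forms a DAG rather than a linear chain.
val titleTok = new Tokenizer().setInputCol("title").setOutputCol("title_words")
val bodyTok  = new Tokenizer().setInputCol("body").setOutputCol("body_words")
val titleTF  = new HashingTF().setInputCol("title_words").setOutputCol("title_tf")
val bodyTF   = new HashingTF().setInputCol("body_words").setOutputCol("body_tf")
val assembler = new VectorAssembler()
  .setInputCols(Array("title_tf", "body_tf"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val dagPipeline = new Pipeline()
  .setStages(Array(titleTok, bodyTok, titleTF, bodyTF, assembler, lr))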



Thanks
-- 
Cesar Flores


Re: Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Thanks, Ted. This looks like the issue, since I am running it on EMR and the
Hive version is 1.0.0.


Thanks,
Bijay

On Wed, May 4, 2016 at 10:29 AM, Ted Yu  wrote:

> Looks like you were hitting HIVE-11940
>
> On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak 
> wrote:
>
>> Hello,
>>
>> I am writing Dataframe of around 60+ GB into partitioned Hive Table using
>> hiveContext in parquet format. The Spark insert overwrite jobs completes in
>> a reasonable amount of time around 20 minutes.
>>
>> But the job is taking a huge amount of time more than 2 hours to copy
>> data from .hivestaging directory in HDFS to final partition directory. What
>> could be the potential problem over here?
>>
>> hive_c.sql("""
>> INSERT OVERWRITE TABLE {0} PARTITION (row_eff_end_dt='{1}', 
>> ccd_dt)
>> SELECT * from temp_table
>> """.format(table_name, eff_end_dt)
>> )
>>
>> And the below process from the log is taking more than 2 hours.
>>
>> 16/05/04 06:41:28 INFO Hive: Replacing 
>> src:hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02/part-00306,
>>  dest: 
>> hdfs://internal:8020/user/hadoop/so_core_us/row_eff_end_dt=-12-31/ccd_dt=2012-09-02/part-00306,
>>  Status:true
>> 16/05/04 06:41:28 INFO Hive: New loading path = 
>> hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02
>>  with partSpec {row_eff_end_dt=-12-31, ccd_dt=2012-09-02}
>>
>>
>> Thanks,
>> Bijay
>>
>
>


Stackoverflowerror in scala.collection

2016-05-04 Thread BenD
I am getting a java.lang.StackOverflowError somewhere in my program. I am not
able to pinpoint which part causes it because the stack trace seems to be
incomplete (see end of message). The error doesn't happen all the time, and
I think it is based on the number of files that I load. I am running on AWS
EMR with Spark 1.6.0 with an m1.xlarge as driver and 3 r8.xlarge (244GB ram
+ 32 cores each) and an r2.xlarge (61GB ram + 8 cores) as executor machines
with the following configuration:

spark.driver.cores  2
spark.yarn.executor.memoryOverhead  5000
spark.dynamicAllocation.enabled true
spark.executor.cores2
spark.driver.memory 14g
spark.executor.memory   12g

While I can't post the full code or data, I will give a basic outline. I am
loading many JSON files from S3 into a JavaRDD<String>, which is then mapped
to a JavaPairRDD<Long, String> where the Long is the timestamp of the file.
I then union the RDDs into a single RDD which is then turned into a
DataFrame. After I have this dataframe, I run an SQL query on it and then
dump the result to S3.

A cut down version of the code would look similar to this:

List<JavaPairRDD<Long, String>> linesList = validFiles.map(x -> {
  try {
    Long date = dateMapper.call(x);
    return context.textFile(x.asPath()).mapToPair(y -> new Tuple2<>(date, y));
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}).collect(Collectors.toList());

JavaPairRDD<Long, String> unionedRDD = linesList.get(0);
if (linesList.size() > 1) {
  unionedRDD = context.union(unionedRDD, linesList.subList(1, linesList.size()));
}

HiveContext sqlContext = new HiveContext(context);
DataFrame table = sqlContext.read().json(unionedRDD.values());
table.registerTempTable("table");
sqlContext.cacheTable("table");
dumpToS3(sqlContext.sql("query"));


This runs fine sometimes, but other times I get the
java.lang.StackOverflowError. I know the error happens on a run where 7800
files are loaded. Based on the error message mentioning mapped values, I
assume the problem occurs in the mapToPair function, but I don't know why it
happens. Does anyone have some insight into this problem?

This is the whole print out of the error as seen in the container log:
java.lang.StackOverflowError
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at

Re: Writing output of key-value Pair RDD

2016-05-04 Thread Nicholas Chammas
You're looking for this discussion:
http://stackoverflow.com/q/23995040/877069

Also, a simpler alternative with DataFrames:
https://github.com/apache/spark/pull/8375#issuecomment-202458325
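
A minimal sketch of that DataFrame route, assuming an existing SparkContext named sc (the bucket path and column names are placeholders); each distinct key lands in its own key=<value> output directory rather than a literal file named after the key:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Sketch: partitionBy on the key column splits the output into one
// directory per key value.
val pairs = sc.parallelize(Seq(("a", "line1"), ("b", "line2")))
pairs.toDF("key", "value")
  .write
  .partitionBy("key")
  .json("s3a://my-bucket/output/") // placeholder destination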

On Wed, May 4, 2016 at 4:09 PM Afshartous, Nick 
wrote:

> Hi,
>
>
> Is there any way to write out to S3 the values of a key-value Pair RDD?
>
>
> I'd like each value of a pair to be written to its own file where the file
> name corresponds to the key name.
>
>
> Thanks,
>
> --
>
> Nick
>


Writing output of key-value Pair RDD

2016-05-04 Thread Afshartous, Nick
Hi,


Is there any way to write out to S3 the values of a key-value Pair RDD?


I'd like each value of a pair to be written to its own file where the file name 
corresponds to the key name.


Thanks,

--

Nick


Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Mich Talebzadeh
Hang on.

Are you talking about the target database in MSSQL being created and dropped?

Is your Spark JDBC credential in MSSQL a DBO or sa?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 4 May 2016 at 20:54, Andrés Ivaldi  wrote:

> OK, so I did that: I created the database and then inserted data, but Spark drops
> the table and tries to create it again. I'm using
> DataFrame.write with SaveMode.Overwrite; the documentation says:
> "when performing a Overwrite, the data will be deleted before writing out
> the new data."
>
> Why is it dropping the table?
>
>
> On Wed, May 4, 2016 at 6:44 AM, Andrés Ivaldi  wrote:
>
>> Yes, I can do that, it's what we are doing now, but I think the best
>> approach would be delegate the create table action to spark.
>>
>> On Tue, May 3, 2016 at 8:17 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Can you create the MSSQL (target) table first with the correct column
>>> setting and insert data from Spark to it with JDBC as opposed to JDBC
>>> creating target table itself?
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 3 May 2016 at 22:19, Andrés Ivaldi  wrote:
>>>
 Ok, Spark MSSQL dataType mapping is not right for me, ie. string is
 Text instead of varchar(MAX) , so how can I override default SQL Mapping?

 regards.

 On Sun, May 1, 2016 at 5:23 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Well if MSSQL cannot create that column then it is more like
> compatibility between Spark and RDBMS.
>
> What value that column has in MSSQL. Can you create table the table in
> MSSQL database or map it in Spark to a valid column before opening JDBC
> connection?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 April 2016 at 16:16, Andrés Ivaldi  wrote:
>
>> Hello, Spark is executing a create table sentence (using JDBC) to
>> MSSQLServer with a mapping column type like ColName Bit(1) for boolean
>> types, This create table cannot be executed on MSSQLServer.
>>
>> In class JdbcDialect the mapping for Boolean type is Bit(1), so the
>> question is, this is a problem of spark or JDBC driver who is not mapping
>> right?
>>
>> Anyway it´s possible to override that mapping in Spark?
>>
>> Regards
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


 --
 Ing. Ivaldi Andres

>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>
>
> --
> Ing. Ivaldi Andres
>
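
On the type-mapping question quoted above, here is a sketch of registering a custom dialect (the JdbcDialects/JdbcType API is public in Spark 1.4+; the specific type choices here are only illustrative):

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Sketch: override how Spark maps Catalyst types to SQL Server column types
// when it generates CREATE TABLE statements over JDBC.
object SqlServerDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:sqlserver") || url.startsWith("jdbc:jtds:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(MAX)", Types.VARCHAR))
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case _           => None // fall back to Spark's defaults
  }
}

JdbcDialects.registerDialect(SqlServerDialect)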


Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Andrés Ivaldi
OK, so I did that: I created the database and then inserted data, but Spark drops
the table and tries to create it again. I'm using
DataFrame.write with SaveMode.Overwrite; the documentation says:
"when performing a Overwrite, the data will be deleted before writing out
the new data."

Why is it dropping the table?
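
SaveMode.Overwrite drops and recreates the target table before writing. A sketch of the usual workaround, which is to create the table with your own DDL up front and then only append (the URL, credentials and table name below are placeholders, and df stands for the DataFrame being written):

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: Append inserts into the existing table instead of dropping it.
def appendToExistingTable(df: DataFrame): Unit = {
  val props = new Properties()
  props.setProperty("user", "spark_user")      // placeholder credentials
  props.setProperty("password", "***")
  props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

  val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb" // placeholder URL
  df.write.mode(SaveMode.Append).jdbc(url, "dbo.target_table", props)
}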


On Wed, May 4, 2016 at 6:44 AM, Andrés Ivaldi  wrote:

> Yes, I can do that, it's what we are doing now, but I think the best
> approach would be delegate the create table action to spark.
>
> On Tue, May 3, 2016 at 8:17 PM, Mich Talebzadeh  > wrote:
>
>> Can you create the MSSQL (target) table first with the correct column
>> setting and insert data from Spark to it with JDBC as opposed to JDBC
>> creating target table itself?
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 May 2016 at 22:19, Andrés Ivaldi  wrote:
>>
>>> Ok, Spark MSSQL dataType mapping is not right for me, ie. string is Text
>>> instead of varchar(MAX) , so how can I override default SQL Mapping?
>>>
>>> regards.
>>>
>>> On Sun, May 1, 2016 at 5:23 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Well if MSSQL cannot create that column then it is more like
 compatibility between Spark and RDBMS.

 What value that column has in MSSQL. Can you create table the table in
 MSSQL database or map it in Spark to a valid column before opening JDBC
 connection?

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 29 April 2016 at 16:16, Andrés Ivaldi  wrote:

> Hello, Spark is executing a create table sentence (using JDBC) to
> MSSQLServer with a mapping column type like ColName Bit(1) for boolean
> types, This create table cannot be executed on MSSQLServer.
>
> In class JdbcDialect the mapping for Boolean type is Bit(1), so the
> question is, this is a problem of spark or JDBC driver who is not mapping
> right?
>
> Anyway it´s possible to override that mapping in Spark?
>
> Regards
>
> --
> Ing. Ivaldi Andres
>


>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-05-04 Thread HLee
I had the same problem. One forum post elsewhere suggested that too much
network communication might be using up the available ports. I reduced the
number of partitions via repartition(int) and it solved the problem.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Issue-org-apache-spark-shuffle-FetchFailedException-Failed-to-connect-to-tp26511p26879.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




spark job stage failures

2016-05-04 Thread Prajwal Tuladhar
Hi,

I was wondering how Spark handles stage/task failures for a job.

We are running a Spark job that batch-writes to Elasticsearch, and we are
seeing one or two stage failures due to the ES cluster getting overloaded
(expected, as we are testing with a single-node ES cluster). But I was
assuming that when some of the batch writes to ES fail after a certain number
of retries (10), the whole job would be aborted; instead, we are seeing
the Spark job marked as finished even though a single stage failed.

How does Spark handle failure when a job or stage is marked as failed?

Thanks in advance.


-- 
--
Cheers,
Praj


Re: Spark and Kafka direct approach problem

2016-05-04 Thread Mich Talebzadeh
This works:

Spark 1.6.1, using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM,
Java 1.8.0_77)

Kafka version 0.9.0.1, using scala-library-2.11.7.jar

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 4 May 2016 at 19:52, Shixiong(Ryan) Zhu  wrote:

> It's because the Scala version of Spark and the Scala version of Kafka
> don't match. Please check them.
>
> On Wed, May 4, 2016 at 6:17 AM, أنس الليثي  wrote:
>
>> NoSuchMethodError usually appears because of a difference in the library
>> versions.
>>
>> Check the version of the libraries you downloaded, the version of spark,
>> the version of Kafka.
>>
>> On 4 May 2016 at 16:18, Luca Ferrari  wrote:
>>
>>> Hi,
>>>
>>> I’m new on Apache Spark and I’m trying to run the Spark Streaming +
>>> Kafka Integration Direct Approach example (JavaDirectKafkaWordCount.java
>>> ).
>>>
>>> I’ve downloaded all the libraries but when I try to run I get this error
>>>
>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>>>
>>> at kafka.api.RequestKeys$.<init>(RequestKeys.scala:48)
>>>
>>> at kafka.api.RequestKeys$.<clinit>(RequestKeys.scala)
>>>
>>> at kafka.api.TopicMetadataRequest.<init>(TopicMetadataRequest.scala:55)
>>>
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:122)
>>>
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
>>>
>>> at
>>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
>>>
>>> at
>>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>>>
>>> at
>>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
>>>
>>> at
>>> org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
>>> at it.unimi.di.luca.SimpleApp.main(SimpleApp.java:53)
>>>
>>> Any suggestions?
>>>
>>> Cheers
>>> Luca
>>>
>>>
>>
>>
>>
>> --
>> Anas Rabei
>> Senior Software Developer
>> Mubasher.info
>> anas.ra...@mubasher.info
>>
>
>


Re: Spark and Kafka direct approach problem

2016-05-04 Thread Shixiong(Ryan) Zhu
It's because the Scala version of Spark and the Scala version of Kafka
don't match. Please check them.
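
For example, with a Spark build for Scala 2.10, the Kafka integration and Kafka client jars need to be the _2.10 artifacts as well. A build.sbt sketch (versions are illustrative):

// Sketch: sbt's %% appends the Scala binary version, so with scalaVersion 2.10.x
// this pulls spark-streaming-kafka_2.10 and a matching Kafka client.
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
)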

On Wed, May 4, 2016 at 6:17 AM, أنس الليثي  wrote:

> NoSuchMethodError usually appears because of a difference in the library
> versions.
>
> Check the version of the libraries you downloaded, the version of spark,
> the version of Kafka.
>
> On 4 May 2016 at 16:18, Luca Ferrari  wrote:
>
>> Hi,
>>
>> I’m new on Apache Spark and I’m trying to run the Spark Streaming +
>> Kafka Integration Direct Approach example (JavaDirectKafkaWordCount.java
>> ).
>>
>> I’ve downloaded all the libraries but when I try to run I get this error
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>>
>> at kafka.api.RequestKeys$.<init>(RequestKeys.scala:48)
>>
>> at kafka.api.RequestKeys$.<clinit>(RequestKeys.scala)
>>
>> at kafka.api.TopicMetadataRequest.<init>(TopicMetadataRequest.scala:55)
>>
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:122)
>>
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
>>
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
>>
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>>
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
>>
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
>> at it.unimi.di.luca.SimpleApp.main(SimpleApp.java:53)
>>
>> Any suggestions?
>>
>> Cheers
>> Luca
>>
>>
>
>
>
> --
> Anas Rabei
> Senior Software Developer
> Mubasher.info
> anas.ra...@mubasher.info
>


Re: migration from Teradata to Spark SQL

2016-05-04 Thread Lohith Samaga M
Hi,
Can you look at Apache Drill as a SQL engine on Hive?

Lohith

Sent from my Sony Xperia™ smartphone


 Tapan Upadhyay wrote 

Thank you everyone for the guidance.

Jorn, our motivation is to move the bulk of our ad hoc queries to Hadoop so that we have 
enough bandwidth on our DB for important batches/queries.

For implementing a lambda architecture, is it possible to get real-time 
updates from Teradata for any insert/update/delete? DB logs?

Deepak, should we query data from Cassandra using Spark? How will it be 
different in terms of performance if we store our data in Hive tables (Parquet) 
and query them using Spark? If there is not much performance gain, why add one 
more layer of processing?

Mich, we plan to sync the data using Sqoop hourly/EOD jobs; we have not decided yet 
how frequently we would need to do that. It will be based on user requirements. 
In case they need real-time data we will need to think of an alternative. How are 
you doing the same for Sybase? How do you sync in real time?

Thank you!!


Regards,
Tapan Upadhyay
+1 973 652 8757

On Wed, May 4, 2016 at 4:33 AM, Alonso Isidoro Roman 
> wrote:
I agree with Deepak, and I would try to save the data in Parquet and Avro formats. If 
you can, measure the performance and choose the best; it will probably 
be Parquet, but you have to verify for yourself.

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces programar 
debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming must 
be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 9:22 GMT+02:00 Jörn Franke 
>:
Look at lambda architecture.

What is the motivation of your migration?

On 04 May 2016, at 03:29, Tapan Upadhyay 
> wrote:

Hi,

We are planning to move our adhoc queries from teradata to spark. We have huge 
volume of queries during the day. What is best way to go about it -

1) Read data directly from teradata db using spark jdbc

2) Import data using sqoop by EOD jobs into hive tables stored as parquet and 
then run queries on hive tables using spark sql or spark hive context.

any other ways through which we can do it in a better/efficiently?

Please guide.

Regards,
Tapan





Re: Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Ted Yu
Looks like you were hitting HIVE-11940

On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak 
wrote:

> Hello,
>
> I am writing Dataframe of around 60+ GB into partitioned Hive Table using
> hiveContext in parquet format. The Spark insert overwrite jobs completes in
> a reasonable amount of time around 20 minutes.
>
> But the job is taking a huge amount of time more than 2 hours to copy data
> from .hivestaging directory in HDFS to final partition directory. What
> could be the potential problem over here?
>
> hive_c.sql("""
> INSERT OVERWRITE TABLE {0} PARTITION (row_eff_end_dt='{1}', 
> ccd_dt)
> SELECT * from temp_table
> """.format(table_name, eff_end_dt)
> )
>
> And the below process from the log is taking more than 2 hours.
>
> 16/05/04 06:41:28 INFO Hive: Replacing 
> src:hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02/part-00306,
>  dest: 
> hdfs://internal:8020/user/hadoop/so_core_us/row_eff_end_dt=-12-31/ccd_dt=2012-09-02/part-00306,
>  Status:true
> 16/05/04 06:41:28 INFO Hive: New loading path = 
> hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02
>  with partSpec {row_eff_end_dt=-12-31, ccd_dt=2012-09-02}
>
>
> Thanks,
> Bijay
>


Re: IS spark have CapacityScheduler?

2016-05-04 Thread Ted Yu
Cycling old bits:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scheduling-with-Capacity-scheduler-td10038.html
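
Within a single application the closest analogue is fair-scheduler pools, configured with an XML file like the one quoted below. A sketch of wiring it in (the file path and pool name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable the FAIR scheduler, point it at an allocations file, and
// assign jobs submitted from a given thread to a named pool.
val conf = new SparkConf()
  .setAppName("fair-pools-sketch")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // placeholder path

val sc = new SparkContext(conf)
sc.setLocalProperty("spark.scheduler.pool", "production") // pool name from the XML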

On Wed, May 4, 2016 at 7:44 AM, 开心延年  wrote:

> Scheduling Within an Application
>
> I found the FAIR scheduler, but is there some example implementation like
> YARN's CapacityScheduler?
>
>
> <allocations>
>   <pool name="production">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>1</weight>
>     <minShare>2</minShare>
>   </pool>
>   <pool name="test">
>     <schedulingMode>FIFO</schedulingMode>
>     <weight>2</weight>
>     <minShare>3</minShare>
>   </pool>
> </allocations>
>


Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Hello,

I am writing a DataFrame of around 60+ GB into a partitioned Hive table using
hiveContext in Parquet format. The Spark insert overwrite job completes in
a reasonable amount of time, around 20 minutes.

But the job is taking a huge amount of time, more than 2 hours, to copy data
from the .hive-staging directory in HDFS to the final partition directory. What
could be the potential problem here?

hive_c.sql("""
INSERT OVERWRITE TABLE {0} PARTITION
(row_eff_end_dt='{1}', ccd_dt)
SELECT * from temp_table
""".format(table_name, eff_end_dt)
)

And the below process from the log is taking more than 2 hours.

16/05/04 06:41:28 INFO Hive: Replacing
src:hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02/part-00306,
dest: 
hdfs://internal:8020/user/hadoop/so_core_us/row_eff_end_dt=-12-31/ccd_dt=2012-09-02/part-00306,
Status:true
16/05/04 06:41:28 INFO Hive: New loading path =
hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02
with partSpec {row_eff_end_dt=-12-31, ccd_dt=2012-09-02}


Thanks,
Bijay


unsubscribe

2016-05-04 Thread Vadim Vararu

unsubscribe




Re: groupBy and store in parquet

2016-05-04 Thread Xinh Huynh
Hi Michal,

For (1), would it be possible to partitionBy two columns to reduce the
size? Something like partitionBy("event_type", "date").

For (2), is there a way to separate the different event types upstream,
like on different Kafka topics, and then process them separately?
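
A sketch of (1), assuming the events DataFrame from the code quoted below already carries a date column (the column name is illustrative):

import org.apache.spark.sql.SaveMode

// Sketch: partitioning by event type and day keeps each
// event_type=X/date=Y directory reasonably small.
events.write
  .partitionBy("event_type", "date")
  .format("parquet")
  .mode(SaveMode.Append)
  .save(tmpFolder)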

Xinh

On Wed, May 4, 2016 at 7:47 AM, Michal Vince  wrote:

> Hi guys
>
> I`m trying to store kafka stream with ~5k events/s as efficiently as
> possible in parquet format to hdfs.
>
> I can`t make any changes to kafka (belongs to 3rd party)
>
>
> Events in kafka are in json format, but the problem is there are many
> different event types (from different subsystems with different number of
> fields, different size etc..) so it doesn`t make any sense to store them in
> the same file
>
>
> I was trying to read data to DF and then repartition it by event_type and
> store
>
> events.write.partitionBy("event_type").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save(tmpFolder)
>
> which is quite fast but have 2 drawbacks that I`m aware of
>
> 1. output folder has only one partition which can be huge
>
> 2. all DFs created like this share the same schema, so even dfs with few
> fields have tons of null fields
>
>
> My second try is bit naive and really really slow (you can see why in
> code) - filter DF by event type and store them temporarily as json (to get
> rid of null fields)
>
> val event_types = events.select($"event_type").distinct().collect() // get 
> event_types in this batch
> for (row <- event_types) {
>   val currDF = events.filter($"event_type" === row.get(0))
>   val tmpPath = tmpFolder + row.get(0)
>   
> currDF.write.format("json").mode(org.apache.spark.sql.SaveMode.Append).save(tmpPath)
>   sqlContext.read.json(tmpPath).write.format("parquet").save(basePath)
>
> }hdfs.delete(new Path(tmpFolder), true)
>
>
> Do you have any suggestions for any better solution to this?
>
> thanks
>
>
>


Re: migration from Teradata to Spark SQL

2016-05-04 Thread Tapan Upadhyay
Thank you everyone for the guidance.

*Jorn*, our motivation is to move the bulk of our ad hoc queries to Hadoop so that we
have enough bandwidth on our DB for important batches/queries.

For implementing a lambda architecture, is it possible to get real-time
updates from Teradata for any insert/update/delete? DB logs?

*Deepak*, should we query data from Cassandra using Spark? How will it be
different in terms of performance if we store our data in Hive
tables (Parquet) and query them using Spark? If there is not much
performance gain, why add one more layer of processing?

*Mich*, we plan to sync the data using Sqoop hourly/EOD jobs; we have not
decided yet how frequently we would need to do that. It will be based on user
requirements. In case they need real-time data we will need to think of an
alternative. How are you doing the same for Sybase? How do you sync in real time?

Thank you!!


Regards,
Tapan Upadhyay
+1 973 652 8757

On Wed, May 4, 2016 at 4:33 AM, Alonso Isidoro Roman 
wrote:

> I agree with Deepak, and I would try to save the data in Parquet and Avro
> formats. If you can, measure the performance and choose the best; it
> will probably be Parquet, but you have to verify for yourself.
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
> 2016-05-04 9:22 GMT+02:00 Jörn Franke :
>
>> Look at lambda architecture.
>>
>> What is the motivation of your migration?
>>
>> On 04 May 2016, at 03:29, Tapan Upadhyay  wrote:
>>
>> Hi,
>>
>> We are planning to move our adhoc queries from teradata to spark. We have
>> huge volume of queries during the day. What is best way to go about it -
>>
>> 1) Read data directly from teradata db using spark jdbc
>>
>> 2) Import data using sqoop by EOD jobs into hive tables stored as parquet
>> and then run queries on hive tables using spark sql or spark hive context.
>>
>> any other ways through which we can do it in a better/efficiently?
>>
>> Please guide.
>>
>> Regards,
>> Tapan
>>
>>
>


groupBy and store in parquet

2016-05-04 Thread Michal Vince

Hi guys

I'm trying to store a Kafka stream with ~5k events/s as efficiently as 
possible in Parquet format on HDFS.


I can't make any changes to Kafka (it belongs to a 3rd party).


Events in Kafka are in JSON format, but the problem is that there are many 
different event types (from different subsystems, with different numbers 
of fields, different sizes, etc.), so it doesn't make any sense to store 
them in the same file.



I was trying to read the data into a DF and then partition it by
event_type and store it:


events.write.partitionBy("event_type").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save(tmpFolder)

which is quite fast but has 2 drawbacks that I'm aware of:

1. the output folder has only one partition, which can be huge

2. all DFs created like this share the same schema, so even DFs with few
fields have tons of null fields



My second try is a bit naive and really, really slow (you can see why in
the code): filter the DF by event type and store each subset temporarily as
JSON (to get rid of the null fields), then read it back and write Parquet:


val event_types = events.select($"event_type").distinct().collect() // get event_types in this batch

for (row <- event_types) {
  val currDF = events.filter($"event_type" === row.get(0))
  val tmpPath = tmpFolder + row.get(0)
  currDF.write.format("json").mode(org.apache.spark.sql.SaveMode.Append).save(tmpPath)
  sqlContext.read.json(tmpPath).write.format("parquet").save(basePath)
}
hdfs.delete(new Path(tmpFolder), true)
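
One variation I am thinking about (only a sketch, and it assumes DataFrame.toJSON
keeps its behaviour of omitting null-valued fields, which is the same behaviour the
temporary JSON files rely on): re-infer each type's narrow schema in memory instead
of bouncing through HDFS.

val eventTypes = events.select($"event_type").distinct().collect().map(_.getString(0))

for (t <- eventTypes) {
  // toJSON drops null fields, so read.json infers only the columns this type actually has
  val narrow = sqlContext.read.json(events.filter($"event_type" === t).toJSON)
  narrow.write
    .mode(org.apache.spark.sql.SaveMode.Append)
    .parquet(basePath + "/event_type=" + t)
}

Whether this is really faster than the JSON detour depends on the cost of the extra
pass over events per type, so it would have to be measured.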


Do you have any suggestions for any better solution to this?

thanks




IS spark have CapacityScheduler?

2016-05-04 Thread ????????
Scheduling Within an Application

I found the FAIR scheduler, but is there an example implementation like
YARN's CapacityScheduler?



 
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
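
As far as I can tell Spark's in-application scheduler only offers FIFO and FAIR
modes, so there is no CapacityScheduler as such; weighted pools with a minShare
(as above) seem to be the closest analogue to YARN's capacity queues. A minimal
sketch of wiring the file up (the file path and the "production" pool name are
placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-pools-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread run in the "production" pool defined above.
sc.setLocalProperty("spark.scheduler.pool", "production")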

Re: restrict my spark app to run on specific machines

2016-05-04 Thread Ted Yu
Please refer to:
https://spark.apache.org/docs/latest/running-on-yarn.html

You can set spark.yarn.am.nodeLabelExpression and
spark.yarn.executor.nodeLabelExpression to a YARN node label assigned to those 2 machines.
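
For example (a sketch only; "sparkOnly" is a placeholder node label that you would
first create and assign to those 2 machines with yarn rmadmin, and the class/jar
names are placeholders too):

spark-submit \
  --master yarn \
  --conf spark.yarn.am.nodeLabelExpression=sparkOnly \
  --conf spark.yarn.executor.nodeLabelExpression=sparkOnly \
  --class com.example.MyApp my-app.jar

If the cluster is standalone rather than YARN there is no direct equivalent; the
usual workarounds are a separate master/worker set on those two machines, or
capping each application with spark.cores.max.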

On Wed, May 4, 2016 at 3:03 AM, Shams ul Haque  wrote:

> Hi,
>
> I have a cluster of 4 machines for Spark. I want my Spark app to run on 2
> machines only. And rest 2 machines for other Spark apps.
> So my question is, can I restrict my app to run on that 2 machines only by
> passing some IP at the time of setting SparkConf or by any other setting?
>
>
> Thanks,
> Shams
>


Re: Spark standalone workers, executors and JVMs

2016-05-04 Thread Simone Franzini
Hi Mohammed,

Thanks for your reply. I agree with you; however, a single application can
use multiple executors as well, so I am still not clear on which option is
best. Let me give an example to make it a little more concrete.

Let's say I am only running a single application. Let's assume again that I
have 192GB of memory and 24 cores on each node. Which one of the following
two options is best and why:
1. Running 6 workers with 32GB each and 1 executor/worker (i.e. set
SPARK_WORKER_INSTANCES=6, leave spark.executor.cores to its default, which
is to assign all available cores to an executor in standalone mode).
2. Running 1 worker with 192GB memory and 6 executors/worker (i.e.
SPARK_WORKER_INSTANCES=1 and spark.executor.cores=5,
spark.executor.memory=32GB).
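
For concreteness, the two options would roughly translate into the following
settings (a sketch only; in practice the memory values need headroom for the OS
and other daemons, and it uses 4 cores per executor so that 6 executors fit into
the 24 cores):

# Option 1: conf/spark-env.sh on each node
SPARK_WORKER_INSTANCES=6
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=32g

# Option 2: one worker per node, resources split per executor at submit time
SPARK_WORKER_INSTANCES=1
SPARK_WORKER_MEMORY=192g
# plus, in spark-defaults.conf or via --conf:
# spark.executor.cores=4
# spark.executor.memory=32g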

Also one more question. I understand that workers and executors are
different processes. How many resources does the worker process actually
use, and how do I set those? As far as I understand, the worker does not
need many resources, as it is only spawning executors. Is that correct?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller 
wrote:

> The workers and executors run as separate JVM processes in the standalone
> mode.
>
>
>
> The use of multiple workers on a single machine depends on how you will be
> using the clusters. If you run multiple Spark applications simultaneously,
> each application gets its own its executor. So, for example, if you
> allocate 8GB to each application, you can run 192/8 Spark applications
> simultaneously (assuming you also have large number of cores). Each
> executor has only 8GB heap, so GC should not be issue. Alternatively, if
> you know that you will have few applications running simultaneously on that
> cluster, running multiple workers on each machine will allow you to avoid
> GC issues associated with allocating large heap to a single JVM process.
> This option allows you to run multiple executors for an application on a
> single machine and each executor can be configured with optimal memory.
>
>
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Simone Franzini [mailto:captainfr...@gmail.com]
> *Sent:* Monday, May 2, 2016 9:27 AM
> *To:* user
> *Subject:* Fwd: Spark standalone workers, executors and JVMs
>
>
>
> I am still a little bit confused about workers, executors and JVMs in
> standalone mode.
>
> Are worker processes and executors independent JVMs or do executors run
> within the worker JVM?
>
> I have some memory-rich nodes (192GB) and I would like to avoid deploying
> massive JVMs due to well known performance issues (GC and such).
>
> As of Spark 1.4 it is possible to either deploy multiple workers
> (SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per
> worker (--executor-cores). Which option is preferable and why?
>
>
>
> Thanks,
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
>
>


Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Cody Koeninger
Kafka 0.8.2 should be fine.

If it works on your laptop but not on CDH, as Sean said you'll
probably get better help on CDH forums.

On Wed, May 4, 2016 at 4:19 AM, Michel Hubert  wrote:
> We're running Kafka 0.8.2.2
> Is that the problem, why?
>
> -Oorspronkelijk bericht-
> Van: Sean Owen [mailto:so...@cloudera.com]
> Verzonden: woensdag 4 mei 2016 10:41
> Aan: Michel Hubert 
> CC: user@spark.apache.org
> Onderwerp: Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0
>
> Please try the CDH forums; this is the Spark list:
> http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark
>
> Before you even do that, I can tell you to double check you're running Kafka 
> 0.9.
>
> On Wed, May 4, 2016 at 9:29 AM, Michel Hubert  wrote:
>>
>>
>> Hi,
>>
>>
>>
>> After an upgrade to CDH 5.7.0 we have troubles with the Kafka to Spark
>> Streaming.
>>
>>
>>
>> The example jar doesn’t work:
>>
>>
>>
>> /opt/cloudera/parcels/CDH/lib/spark/bin/run-example
>> streaming.KafkaWordCount ….
>>
>>
>>
>> Attached is a log file.
>>
>>
>>
>> 16/05/04 10:06:23 WARN consumer.ConsumerFetcherThread:
>> [ConsumerFetcherThread-wordcount1_host81436-cld.domain.com-14623491821
>> 72-13b9f2e7-0-2], Error in fetch
>> kafka.consumer.ConsumerFetcherThread$FetchRequest@47fb7e6d.
>> Possible cause: java.nio.BufferUnderflowException
>>
>> 16/05/04 10:06:24 WARN consumer.ConsumerFetcherThread:
>> [ConsumerFetcherThread-wordcount1_host81436-cld.domain.com-14623491821
>> 72-13b9f2e7-0-1], Error in fetch
>> kafka.consumer.ConsumerFetcherThread$FetchRequest@73b9a762.
>> Possible cause: java.lang.IllegalArgumentException
>>
>>
>>
>> We have no problem running this from my development pc, only in
>> product op CDH 5.7.0 environment.
>>
>>
>>
>>
>>
>> Any ideas?
>>
>>
>>
>> Thanks,
>>
>> Michel
>>
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>> additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark MLLib benchmarks

2016-05-04 Thread kmurph

Hi, 

I'm benchmarking Spark (1.6) and MLlib TF-IDF (with HDFS) on a 20GB dataset,
and I'm not seeing much scale-up when I increase cores/executors/RAM according
to the Spark tuning documentation. I suspect I'm missing a trick in my
configuration.

I'm running on shared memory (96 cores, 256GB RAM) and testing various
combinations of:
Number of executors (1,2,4,8)
Number of cores per executor (1,2,4,8,12,24)
Memory per executor (calculated as per cloudera recommendations)
Of course in line with combined resource limits.

Also setting the RDD partition count to 2, 4, 6 or 8 (I see the best results at
4 partitions, about 5% better than the worst case).
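
For reference, the job is essentially the standard MLlib TF-IDF pair, roughly as
sketched below (the path and partition count are illustrative only). One thing I
am wondering is whether 4 partitions can ever keep 96 cores busy, since a stage
runs at most one task per partition:

import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Illustrative: aim for a partition count of 2-3x the total cores rather than 4.
val docs = sc.textFile("hdfs:///data/corpus", minPartitions = 192)
  .map(_.split(" ").toSeq)

val tf = new HashingTF().transform(docs)
tf.cache()                          // IDF.fit and IDF.transform both traverse tf

val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)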

I have also varied/switched the following settings:
Using the Kryo serializer
Setting driver memory
Enabling compression options
Dynamic scheduling
Trying different storage levels for persisting RDDs

As we go up in cores, even in the best of these configurations we still see a
running time of 19-20 minutes.
Is there anything else I should be configuring to get better scale-up?
Are there any documented TF-IDF benchmark results that I can compare
against to validate (even if only via very approximate, indirect
comparisons)?

Any advice would be much appreciated,
Thanks
Karen




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLib-benchmarks-tp26878.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Select Statement

2016-05-04 Thread Mich Talebzadeh
which database is that table, a Hive database?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 4 May 2016 at 14:44, Ted Yu  wrote:

> Please take a look
> at 
> sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
> :
>
>   } else if (key.startsWith("use:")) {
> SessionState.get().setCurrentDatabase(entry.getValue());
>
> bq. no such table winbox_prod_action_logs_1
>
> The above doesn't match the table name shown in your code.
>
> On Wed, May 4, 2016 at 2:39 AM, Sree Eedupuganti  wrote:
>
>> Hello Spark users, can we query the SQL SELECT statement in Spark using
>> Java.
>> if it is possible any suggestions please. I tried like this.How to pass
>> the database name.
>> Here my database name is nimbus and table name is winbox_opens.
>>
>> *Source Code :*
>>
>> *public class Select { public static class SquareKey implements
>> Function {   public Integer call(Row row) throws Exception {
>>return row.getInt(0) * row.getInt(0);   } } public static void
>> main(String[] args) throws Exception {  SparkConf s = new
>> SparkConf().setMaster("local[2]").setAppName("Select");  SparkContext sc =
>> new SparkContext(s);  HiveContext hc = new HiveContext(sc);   DataFrame rdd
>> = hc.sql("SELECT * FROM winbox_opens");   JavaRDD squaredKeys =
>> rdd.toJavaRDD().map(new SquareKey());   List result =
>> squaredKeys.collect();   for (Integer elem : result) {
>>  System.out.println(elem);   }  }}*
>>
>> *Error: Exception in thread "main"
>> org.apache.spark.sql.AnalysisException: no such table
>> winbox_prod_action_logs_1; line 1 pos 14*
>>
>> --
>> Best Regards,
>> Sreeharsha Eedupuganti
>> Data Engineer
>> innData Analytics Private Limited
>>
>
>


Re: Spark Select Statement

2016-05-04 Thread Ted Yu
Please take a look
at 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
:

  } else if (key.startsWith("use:")) {
SessionState.get().setCurrentDatabase(entry.getValue());

bq. no such table winbox_prod_action_logs_1

The above doesn't match the table name shown in your code.
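
In other words, either select the database first or qualify the table name. A
sketch (Scala for brevity; the same calls exist on HiveContext from Java, and sc
is assumed to be an existing SparkContext):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Either select the database first...
hc.sql("USE nimbus")
val df1 = hc.sql("SELECT * FROM winbox_opens")

// ...or qualify the table name directly:
val df2 = hc.sql("SELECT * FROM nimbus.winbox_opens")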

On Wed, May 4, 2016 at 2:39 AM, Sree Eedupuganti  wrote:

> Hello Spark users, can we query the SQL SELECT statement in Spark using
> Java.
> if it is possible any suggestions please. I tried like this.How to pass
> the database name.
> Here my database name is nimbus and table name is winbox_opens.
>
> *Source Code :*
>
> *public class Select { public static class SquareKey implements
> Function {   public Integer call(Row row) throws Exception {
>return row.getInt(0) * row.getInt(0);   } } public static void
> main(String[] args) throws Exception {  SparkConf s = new
> SparkConf().setMaster("local[2]").setAppName("Select");  SparkContext sc =
> new SparkContext(s);  HiveContext hc = new HiveContext(sc);   DataFrame rdd
> = hc.sql("SELECT * FROM winbox_opens");   JavaRDD squaredKeys =
> rdd.toJavaRDD().map(new SquareKey());   List result =
> squaredKeys.collect();   for (Integer elem : result) {
>  System.out.println(elem);   }  }}*
>
> *Error: Exception in thread "main" org.apache.spark.sql.AnalysisException:
> no such table winbox_prod_action_logs_1; line 1 pos 14*
>
> --
> Best Regards,
> Sreeharsha Eedupuganti
> Data Engineer
> innData Analytics Private Limited
>


Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-04 Thread sunday2000
Check your javac version, and update it.




------ Original message ------
From: "Divya Gehlot";
Date: 2016-05-04 11:25
To: "sunday2000" <2314476...@qq.com>;
Cc: "user"; "user";
Subject: Re: spark 1.6.1 build failure of : scala-maven-plugin



Hi, even I am getting a similar error:
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile

when I tried to build the Phoenix project using Maven.
Maven version : 3.3
Java version - 1.7_67
Phoenix - downloaded latest master from Git hub
If anybody finds the resolution, please share.




Thanks,
Divya 


On 3 May 2016 at 10:18, sunday2000 <2314476...@qq.com> wrote:
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 14.765 s
[INFO] Finished at: 2016-05-03T10:08:46+08:00
[INFO] Final Memory: 35M/191M
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-test-tags_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 20 more
Caused by: Compile failed via zinc server
at 
sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at 
sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at 
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at 
scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
at 
scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
... 21 more
[ERROR] 
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-test-tags_2.10

Reading from cassandra store in rdd

2016-05-04 Thread Yasemin Kaya
Hi,

I asked this question datastax group but i want to ask also spark-user
group, someone may face this problem.

I have data in Cassandra and want to get it into a Spark RDD. I got an error
and searched for it, but nothing changed. Can anyone help me fix it?
I can connect to Cassandra with cqlsh; there is no problem there.
I took code from datastax github page

.
My code 
 , my error


I am using
*Cassandra 3.5*
*spark: 1.5.2*

*cassandra-driver : 3.0.0*
*spark-cassandra-connector_2.10-1.5.0.jar*
*spark-cassandra-connector-java_2.10-1.5.0-M1.jar*
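
For reference, what I am essentially trying to do is the minimal connector read
below (host, keyspace and table names are placeholders). I will also double-check
the connector's version compatibility table, since each connector release targets
specific Spark and Cassandra versions:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(rdd.count())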

Best,
yasemin

-- 
hiç ender hiç


Re: Error from reading S3 in Scala

2016-05-04 Thread Steve Loughran

On 4 May 2016, at 13:52, Zhang, Jingyu 
> wrote:

Thanks everyone,

One reason to use "s3a//" is because  I use "s3a//" in my development env 
(Eclipse) on a desktop. I will debug and test on my desktop then put jar file 
on EMR Cluster. I do not think "s3//" will works on a desktop.


s3n will work, it's just slower, and has a real performance problem if you 
close, say, a 2GB file while only 6 bytes in, as it will read to the end of the 
file first.


With helping from AWS suport, this bug is cause by the version of Joda-Time in 
my pom file is not match with aws-SDK.jar because AWS authentication requires a 
valid Date or x-amz-date header. It will work after update to joda-time 2.8.1, 
aws SDK 1.10.x and amazon-hadoop 2.6.1.


and Java 8u60, right?


But, it will shown exception on amazon-hadoop 2.7.2. The reason for using 
amazon-hadoop 2.7.2 is because in EMR 4.6.0 the supported version are Hadoop 
2.7.2, Spark 1.6.1.


oh, that's this problem.


https://issues.apache.org/jira/browse/HADOOP-13044
https://issues.apache.org/jira/browse/HADOOP-13050

the quickest fix for you is to check out Hadoop branch-2.7 and rebuild it with
the AWS SDK library version bumped up to 1.10.60, with httpclient also updated in
sync. That may break some other things, that being the problem of
mass transitive-classpath updates.

you could also provide a patch 
https://issues.apache.org/jira/browse/HADOOP-13062, using introspection for 
some of the AWS binding, so that you could then take a 2.7.3+ release and drop 
in whichever AWS JAR you wanted. That would be appreciated by many



Please let me know if you have a better idea to set up the development 
environment for debug and test.

Regards,

Jingyu





On 4 May 2016 at 20:32, James Hammerton > 
wrote:


On 3 May 2016 at 17:22, Gourav Sengupta 
> wrote:
Hi,

The best thing to do is start the EMR clusters with proper permissions in the 
roles that way you do not need to worry about the keys at all.

Another thing, why are we using s3a// instead of s3:// ?

Probably because of what's said about s3:// and s3n:// here (which is why I use 
s3a://):

https://wiki.apache.org/hadoop/AmazonS3

Regards,

James


Besides that you can increase s3 speeds using the instructions mentioned here: 
https://aws.amazon.com/blogs/aws/aws-storage-update-amazon-s3-transfer-acceleration-larger-snowballs-in-more-regions/


Regards,
Gourav

On Tue, May 3, 2016 at 12:04 PM, Steve Loughran 
> wrote:
don't put your secret in the URI, it'll only creep out in the logs.

Use the specific properties coverd in 
http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html, 
which you can set in your spark context by prefixing them with spark.hadoop.

you can also set the env vars, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; 
SparkEnv will pick these up and set the relevant spark context keys for you


On 3 May 2016, at 01:53, Zhang, Jingyu 
> wrote:

Hi All,

I am using Eclipse with Maven for developing Spark applications. I got a error 
for Reading from S3 in Scala but it works fine in Java when I run them in the 
same project in Eclipse. The Scala/Java code and the error in following


Scala

val uri = URI.create("s3a://" + key + ":" + seckey + "@" + 
"graphclustering/config.properties");
val pt = new Path("s3a://" + key + ":" + seckey + "@" + 
"graphclustering/config.properties");
val fs = FileSystem.get(uri,ctx.hadoopConfiguration);
val  inputStream:InputStream = fs.open(pt);


Exception: on aws-java-1.7.4 and hadoop-aws-2.6.1

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: 
Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; 
Request ID: 8A56DC7BF0BFF09A), S3 Extended Request ID

at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1160)

at 
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:748)

at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:467)

at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302)

at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)

at 
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1050)

at 
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1027)

at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:688)

at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:222)

at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)

at com.news.report.graph.GraphCluster$.main(GraphCluster.scala:53)

at com.news.report.graph.GraphCluster.main(GraphCluster.scala)

16/05/03 10:49:17 INFO SparkContext: Invoking stop() from 

Re: Spark and Kafka direct approach problem

2016-05-04 Thread أنس الليثي
NoSuchMethodError usually appears because of a difference in the library
versions.

Check the version of the libraries you downloaded, the version of spark,
the version of Kafka.
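
In particular, a missing scala.Predef$.ArrowAssoc signature is the classic symptom
of mixing Scala binary versions, e.g. a _2.10 Spark/Kafka artifact next to a 2.11
Scala library (or vice versa). A pom.xml sketch of keeping everything on one Scala
suffix; the versions shown are only an example and must match the Spark you run
against:

<!-- Illustrative only: keep every Spark/Kafka artifact on the same Scala suffix -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.6.1</version>
</dependency>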

On 4 May 2016 at 16:18, Luca Ferrari  wrote:

> Hi,
>
> I’m new on Apache Spark and I’m trying to run the Spark Streaming + Kafka
> Integration Direct Approach example (JavaDirectKafkaWordCount.java).
>
> I’ve downloaded all the libraries but when I try to run I get this error
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>
> at kafka.api.RequestKeys$.(RequestKeys.scala:48)
>
> at kafka.api.RequestKeys$.(RequestKeys.scala)
>
> at kafka.api.TopicMetadataRequest.(TopicMetadataRequest.scala:55)
>
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:122)
>
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
>
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
>
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
>
> at
> org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
> at it.unimi.di.luca.SimpleApp.main(SimpleApp.java:53)
>
> Any suggestions?
>
> Cheers
> Luca
>
>



-- 
Anas Rabei
Senior Software Developer
Mubasher.info
anas.ra...@mubasher.info


Spark and Kafka direct approach problem

2016-05-04 Thread Luca Ferrari
Hi,

I’m new on Apache Spark and I’m trying to run the Spark Streaming + Kafka 
Integration Direct Approach example (JavaDirectKafkaWordCount.java).

I’ve downloaded all the libraries but when I try to run I get this error


Exception in thread "main" java.lang.NoSuchMethodError: 
scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

at kafka.api.RequestKeys$.(RequestKeys.scala:48)

at kafka.api.RequestKeys$.(RequestKeys.scala)

at kafka.api.TopicMetadataRequest.(TopicMetadataRequest.scala:55)

at 
org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:122)

at 
org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)

at 
org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)

at 
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)

at 
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)

at 
org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)

at it.unimi.di.luca.SimpleApp.main(SimpleApp.java:53)

Any suggestions?

Cheers
Luca
 



Re: Error from reading S3 in Scala

2016-05-04 Thread Zhang, Jingyu
Thanks everyone,

One reason to use "s3a//" is because  I use "s3a//" in my development env
(Eclipse) on a desktop. I will debug and test on my desktop then put jar
file on EMR Cluster. I do not think "s3//" will works on a desktop.

With help from AWS support, we found this bug is caused by the version of Joda-Time
in my pom file not matching the aws-sdk jar, because AWS authentication
requires a valid Date or x-amz-date header. It works after updating to
Joda-Time 2.8.1, AWS SDK 1.10.x and amazon-hadoop 2.6.1.

But it still throws an exception on amazon-hadoop 2.7.2. The reason for
using amazon-hadoop 2.7.2 is that in EMR 4.6.0 the supported versions are
Hadoop 2.7.2 and Spark 1.6.1.

Please let me know if you have a better idea to set up the development
environment for debug and test.
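
For the desktop side, something like the sketch below should keep the credentials
out of the URI (and therefore out of the logs); the fs.s3a.* property names are
the standard Hadoop ones, and the two environment variables are assumed to be set:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3a-local-test").setMaster("local[2]"))
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val props = sc.textFile("s3a://graphclustering/config.properties")
println(props.count())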

Regards,

Jingyu





On 4 May 2016 at 20:32, James Hammerton  wrote:

>
>
> On 3 May 2016 at 17:22, Gourav Sengupta  wrote:
>
>> Hi,
>>
>> The best thing to do is start the EMR clusters with proper permissions in
>> the roles that way you do not need to worry about the keys at all.
>>
>> Another thing, why are we using s3a// instead of s3:// ?
>>
>
> Probably because of what's said about s3:// and s3n:// here (which is why
> I use s3a://):
>
> https://wiki.apache.org/hadoop/AmazonS3
>
> Regards,
>
> James
>
>
>> Besides that you can increase s3 speeds using the instructions mentioned
>> here:
>> https://aws.amazon.com/blogs/aws/aws-storage-update-amazon-s3-transfer-acceleration-larger-snowballs-in-more-regions/
>>
>>
>> Regards,
>> Gourav
>>
>> On Tue, May 3, 2016 at 12:04 PM, Steve Loughran 
>> wrote:
>>
>>> don't put your secret in the URI, it'll only creep out in the logs.
>>>
>>> Use the specific properties coverd in
>>> http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html,
>>> which you can set in your spark context by prefixing them with spark.hadoop.
>>>
>>> you can also set the env vars, AWS_ACCESS_KEY_ID and
>>> AWS_SECRET_ACCESS_KEY; SparkEnv will pick these up and set the relevant
>>> spark context keys for you
>>>
>>>
>>> On 3 May 2016, at 01:53, Zhang, Jingyu  wrote:
>>>
>>> Hi All,
>>>
>>> I am using Eclipse with Maven for developing Spark applications. I got a
>>> error for Reading from S3 in Scala but it works fine in Java when I run
>>> them in the same project in Eclipse. The Scala/Java code and the error in
>>> following
>>>
>>>
>>> Scala
>>>
>>> val uri = URI.create("s3a://" + key + ":" + seckey + "@" +
>>> "graphclustering/config.properties");
>>> val pt = new Path("s3a://" + key + ":" + seckey + "@" +
>>> "graphclustering/config.properties");
>>> val fs = FileSystem.get(uri,ctx.hadoopConfiguration);
>>> val  inputStream:InputStream = fs.open(pt);
>>>
>>> Exception: on aws-java-1.7.4 and hadoop-aws-2.6.1
>>>
>>> Exception in thread "main"
>>> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service:
>>> Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID:
>>> 8A56DC7BF0BFF09A), S3 Extended Request ID
>>>
>>> at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(
>>> AmazonHttpClient.java:1160)
>>>
>>> at com.amazonaws.http.AmazonHttpClient.executeOneRequest(
>>> AmazonHttpClient.java:748)
>>>
>>> at com.amazonaws.http.AmazonHttpClient.executeHelper(
>>> AmazonHttpClient.java:467)
>>>
>>> at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302
>>> )
>>>
>>> at com.amazonaws.services.s3.AmazonS3Client.invoke(
>>> AmazonS3Client.java:3785)
>>>
>>> at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
>>> AmazonS3Client.java:1050)
>>>
>>> at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
>>> AmazonS3Client.java:1027)
>>>
>>> at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(
>>> S3AFileSystem.java:688)
>>>
>>> at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:222)
>>>
>>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>>>
>>> at com.news.report.graph.GraphCluster$.main(GraphCluster.scala:53)
>>>
>>> at com.news.report.graph.GraphCluster.main(GraphCluster.scala)
>>>
>>> 16/05/03 10:49:17 INFO SparkContext: Invoking stop() from shutdown hook
>>>
>>> 16/05/03 10:49:17 INFO SparkUI: Stopped Spark web UI at
>>> http://10.65.80.125:4040
>>>
>>> 16/05/03 10:49:17 INFO MapOutputTrackerMasterEndpoint:
>>> MapOutputTrackerMasterEndpoint stopped!
>>>
>>> 16/05/03 10:49:17 INFO MemoryStore: MemoryStore cleared
>>>
>>> 16/05/03 10:49:17 INFO BlockManager: BlockManager stopped
>>>
>>> Exception: on aws-java-1.7.4 and hadoop-aws-2.7.2
>>>
>>> 16/05/03 10:23:40 INFO Slf4jLogger: Slf4jLogger started
>>>
>>> 16/05/03 10:23:40 INFO Remoting: Starting remoting
>>>
>>> 16/05/03 10:23:40 INFO Remoting: Remoting started; listening on
>>> addresses :[akka.tcp://sparkDriverActorSystem@10.65.80.125:61860]
>>>
>>> 16/05/03 10:23:40 INFO Utils: Successfully started service
>>> 'sparkDriverActorSystem' on 

spark w/ scala 2.11 and PackratParsers

2016-05-04 Thread matd
Hi folks,

Our project is a mess of scala 2.10 and 2.11, so I tried to switch
everything to 2.11.

I had some exasperating errors like this :

java.lang.NoClassDefFoundError:
org/apache/spark/sql/execution/datasources/DDLParser
at org.apache.spark.sql.SQLContext.(SQLContext.scala:208)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:77)
at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1295)

... that I was unable to fix, until I figured out that this error came first
:

java.lang.NoClassDefFoundError: scala/util/parsing/combinator/PackratParsers
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

...that I finally managed to fix by adding this dependency:
"org.scala-lang.modules"  %% "scala-parser-combinators" % "1.0.4"

As this is not documented anywhere, I'd like to know if it's just a missing
doc somewhere, or if it's hiding another problem that will jump out at me at
some point?

Mathieu




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-w-scala-2-11-and-PackratParsers-tp26877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error from reading S3 in Scala

2016-05-04 Thread James Hammerton
On 3 May 2016 at 17:22, Gourav Sengupta  wrote:

> Hi,
>
> The best thing to do is start the EMR clusters with proper permissions in
> the roles that way you do not need to worry about the keys at all.
>
> Another thing, why are we using s3a// instead of s3:// ?
>

Probably because of what's said about s3:// and s3n:// here (which is why I
use s3a://):

https://wiki.apache.org/hadoop/AmazonS3

Regards,

James


> Besides that you can increase s3 speeds using the instructions mentioned
> here:
> https://aws.amazon.com/blogs/aws/aws-storage-update-amazon-s3-transfer-acceleration-larger-snowballs-in-more-regions/
>
>
> Regards,
> Gourav
>
> On Tue, May 3, 2016 at 12:04 PM, Steve Loughran 
> wrote:
>
>> don't put your secret in the URI, it'll only creep out in the logs.
>>
>> Use the specific properties coverd in
>> http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html,
>> which you can set in your spark context by prefixing them with spark.hadoop.
>>
>> you can also set the env vars, AWS_ACCESS_KEY_ID and
>> AWS_SECRET_ACCESS_KEY; SparkEnv will pick these up and set the relevant
>> spark context keys for you
>>
>>
>> On 3 May 2016, at 01:53, Zhang, Jingyu  wrote:
>>
>> Hi All,
>>
>> I am using Eclipse with Maven for developing Spark applications. I got a
>> error for Reading from S3 in Scala but it works fine in Java when I run
>> them in the same project in Eclipse. The Scala/Java code and the error in
>> following
>>
>>
>> Scala
>>
>> val uri = URI.create("s3a://" + key + ":" + seckey + "@" +
>> "graphclustering/config.properties");
>> val pt = new Path("s3a://" + key + ":" + seckey + "@" +
>> "graphclustering/config.properties");
>> val fs = FileSystem.get(uri,ctx.hadoopConfiguration);
>> val  inputStream:InputStream = fs.open(pt);
>>
>> Exception: on aws-java-1.7.4 and hadoop-aws-2.6.1
>>
>> Exception in thread "main"
>> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service:
>> Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID:
>> 8A56DC7BF0BFF09A), S3 Extended Request ID
>>
>> at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(
>> AmazonHttpClient.java:1160)
>>
>> at com.amazonaws.http.AmazonHttpClient.executeOneRequest(
>> AmazonHttpClient.java:748)
>>
>> at com.amazonaws.http.AmazonHttpClient.executeHelper(
>> AmazonHttpClient.java:467)
>>
>> at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302)
>>
>> at com.amazonaws.services.s3.AmazonS3Client.invoke(
>> AmazonS3Client.java:3785)
>>
>> at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
>> AmazonS3Client.java:1050)
>>
>> at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
>> AmazonS3Client.java:1027)
>>
>> at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(
>> S3AFileSystem.java:688)
>>
>> at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:222)
>>
>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>>
>> at com.news.report.graph.GraphCluster$.main(GraphCluster.scala:53)
>>
>> at com.news.report.graph.GraphCluster.main(GraphCluster.scala)
>>
>> 16/05/03 10:49:17 INFO SparkContext: Invoking stop() from shutdown hook
>>
>> 16/05/03 10:49:17 INFO SparkUI: Stopped Spark web UI at
>> http://10.65.80.125:4040
>>
>> 16/05/03 10:49:17 INFO MapOutputTrackerMasterEndpoint:
>> MapOutputTrackerMasterEndpoint stopped!
>>
>> 16/05/03 10:49:17 INFO MemoryStore: MemoryStore cleared
>>
>> 16/05/03 10:49:17 INFO BlockManager: BlockManager stopped
>>
>> Exception: on aws-java-1.7.4 and hadoop-aws-2.7.2
>>
>> 16/05/03 10:23:40 INFO Slf4jLogger: Slf4jLogger started
>>
>> 16/05/03 10:23:40 INFO Remoting: Starting remoting
>>
>> 16/05/03 10:23:40 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkDriverActorSystem@10.65.80.125:61860]
>>
>> 16/05/03 10:23:40 INFO Utils: Successfully started service
>> 'sparkDriverActorSystem' on port 61860.
>>
>> 16/05/03 10:23:40 INFO SparkEnv: Registering MapOutputTracker
>>
>> 16/05/03 10:23:40 INFO SparkEnv: Registering BlockManagerMaster
>>
>> 16/05/03 10:23:40 INFO DiskBlockManager: Created local directory at
>> /private/var/folders/sc/tdmkbvr1705b8p70xqj1kqks5l9p
>>
>> 16/05/03 10:23:40 INFO MemoryStore: MemoryStore started with capacity
>> 1140.4 MB
>>
>> 16/05/03 10:23:40 INFO SparkEnv: Registering OutputCommitCoordinator
>>
>> 16/05/03 10:23:40 INFO Utils: Successfully started service 'SparkUI' on
>> port 4040.
>>
>> 16/05/03 10:23:40 INFO SparkUI: Started SparkUI at
>> http://10.65.80.125:4040
>>
>> 16/05/03 10:23:40 INFO Executor: Starting executor ID driver on host
>> localhost
>>
>> 16/05/03 10:23:40 INFO Utils: Successfully started service
>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61861.
>>
>> 16/05/03 10:23:40 INFO NettyBlockTransferService: Server created on 61861
>>
>> 16/05/03 10:23:40 INFO BlockManagerMaster: Trying to register BlockManager

restrict my spark app to run on specific machines

2016-05-04 Thread Shams ul Haque
Hi,

I have a cluster of 4 machines for Spark. I want my Spark app to run on 2
machines only, and the remaining 2 machines to be available for other Spark apps.
So my question is: can I restrict my app to run on those 2 machines only, by
passing some IPs at the time of setting SparkConf or by any other setting?


Thanks,
Shams


Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Andrés Ivaldi
Yes, I can do that; it's what we are doing now. But I think the best
approach would be to delegate the create-table action to Spark.
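
The closest hook I have found for that is Spark's JdbcDialect developer API:
register a custom dialect whose getJDBCType overrides the defaults before writing.
A sketch (the exact type choices, e.g. VARCHAR(MAX) for strings, are only
illustrative):

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object MsSqlDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case StringType  => Some(JdbcType("VARCHAR(MAX)", Types.VARCHAR))
    case _           => None   // fall back to Spark's default mapping
  }
}

JdbcDialects.registerDialect(MsSqlDialect)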

On Tue, May 3, 2016 at 8:17 PM, Mich Talebzadeh 
wrote:

> Can you create the MSSQL (target) table first with the correct column
> setting and insert data from Spark to it with JDBC as opposed to JDBC
> creating target table itself?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 May 2016 at 22:19, Andrés Ivaldi  wrote:
>
>> Ok, Spark MSSQL dataType mapping is not right for me, ie. string is Text
>> instead of varchar(MAX) , so how can I override default SQL Mapping?
>>
>> regards.
>>
>> On Sun, May 1, 2016 at 5:23 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well if MSSQL cannot create that column then it is more like
>>> compatibility between Spark and RDBMS.
>>>
>>> What value that column has in MSSQL. Can you create table the table in
>>> MSSQL database or map it in Spark to a valid column before opening JDBC
>>> connection?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 April 2016 at 16:16, Andrés Ivaldi  wrote:
>>>
 Hello, Spark is executing a create table sentence (using JDBC) to
 MSSQLServer with a mapping column type like ColName Bit(1) for boolean
 types, This create table cannot be executed on MSSQLServer.

 In class JdbcDialect the mapping for Boolean type is Bit(1), so the
 question is, this is a problem of spark or JDBC driver who is not mapping
 right?

 Anyway it´s possible to override that mapping in Spark?

 Regards

 --
 Ing. Ivaldi Andres

>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Spark Select Statement

2016-05-04 Thread Sree Eedupuganti
Hello Spark users, can we run a SQL SELECT statement in Spark using
Java?
If it is possible, any suggestions please. I tried it like this. How do I pass
the database name?
Here my database name is nimbus and the table name is winbox_opens.

Source Code:

public class Select {
  public static class SquareKey implements Function<Row, Integer> {
    public Integer call(Row row) throws Exception {
      return row.getInt(0) * row.getInt(0);
    }
  }

  public static void main(String[] args) throws Exception {
    SparkConf s = new SparkConf().setMaster("local[2]").setAppName("Select");
    SparkContext sc = new SparkContext(s);
    HiveContext hc = new HiveContext(sc);
    DataFrame rdd = hc.sql("SELECT * FROM winbox_opens");
    JavaRDD<Integer> squaredKeys = rdd.toJavaRDD().map(new SquareKey());
    List<Integer> result = squaredKeys.collect();
    for (Integer elem : result) {
      System.out.println(elem);
    }
  }
}

Error: Exception in thread "main" org.apache.spark.sql.AnalysisException:
no such table winbox_prod_action_logs_1; line 1 pos 14

-- 
Best Regards,
Sreeharsha Eedupuganti
Data Engineer
innData Analytics Private Limited


Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Alonso Isidoro Roman
Andy, I think there are some ideas about implementing a pool of Spark
contexts, but for now it is only an idea.


https://github.com/spark-jobserver/spark-jobserver/issues/365


It is possible to share a Spark context between apps; I did not have to use
this feature, sorry about that.

Regards,

Alonso



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 11:08 GMT+02:00 Tobias Eriksson :

> Hi Andy,
>  We have a very simple approach I think, we do like this
>
>1. Submit our Spark application to the Spark Master (version 1.6.1.)
>2. Our Application creates a Spark Context that we use throughout
>3. We use Spray REST server
>4. Every request that comes in we simply serve by querying Cassandra
>doing some joins and some processing, and returns JSON as a result back on
>the REST-API
>5. We to take advantage of co-locating the Spark Workers with the
>Cassandra Nodes to “boost” performance (in our test lab we have a 4 node
>cluster)
>
> Performance wise we have had some challenges but that has had to do with
> how the data was arranged in Cassandra, after changing to the
> time-series-design-pattern we improved our performance dramatically, 750
> times in our test lab.
>
> But now the problem is that we have more Spark applications running
> concurrently/in parallell and we are then forced to scale down on the
> number of cores that OUR application can use to ensure that we give way for
> other applications to come in a “play” too. This is not optimal, cause if
> there is free resources then I would like to use them
>
> When it comes to having load balancing the REST requests, then in my case
> I will not have that many clients, yet in my case I think that I could
> scale by adding multiple instances of my Spark Applications, but would
> obviously suffer in having to share the resources between the different
> Spark Workers (say cores). Or I would have to use dynamic resourcing.
> But as I started out my question here this is where I struggle, I need to
> get this right with sharing the resources.
> This is a challenges since I rely on that I HAVE TO co-locate the Spark
> Workers and Cassandra Nodes, meaning that I can not have 3 out of 4 nodes,
> cause then the Cassandra access will not be efficient since I use
> repartitionByCassandraReplica()
>
> Satisfying 250ms requests, well that depends very much on your use case I
> would say, and boring answer :-( sorry
>
> Regards
>  Tobias
>
> From: Andy Davidson 
> Date: Tuesday 3 May 2016 at 17:26
> To: Tobias Eriksson , "user@spark.apache.org"
> 
> Subject: Re: Multiple Spark Applications that use Cassandra, how to share
> resources/nodes
>
> Hi Tobias
>
> I am very interested implemented rest based api on top of spark. My rest
> based system would make predictions from data provided in the request using
> models trained in batch. My SLA is 250 ms.
>
> Would you mind sharing how you implemented your rest server?
>
> I am using spark-1.6.1. I have several unit tests that create spark
> context, master is set to ‘local[4]’. I do not think the unit test frame is
> going to scale. Can each rest server have a pool of sparks contexts?
>
>
> The system would like to replacing is set up as follows
>
> Layer of dumb load balancers: l1, l2, l3
> Layer of proxy servers:   p1, p2, p3, p4, p5, ….. Pn
> Layer of containers:  c1, c2, c3, ….. Cn
>
> Where Cn is much larger than Pn
>
>
> Kind regards
>
> Andy
>
> P.s. There is a talk on 5/5 about spark 2.0 Hoping there is something in
> the near future.
>
> https://www.brighttalk.com/webcast/12891/202021?utm_campaign=google-calendar_content=_source=brighttalk-portal_medium=calendar_term=
>
> From: Tobias Eriksson 
> Date: Tuesday, May 3, 2016 at 7:34 AM
> To: "user @spark" 
> Subject: Multiple Spark Applications that use Cassandra, how to share
> resources/nodes
>
> Hi
>  We are using Spark for a long running job, in fact it is a REST-server
> that does some joins with some tables in Casandra and returns the result.
> Now we need to have multiple applications running in the same Spark
> cluster, and from what I understand this is not possible, or should I say
> somewhat complicated
>
>1. A Spark application takes all the resources / nodes in the cluster
>(we have 4 nodes one for each Cassandra Node)
>2. A Spark application returns it’s resources when it is done (exits
>or the context is closed/returned)
>3. Sharing resources using Mesos only allows scaling 

RE: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Michel Hubert
We're running Kafka 0.8.2.2
Is that the problem, why?

-Oorspronkelijk bericht-
Van: Sean Owen [mailto:so...@cloudera.com] 
Verzonden: woensdag 4 mei 2016 10:41
Aan: Michel Hubert 
CC: user@spark.apache.org
Onderwerp: Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0

Please try the CDH forums; this is the Spark list:
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark

Before you even do that, I can tell you to double check you're running Kafka 
0.9.

On Wed, May 4, 2016 at 9:29 AM, Michel Hubert  wrote:
>
>
> Hi,
>
>
>
> After an upgrade to CDH 5.7.0 we have troubles with the Kafka to Spark 
> Streaming.
>
>
>
> The example jar doesn’t work:
>
>
>
> /opt/cloudera/parcels/CDH/lib/spark/bin/run-example 
> streaming.KafkaWordCount ….
>
>
>
> Attached is a log file.
>
>
>
> 16/05/04 10:06:23 WARN consumer.ConsumerFetcherThread:
> [ConsumerFetcherThread-wordcount1_host81436-cld.domain.com-14623491821
> 72-13b9f2e7-0-2], Error in fetch 
> kafka.consumer.ConsumerFetcherThread$FetchRequest@47fb7e6d.
> Possible cause: java.nio.BufferUnderflowException
>
> 16/05/04 10:06:24 WARN consumer.ConsumerFetcherThread:
> [ConsumerFetcherThread-wordcount1_host81436-cld.domain.com-14623491821
> 72-13b9f2e7-0-1], Error in fetch 
> kafka.consumer.ConsumerFetcherThread$FetchRequest@73b9a762.
> Possible cause: java.lang.IllegalArgumentException
>
>
>
> We have no problem running this from my development pc, only in 
> product op CDH 5.7.0 environment.
>
>
>
>
>
> Any ideas?
>
>
>
> Thanks,
>
> Michel
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org



substitute mapPartitions by distinct

2016-05-04 Thread Batselem
Hi, I am trying to remove duplicates from a set of RDD tuples in an iterative
algorithm. I have discovered that it is possible to substitute RDD
mapPartitions for RDD distinct.
First I partitioned the RDD and deduplicated it locally using the mapPartitions
transformation. I expect this to be much faster in an iterative
algorithm. I checked the two results and they were equal, but I am not sure it
works correctly. My concern is that it does not guarantee that all duplicates
are removed accurately, because hash codes can sometimes collide, and if
duplicates end up in different partitions the following code doesn't work;
all copies of the same duplicate need to be in the same partition. Any
suggestion will be appreciated.

Code: 
PairRDD = inputPairs.partitionBy(new HashPartitioner(slices)) 

val distinctCount = PairRDD.distinct().count() 

val mapPartitionCount = PairRDD.mapPartitions(iterator => { 
  iterator.toList.distinct.toIterator 
}, true).count() 

println("distinct : " + distinctCount) 

println("mapPartitionCount : " + mapPartitionCount)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/substitute-mapPartitions-by-distinct-tp26876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Tobias Eriksson
Hi Andy,
 We have a very simple approach I think, we do like this

  1.  Submit our Spark application to the Spark Master (version 1.6.1.)
  2.  Our Application creates a Spark Context that we use throughout
  3.  We use Spray REST server
  4.  Every request that comes in we simply serve by querying Cassandra, doing 
some joins and some processing, and returning JSON as a result back on the 
REST API
  5.  We take advantage of co-locating the Spark Workers with the Cassandra 
Nodes to “boost” performance (in our test lab we have a 4 node cluster)

Performance-wise we have had some challenges, but that had to do with how 
the data was arranged in Cassandra; after changing to the 
time-series design pattern we improved our performance dramatically, 750 times 
in our test lab.

But now the problem is that we have more Spark applications running 
concurrently/in parallel, and we are then forced to scale down the number of 
cores that OUR application can use, to ensure that we give way for other 
applications to come in and “play” too. This is not optimal, because if there is 
free capacity then I would like to use it.

When it comes to load balancing the REST requests, in my case I 
will not have that many clients, yet I think that I could scale by 
adding multiple instances of my Spark application, but I would obviously suffer 
from having to share the resources between the different Spark Workers (say 
cores). Or I would have to use dynamic resourcing.
But, as I started out my question here, this is where I struggle: I need to get 
this right with sharing the resources.
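(For reference, the dynamic resourcing route in standalone mode would look roughly
like the sketch below; it needs the external shuffle service enabled on every
worker, and the min/max values are placeholders. The static alternative is simply
to cap each application with spark.cores.max.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")      // must also be enabled on each worker
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "4")
// or, with static allocation, cap the app instead:
// .set("spark.cores.max", "12")
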
This is a challenge since I rely on the fact that I HAVE TO co-locate the Spark Workers 
and Cassandra nodes, meaning that I cannot use just 3 out of 4 nodes, because then 
the Cassandra access will not be efficient, since I use 
repartitionByCassandraReplica()

Satisfying 250 ms requests: well, that depends very much on your use case I would 
say. Boring answer :-( sorry

Regards
 Tobias

From: Andy Davidson 
>
Date: Tuesday 3 May 2016 at 17:26
To: Tobias Eriksson 
>, 
"user@spark.apache.org" 
>
Subject: Re: Multiple Spark Applications that use Cassandra, how to share 
resources/nodes

Hi Tobias

I am very interested implemented rest based api on top of spark. My rest based 
system would make predictions from data provided in the request using models 
trained in batch. My SLA is 250 ms.

Would you mind sharing how you implemented your rest server?

I am using spark-1.6.1. I have several unit tests that create spark context, 
master is set to ‘local[4]’. I do not think the unit test frame is going to 
scale. Can each rest server have a pool of sparks contexts?


The system would like to replacing is set up as follows

Layer of dumb load balancers: l1, l2, l3
Layer of proxy servers:   p1, p2, p3, p4, p5, ….. Pn
Layer of containers:  c1, c2, c3, ….. Cn

Where Cn is much larger than Pn


Kind regards

Andy

P.s. There is a talk on 5/5 about spark 2.0 Hoping there is something in the 
near future.
https://www.brighttalk.com/webcast/12891/202021?utm_campaign=google-calendar_content=_source=brighttalk-portal_medium=calendar_term=

From: Tobias Eriksson 
>
Date: Tuesday, May 3, 2016 at 7:34 AM
To: "user @spark" >
Subject: Multiple Spark Applications that use Cassandra, how to share 
resources/nodes

Hi
 We are using Spark for a long running job, in fact it is a REST-server that 
does some joins with some tables in Casandra and returns the result.
Now we need to have multiple applications running in the same Spark cluster, 
and from what I understand this is not possible, or should I say somewhat 
complicated

  1.  A Spark application takes all the resources / nodes in the cluster (we 
have 4 nodes one for each Cassandra Node)
  2.  A Spark application returns it’s resources when it is done (exits or the 
context is closed/returned)
  3.  Sharing resources using Mesos only allows scaling down and then scaling 
up by a step-by-step policy, i.e. 2 nodes, 3 nodes, 4 nodes, … And increases as 
the need increases

But if this is true, I can not have several applications running in parallell, 
is that true ?
If I use Mesos then the whole idea with one Spark Worker per Cassandra Node 
fails, as it talks directly to a node, and that is how it is so efficient.
In this case I need all nodes, not 3 out of 4.

Any mistakes in my thinking ?
Any ideas on how to solve this ? Should be a common problem I think

-Tobias




Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Sean Owen
Please try the CDH forums; this is the Spark list:
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark

Before you even do that, I can tell you to double check you're running
Kafka 0.9.

On Wed, May 4, 2016 at 9:29 AM, Michel Hubert  wrote:
>
>
> Hi,
>
>
>
> After an upgrade to CDH 5.7.0 we have troubles with the Kafka to Spark
> Streaming.
>
>
>
> The example jar doesn’t work:
>
>
>
> /opt/cloudera/parcels/CDH/lib/spark/bin/run-example streaming.KafkaWordCount
> ….
>
>
>
> Attached is a log file.
>
>
>
> 16/05/04 10:06:23 WARN consumer.ConsumerFetcherThread:
> [ConsumerFetcherThread-wordcount1_host81436-cld.domain.com-1462349182172-13b9f2e7-0-2],
> Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@47fb7e6d.
> Possible cause: java.nio.BufferUnderflowException
>
> 16/05/04 10:06:24 WARN consumer.ConsumerFetcherThread:
> [ConsumerFetcherThread-wordcount1_host81436-cld.domain.com-1462349182172-13b9f2e7-0-1],
> Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@73b9a762.
> Possible cause: java.lang.IllegalArgumentException
>
>
>
> We have no problem running this from my development pc, only in product op
> CDH 5.7.0 environment.
>
>
>
>
>
> Any ideas?
>
>
>
> Thanks,
>
> Michel
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: migration from Teradata to Spark SQL

2016-05-04 Thread Alonso Isidoro Roman
I agree with Deepak, and I would try to save the data in both Parquet and Avro
format if you can; measure the performance and choose the best. It will
probably be Parquet, but you have to find out for yourself.

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 9:22 GMT+02:00 Jörn Franke :

> Look at lambda architecture.
>
> What is the motivation of your migration?
>
> On 04 May 2016, at 03:29, Tapan Upadhyay  wrote:
>
> Hi,
>
> We are planning to move our adhoc queries from teradata to spark. We have
> huge volume of queries during the day. What is best way to go about it -
>
> 1) Read data directly from teradata db using spark jdbc
>
> 2) Import data using sqoop by EOD jobs into hive tables stored as parquet
> and then run queries on hive tables using spark sql or spark hive context.
>
> any other ways through which we can do it in a better/efficiently?
>
> Please guide.
>
> Regards,
> Tapan
>
>


Re: migration from Teradata to Spark SQL

2016-05-04 Thread Jörn Franke
Look at lambda architecture.

What is the motivation of your migration?

> On 04 May 2016, at 03:29, Tapan Upadhyay  wrote:
> 
> Hi,
> 
> We are planning to move our adhoc queries from teradata to spark. We have 
> huge volume of queries during the day. What is best way to go about it - 
> 
> 1) Read data directly from teradata db using spark jdbc
> 
> 2) Import data using sqoop by EOD jobs into hive tables stored as parquet and 
> then run queries on hive tables using spark sql or spark hive context.
> 
> any other ways through which we can do it in a better/efficiently?
> 
> Please guide.
> 
> Regards,
> Tapan
> 


Re: migration from Teradata to Spark SQL

2016-05-04 Thread Mich Talebzadeh
Hi,

How are you going to sync your data following migration?

Spark SQL is a tool for querying data. It is not a database per se like
Hive or anything else.

I am just doing the same migrating Sybase IQ to Hive.

Sqoop can do the initial ELT (read ELT, not ETL). In other words, use Sqoop
to get the data as-is from Teradata into a Hive table and then use Hive for
further cleansing etc.

It all depends on how you want to approach this, how many tables are
involved and your schema. For example, are we talking about FACT tables
only? You can easily keep your DIMENSION tables in Teradata and use Spark
SQL to load data from both Teradata and Hive.
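
As a rough sketch of that last point (connection details, driver class and table
names are placeholders, sqlContext here is a HiveContext, and the Teradata JDBC
driver jar has to be on the classpath):

// Small DIMENSION table read over JDBC, FACT table read from Hive.
val dimDF = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:teradata://td-host/DATABASE=dw",
  "driver"   -> "com.teradata.jdbc.TeraDriver",
  "dbtable"  -> "dim_customer",
  "user"     -> "etl_user",
  "password" -> "etl_password"
)).load()

val factDF = sqlContext.sql("SELECT * FROM warehouse.fact_sales")   // loaded earlier by Sqoop
factDF.join(dimDF, "customer_id").show(10)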

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 4 May 2016 at 02:29, Tapan Upadhyay  wrote:

> Hi,
>
> We are planning to move our adhoc queries from teradata to spark. We have
> huge volume of queries during the day. What is best way to go about it -
>
> 1) Read data directly from teradata db using spark jdbc
>
> 2) Import data using sqoop by EOD jobs into hive tables stored as parquet
> and then run queries on hive tables using spark sql or spark hive context.
>
> any other ways through which we can do it in a better/efficiently?
>
> Please guide.
>
> Regards,
> Tapan
>
>