Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
Thanks
So,

1) For joins (stream-batch): are all types of joins supported (inner, left
outer, etc.) or only specific ones?
Also, what is the timeline for complete support, i.e. stream-stream joins?

2) So outputMode is now exposed via DataFrameWriter but will only work in the
specific cases you mentioned? We were looking for delta and append output
modes for aggregation/groupBy. What is the timeline for that?

Thanks
Ravi






Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
I accidentally deleted the original post.

So I am just pasting the response from Tathagata Das:

Joins are supported, but only stream-batch joins.
Output modes were added late last week; currently append mode is supported for
non-aggregation queries and complete mode for aggregation queries.
And with complete mode, groupBy should also be supported now.
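
For reference, here is a minimal sketch of what that looks like against the
Spark 2.0 preview API; the paths, schema and column names are made up for
illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object StreamBatchJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamBatchJoinSketch").getOrCreate()

    // Static (batch) side of the join: reference data read once.
    val users = spark.read.json("/data/users")   // assumed columns: userId, country

    // Streaming side: file streams need their schema declared up front.
    val eventSchema = new StructType()
      .add("userId", LongType)
      .add("action", StringType)
    val events = spark.readStream.schema(eventSchema).json("/data/events")

    // Stream-batch (inner) join followed by a groupBy aggregation.
    val counts = events.join(users, "userId").groupBy("country").count()

    // Aggregation queries currently need the "complete" output mode;
    // non-aggregation queries would use "append" instead.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}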






Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-05 Thread raaggarw
Hi, 
I am Ravi, a computer scientist @ Adobe Systems. We have been actively using
Spark for our internal projects. Recently we had a need for ETL on streaming
data, so we were exploring Spark 2.0 for that.
*But as far as I can see, the streaming DataFrames do not support basic
operations like joins, groupBy etc. Also, outputMode is not something that
is exposed via DataFrameWriter as of now.*
*So I wanted to know what the timelines are for adding support for all these
things in Spark.* Based on that we will have to choose the right
direction to meet the timelines of our project at Adobe.

Thanks 
Ravi 
Computer Scientist 
raagg...@adobe.com







Performance of Spark/MapReduce

2016-06-05 Thread Deepak Goel
Hey

Namaskara~Nalama~Guten Tag~Bonjour


Sorry about that (The question might still be general as I am new to Spark).

My question is:

Spark claims to be 10x faster on disk and 100x faster in memory compared to
MapReduce. Is there any benchmark paper that sketches out the details? Does
the benchmark hold for all applications/platforms, or only for particular
ones?

Also, has anyone studied which changes in Spark, compared to MapReduce, cause
the performance improvement?

For example:

Change A in Spark vs. MapReduce (multiple spill files in the mapper) -> %
reduction in the number of instructions -> 2x performance benefit -> any
disadvantages, such as availability, or conditions like the system needing
multiple disk I/O channels

Change B in Spark vs. MapReduce (difference in data consolidation in the
reducer) -> % reduction in the number of instructions -> 1.5x performance
benefit -> any disadvantages, such as availability

And so on...

Also, has a cost analysis been included in such a study? Any case studies?

Deepak











===

Two questions:

1. Is this related to the thread in any way? If not, please start a new
one, otherwise you confuse people like myself.

2. This question is so general; do you understand the similarities and
differences between Spark and MapReduce? Learn first, then ask questions.

Spark can map-reduce.
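
To make "Spark can map-reduce" concrete, the canonical word count is literally
a map phase followed by a reduce phase; a sketch, assuming an existing
SparkContext sc and made-up paths:

// Map phase: split lines into (word, 1) pairs; reduce phase: sum counts per key.
val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/wordcounts")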

Sent from my iPhone

On Jun 5, 2016, at 4:37 PM, Deepak Goel  wrote:

Hello

Sorry, I am new to Spark.

Spark claims it can do everything MapReduce can do (and more!), but 10x
faster on disk and 100x faster in memory. Why would I then use MapReduce at
all?

Thanks
Deepak

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"



   --
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"


Unsubscribe

2016-06-05 Thread goutham koneru



RE: GraphX Java API

2016-06-05 Thread Santoshakhilesh
Ok, thanks for letting me know. Yes, since Java and Scala programs ultimately
run on the JVM, the APIs written in one language can be called from the other.
When I used GraphX (around the beginning of 2015) the native Java APIs were
not available for GraphX, so I chose to develop my application in Scala. It
turned out much simpler to develop in Scala due to some of its powerful
features like lambdas, map, filter, etc., which were not available to me in
Java 7.
Regards,
Santosh Akhilesh

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: 01 June 2016 00:56
To: Santoshakhilesh
Cc: Kumar, Abhishek (US - Bengaluru); user@spark.apache.org; Golatkar, Jayesh 
(US - Bengaluru); Soni, Akhil Dharamprakash (US - Bengaluru); Matta, Rishul (US 
- Bengaluru); Aich, Risha (US - Bengaluru); Kumar, Rajinish (US - Bengaluru); 
Jain, Isha (US - Bengaluru); Kumar, Sandeep (US - Bengaluru)
Subject: Re: GraphX Java API

It's very much possible to use GraphX through Java, though some boilerplate may
be needed. Here is an example.

Create a graph from an edge and a vertex RDD (JavaRDD<Tuple2<Object, Long>>
vertices, JavaRDD<Edge<Long>> edges):

ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
Graph<Long, Long> graph = Graph.apply(vertices.rdd(), edges.rdd(), 0L,
        StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
        longTag, longTag);



Then you can basically call graph.ops() and use the available operations like
triangleCount etc.

Best Regards,
Sonal
Founder, Nube Technologies
Reifier at Strata Hadoop World
Reifier at Spark Summit 
2015




On Tue, May 31, 2016 at 11:40 AM, Santoshakhilesh 
> wrote:
Hi ,
Scala has a similar package structure to Java and it ultimately runs on the
JVM, so you probably got the impression that it is Java.
As far as I know there is no Java API for GraphX. I used GraphX last year
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) 
[mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,
I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx
•   org.apache.spark.graphx.impl
•   org.apache.spark.graphx.lib
•   org.apache.spark.graphx.util
Aren’t they meant to be used with Java?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) 
>; 
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX you need to
switch to Scala.
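
For anyone starting from scratch, a minimal Scala sketch; the vertex/edge data
and the triangle-count call are only illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

def graphSketch(sc: SparkContext): Unit = {
  val vertices: RDD[(VertexId, String)] =
    sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
  val edges: RDD[Edge[Int]] =
    sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)))

  // Build the graph; GraphOps methods (triangleCount, pageRank, ...) are pulled in implicitly.
  val graph = Graph(vertices, edges, defaultVertexAttr = "missing")
    .partitionBy(PartitionStrategy.RandomVertexCut)   // triangleCount expects a partitioned graph

  graph.triangleCount().vertices.collect().foreach(println)
}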

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar






This message (including any attachments) contains confidential information 
intended for a specific individual and purpose, and is protected by law. If you 
are not the intended recipient, you should delete this message and any 
disclosure, copying, or distribution of this message, or the taking of any 
action based on it, by you is strictly prohibited.

v.E.1










Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
Indeed!

I wasn't able to get this to work in cluster mode, yet, but increasing
driver and executor stack sizes in client mode (still running on a YARN EMR
cluster) got it to work! I'll fiddle more.

FWIW, I used

spark-submit --deploy-mode client --conf
"spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920" --conf
"spark.driver.extraJavaOptions=-XX:ThreadStackSize=81920" 

Thank you so much!

On Sun, Jun 5, 2016 at 2:34 PM, Eugene Morozov 
wrote:

> Everett,
>
> try to increase thread stack size. To do that run your application with
> the following options (my app is a web application, so you might adjust
> something): -XX:ThreadStackSize=81920
> -Dspark.executor.extraJavaOptions="-XX:ThreadStackSize=81920"
>
> The number 81920 is memory in KB. You could try smth less. It's pretty
> memory consuming to have 80M for each thread (very simply there might be
> 100 of them), but this is just a workaround. This is configuration that I
> use to train random forest with input of 400k samples.
>
> Hope this helps.
>
> --
> Be well!
> Jean Morozov
>
> On Sun, Jun 5, 2016 at 11:17 PM, Everett Anderson <
> ever...@nuna.com.invalid> wrote:
>
>> Hi!
>>
>> I have a fairly simple Spark (1.6.1) Java RDD-based program that's
>> scanning through lines of about 1000 large text files of records and
>> computing some metrics about each line (record type, line length, etc).
>> Most are identical so I'm calling distinct().
>>
>> In the loop over the list of files, I'm saving up the resulting RDDs into
>> a List. After the loop, I use the JavaSparkContext union(JavaRDD...
>> rdds) method to collapse the tables into one.
>>
>> Like this --
>>
>> List allMetrics = ...
>> for (int i = 0; i < files.size(); i++) {
>>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>>JavaRDD distinctFileMetrics =
>> lines.flatMap(fn).distinct();
>>
>>allMetrics.add(distinctFileMetrics);
>> }
>>
>> JavaRDD finalOutput =
>> jsc.union(allMetrics.toArray(...)).coalesce(10);
>> finalOutput.saveAsTextFile(...);
>>
>> There are posts suggesting
>> 
>> that using JavaRDD union(JavaRDD other) many times creates a long
>> lineage that results in a StackOverflowError.
>>
>> However, I'm seeing the StackOverflowError even with JavaSparkContext
>> union(JavaRDD... rdds).
>>
>> Should this still be happening?
>>
>> I'm using the work-around from this 2014 thread
>> ,
>>  shown
>> below, which requires checkpointing to HDFS every N iterations, but it's
>> ugly and decreases performance.
>>
>> Is there a lighter weight way to compact the lineage? It looks like at
>> some point there might've been a "local checkpoint
>> "
>> feature?
>>
>> Work-around:
>>
>> List allMetrics = ...
>> for (int i = 0; i < files.size(); i++) {
>>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>>JavaRDD distinctFileMetrics =
>> lines.flatMap(fn).distinct();
>>allMetrics.add(distinctFileMetrics);
>>
>>// Union and checkpoint occasionally to reduce lineage
>>if (i % tablesPerCheckpoint == 0) {
>>JavaRDD dataSoFar =
>>jsc.union(allMetrics.toArray(...));
>>dataSoFar.checkpoint();
>>dataSoFar.count();
>>allMetrics.clear();
>>allMetrics.add(dataSoFar);
>>}
>> }
>>
>> When the StackOverflowError happens, it's a long trace starting with --
>>
>> 16/06/05 18:01:29 INFO YarnAllocator: Driver requested a total number of 
>> 20823 executor(s).
>> 16/06/05 18:01:29 WARN ApplicationMaster: Reporter thread fails 3 time(s) in 
>> a row.
>> java.lang.StackOverflowError
>>  at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>  at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>  at 
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>  at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>
>> ...
>>
>> Thanks!
>>
>> - Everett
>>
>>
>


Specify node where driver should run

2016-06-05 Thread Saiph Kappa
Hi,

In yarn-cluster mode, is there any way to specify on which node I want the
driver to run?

Thanks.


Performance of Spark/MapReduce

2016-06-05 Thread Deepak Goel
Hello

Sorry, I am new to Spark.

Spark claims it can do everything MapReduce can do (and more!), but 10x
faster on disk and 100x faster in memory. Why would I then use MapReduce at
all?

Thanks
Deepak

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Mon, Jun 6, 2016 at 2:44 AM, Jacek Laskowski  wrote:

> Hi,
>
> "I am supposed to work with akka and Hadoop in building apps on top of
> the data available in hadoop" <-- that's outside the topics covered in
> this mailing list (unless you're going to use Spark, too).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Jun 5, 2016 at 9:39 PM, KhajaAsmath Mohammed
>  wrote:
> > Hi Everyone,
> >
> > I am have done lot of examples in spark and have good overview of how it
> > works. I am going to join new project where I am supposed to work with
> akka
> > and Hadoop in building apps on top of the data available in hadoop.
> >
> > Does anyone have any use case of how this work or any tutorials. I am
> sorry
> > if I had asked this in wrong place.
> >
> > Thanks,
> > Asmath.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Eugene Morozov
Marco,

I'd say yes, because it uses a different implementation of Hadoop's
InputFormat interface underneath.
What kind of proof would you like to see?

--
Be well!
Jean Morozov

On Sun, Jun 5, 2016 at 12:50 PM, Marco Capuccini <
marco.capucc...@farmbio.uu.se> wrote:

> Dear all,
>
> Does Spark uses data locality information from HDFS, when running in
> standalone mode? Or is it running on YARN mandatory for such purpose? I
> can't find this information in the docs, and on Google I am only finding
> contrasting opinion on that.
>
> Regards
> Marco Capuccini
>


Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Thank you.
I added this as a dependency:
libraryDependencies += "com.databricks" % "apps.twitter_classifier" % "1.0.0"
I chose that number at the end arbitrarily; is that correct?
Also, in my TwitterAnalyzer.scala I added this line:
import com.databricks.apps.twitter_classifier._


Now I am getting this error:

[info] Resolving com.databricks#apps.twitter_classifier;1.0.0 ...
[warn]  module not found: com.databricks#apps.twitter_classifier;1.0.0
[warn]  local: tried
[warn]   /home/hduser/.ivy2/local/com.databricks/apps.twitter_classifier/1.0.0/ivys/ivy.xml
[warn]  public: tried
[warn]   https://repo1.maven.org/maven2/com/databricks/apps.twitter_classifier/1.0.0/apps.twitter_classifier-1.0.0.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: com.databricks#apps.twitter_classifier;1.0.0: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]          com.databricks:apps.twitter_classifier:1.0.0 (/home/hduser/scala/TwitterAnalyzer/build.sbt#L18-19)
[warn]            +- scala:scala_2.10:1.0

sbt.ResolveException: unresolved dependency: com.databricks#apps.twitter_classifier;1.0.0: not found
Any ideas?
regards, 

On Sunday, 5 June 2016, 22:22, Jacek Laskowski  wrote:
 

 On Sun, Jun 5, 2016 at 9:01 PM, Ashok Kumar
 wrote:

> Now I have added this
>
> libraryDependencies += "com.databricks" %  "apps.twitter_classifier"
>
> However, I am getting an error
>
>
> error: No implicit for Append.Value[Seq[sbt.ModuleID],
> sbt.impl.GroupArtifactID] found,
>  so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID]
> libraryDependencies += "com.databricks" %  "apps.twitter_classifier"
>                    ^
> [error] Type error in expression
> Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Missing version element, e.g.

libraryDependencies += "com.databricks" %  "apps.twitter_classifier" %
"VERSION_HERE"

Jacek




  

Re: StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Eugene Morozov
Everett,

try to increase thread stack size. To do that run your application with the
following options (my app is a web application, so you might adjust
something): -XX:ThreadStackSize=81920
-Dspark.executor.extraJavaOptions="-XX:ThreadStackSize=81920"

The number 81920 is memory in KB. You could try something less. It's pretty
memory-consuming to have 80M for each thread (there might easily be
100 of them), but this is just a workaround. This is the configuration that I
use to train a random forest with an input of 400k samples.

Hope this helps.
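
A rough sketch of setting the same thing programmatically (app name and value
are placeholders; the driver-side option normally has to be in place before
the driver JVM starts, e.g. via spark-submit or spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("deep-lineage-job")
  // Larger thread stacks on the executors.
  .set("spark.executor.extraJavaOptions", "-XX:ThreadStackSize=81920")
  // Driver equivalent; in practice pass this via spark-submit
  // --driver-java-options or spark-defaults.conf so it takes effect at JVM startup.
  .set("spark.driver.extraJavaOptions", "-XX:ThreadStackSize=81920")

val sc = new SparkContext(conf)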

--
Be well!
Jean Morozov

On Sun, Jun 5, 2016 at 11:17 PM, Everett Anderson 
wrote:

> Hi!
>
> I have a fairly simple Spark (1.6.1) Java RDD-based program that's
> scanning through lines of about 1000 large text files of records and
> computing some metrics about each line (record type, line length, etc).
> Most are identical so I'm calling distinct().
>
> In the loop over the list of files, I'm saving up the resulting RDDs into
> a List. After the loop, I use the JavaSparkContext union(JavaRDD...
> rdds) method to collapse the tables into one.
>
> Like this --
>
> List allMetrics = ...
> for (int i = 0; i < files.size(); i++) {
>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>JavaRDD distinctFileMetrics =
> lines.flatMap(fn).distinct();
>
>allMetrics.add(distinctFileMetrics);
> }
>
> JavaRDD finalOutput =
> jsc.union(allMetrics.toArray(...)).coalesce(10);
> finalOutput.saveAsTextFile(...);
>
> There are posts suggesting
> 
> that using JavaRDD union(JavaRDD other) many times creates a long
> lineage that results in a StackOverflowError.
>
> However, I'm seeing the StackOverflowError even with JavaSparkContext
> union(JavaRDD... rdds).
>
> Should this still be happening?
>
> I'm using the work-around from this 2014 thread
> ,
>  shown
> below, which requires checkpointing to HDFS every N iterations, but it's
> ugly and decreases performance.
>
> Is there a lighter weight way to compact the lineage? It looks like at
> some point there might've been a "local checkpoint
> "
> feature?
>
> Work-around:
>
> List allMetrics = ...
> for (int i = 0; i < files.size(); i++) {
>JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
>JavaRDD distinctFileMetrics =
> lines.flatMap(fn).distinct();
>allMetrics.add(distinctFileMetrics);
>
>// Union and checkpoint occasionally to reduce lineage
>if (i % tablesPerCheckpoint == 0) {
>JavaRDD dataSoFar =
>jsc.union(allMetrics.toArray(...));
>dataSoFar.checkpoint();
>dataSoFar.count();
>allMetrics.clear();
>allMetrics.add(dataSoFar);
>}
> }
>
> When the StackOverflowError happens, it's a long trace starting with --
>
> 16/06/05 18:01:29 INFO YarnAllocator: Driver requested a total number of 
> 20823 executor(s).
> 16/06/05 18:01:29 WARN ApplicationMaster: Reporter thread fails 3 time(s) in 
> a row.
> java.lang.StackOverflowError
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>
> ...
>
> Thanks!
>
> - Everett
>
>


Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Jacek Laskowski
On Sun, Jun 5, 2016 at 9:01 PM, Ashok Kumar
 wrote:

> Now I have added this
>
> libraryDependencies += "com.databricks" %  "apps.twitter_classifier"
>
> However, I am getting an error
>
>
> error: No implicit for Append.Value[Seq[sbt.ModuleID],
> sbt.impl.GroupArtifactID] found,
>   so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID]
> libraryDependencies += "com.databricks" %  "apps.twitter_classifier"
> ^
> [error] Type error in expression
> Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Missing version element, e.g.

libraryDependencies += "com.databricks" %  "apps.twitter_classifier" %
"VERSION_HERE"

Jacek




Re: Akka with Hadoop/Spark

2016-06-05 Thread Jacek Laskowski
Hi,

"I am supposed to work with akka and Hadoop in building apps on top of
the data available in hadoop" <-- that's outside the topics covered in
this mailing list (unless you're going to use Spark, too).

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Jun 5, 2016 at 9:39 PM, KhajaAsmath Mohammed
 wrote:
> Hi Everyone,
>
> I am have done lot of examples in spark and have good overview of how it
> works. I am going to join new project where I am supposed to work with akka
> and Hadoop in building apps on top of the data available in hadoop.
>
> Does anyone have any use case of how this work or any tutorials. I am sorry
> if I had asked this in wrong place.
>
> Thanks,
> Asmath.




StackOverflowError even with JavaSparkContext union(JavaRDD... rdds)

2016-06-05 Thread Everett Anderson
Hi!

I have a fairly simple Spark (1.6.1) Java RDD-based program that's scanning
through lines of about 1000 large text files of records and computing some
metrics about each line (record type, line length, etc). Most are identical
so I'm calling distinct().

In the loop over the list of files, I'm saving up the resulting RDDs into a
List. After the loop, I use the JavaSparkContext union(JavaRDD... rdds)
method to collapse the tables into one.

Like this --

List allMetrics = ...
for (int i = 0; i < files.size(); i++) {
   JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
   JavaRDD distinctFileMetrics =
lines.flatMap(fn).distinct();

   allMetrics.add(distinctFileMetrics);
}

JavaRDD finalOutput =
jsc.union(allMetrics.toArray(...)).coalesce(10);
finalOutput.saveAsTextFile(...);

There are posts suggesting

that using JavaRDD union(JavaRDD other) many times creates a long
lineage that results in a StackOverflowError.

However, I'm seeing the StackOverflowError even with JavaSparkContext
union(JavaRDD... rdds).

Should this still be happening?

I'm using the work-around from this 2014 thread
,
shown
below, which requires checkpointing to HDFS every N iterations, but it's
ugly and decreases performance.

Is there a lighter weight way to compact the lineage? It looks like at some
point there might've been a "local checkpoint" feature?
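
(A side note for the archive: RDD.localCheckpoint() has existed since Spark
1.5; a small Scala sketch of using it to cut the lineage without writing to
HDFS. It trades reliability for speed, since the checkpointed blocks live only
on the executors, and it is documented as incompatible with dynamic
allocation.)

import org.apache.spark.rdd.RDD

// Mark the RDD for local checkpointing, then force an action so the lineage
// is actually truncated before more unions are layered on top.
def compactLineage[T](metricsSoFar: RDD[T]): RDD[T] = {
  metricsSoFar.localCheckpoint()
  metricsSoFar.count()
  metricsSoFar
}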

Work-around:

List allMetrics = ...
for (int i = 0; i < files.size(); i++) {
   JavaPairRDD<...> lines = jsc.newAPIHadoopFile(...);
   JavaRDD distinctFileMetrics =
lines.flatMap(fn).distinct();
   allMetrics.add(distinctFileMetrics);

   // Union and checkpoint occasionally to reduce lineage
   if (i % tablesPerCheckpoint == 0) {
   JavaRDD dataSoFar =
   jsc.union(allMetrics.toArray(...));
   dataSoFar.checkpoint();
   dataSoFar.count();
   allMetrics.clear();
   allMetrics.add(dataSoFar);
   }
}

When the StackOverflowError happens, it's a long trace starting with --

16/06/05 18:01:29 INFO YarnAllocator: Driver requested a total number
of 20823 executor(s).
16/06/05 18:01:29 WARN ApplicationMaster: Reporter thread fails 3
time(s) in a row.
java.lang.StackOverflowError
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

...

Thanks!

- Everett


Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-05 Thread Daniel Darabos
If you fill up the cache, 1.6.0+ will suffer performance degradation from
GC thrashing. You can set spark.memory.useLegacyMode to true, or
spark.memory.fraction to 0.66, or spark.executor.extraJavaOptions to
-XX:NewRatio=3 to avoid this issue.

I think my colleague filed a ticket for this issue, but I can't find it
now. So treat it like unverified rumor for now, and try it for yourself if
you're out of better ideas :). Good luck!
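
For concreteness, a sketch of what those settings look like (pick one of the
three; the names and values just mirror the suggestions above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("streaming-job")
  // Option 1: fall back to the pre-1.6 static memory manager.
  .set("spark.memory.useLegacyMode", "true")
  // Option 2 (instead of 1): shrink the unified memory region.
  //.set("spark.memory.fraction", "0.66")
  // Option 3 (instead of 1 or 2): give executors a larger old generation.
  //.set("spark.executor.extraJavaOptions", "-XX:NewRatio=3")

val sc = new SparkContext(conf)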

On Sat, Jun 4, 2016 at 11:49 AM, Cosmin Ciobanu  wrote:

> Microbatch is 20 seconds. We’re not using window operations.
>
>
>
> The graphs are for a test cluster, and the entire load is artificially
> generated by load tests (100k / 200k generated sessions).
>
>
>
> We’ve performed a few more performance tests. On the same 5 node cluster,
> with the same application:
>
> · Spark 1.5.1 handled 170k+ generated sessions for 24hours with
> no scheduling delay – the limit seems to be around 180k, above which
> scheduling delay starts to increase;
>
> · Spark 1.6.1 had constant upward-trending scheduling delay from
> the beginning for 100k+ generated sessions (this is also mentioned in the
> initial post) – the load test was stopped after 25 minutes as scheduling
> delay reached 3,5 minutes.
>
>
>
> P.S. Florin and I will be in SF next week, attending the Spark Summit on
> Tuesday and Wednesday. We can meet and go into more details there - is
> anyone working on Spark Streaming available?
>
>
>
> Cosmin
>
>
>
>
>
> *From: *Mich Talebzadeh 
> *Date: *Saturday 4 June 2016 at 12:33
> *To: *Florin Broască 
> *Cc: *David Newberger , Adrian Tanase <
> atan...@adobe.com>, "user@spark.apache.org" ,
> ciobanu 
> *Subject: *Re: [REPOST] Severe Spark Streaming performance degradation
> after upgrading to 1.6.1
>
>
>
> batch interval I meant
>
>
>
> thx
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 4 June 2016 at 10:32, Mich Talebzadeh 
> wrote:
>
> I may have missed these but:
>
>
>
> What is the windows interval, windowsLength and SlidingWindow
>
>
>
> Has the volume of ingest data (Kafka streaming) changed recently that you
> may not be aware of?
>
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 4 June 2016 at 09:50, Florin Broască  wrote:
>
> Hi David,
>
>
>
> Thanks for looking into this. This is how the processing time looks like:
>
>
>
> [image: inline image 1: processing time graph]
>
>
>
> Appreciate any input,
>
> Florin
>
>
>
>
>
> On Fri, Jun 3, 2016 at 3:22 PM, David Newberger <
> david.newber...@wandcorp.com> wrote:
>
> What does your processing time look like. Is it consistently within that
> 20sec micro batch window?
>
>
>
> *David Newberger*
>
>
>
> *From:* Adrian Tanase [mailto:atan...@adobe.com]
> *Sent:* Friday, June 3, 2016 8:14 AM
> *To:* user@spark.apache.org
> *Cc:* Cosmin Ciobanu
> *Subject:* [REPOST] Severe Spark Streaming performance degradation after
> upgrading to 1.6.1
>
>
>
> Hi all,
>
>
>
> Trying to repost this question from a colleague on my team, somehow his
> subscription is not active:
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Severe-Spark-Streaming-performance-degradation-after-upgrading-to-1-6-1-td27056.html
>
>
>
> Appreciate any thoughts,
>
> -adrian
>
>
>
>
>
>
>


Akka with Hadoop/Spark

2016-06-05 Thread KhajaAsmath Mohammed
Hi Everyone,

I have done a lot of examples in Spark and have a good overview of how it
works. I am going to join a new project where I am supposed to work with Akka
and Hadoop in building apps on top of the data available in Hadoop.

Does anyone have any use case of how this works, or any tutorials? I am sorry
if I have asked this in the wrong place.

Thanks,
Asmath.


Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Hello, for #1 I read the doc as:
libraryDependencies += groupID % artifactID % revision

jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep CheckpointDirectory
  com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
  getCheckpointDirectory.class

Now I have added this
libraryDependencies += "com.databricks" %  "apps.twitter_classifier"

However, I am getting an error
error: No implicit for Append.Value[Seq[sbt.ModuleID], 
sbt.impl.GroupArtifactID] found,
  so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID]
libraryDependencies += "com.databricks" %  "apps.twitter_classifier"
^
[error] Type error in expression
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Any ideas very much appreciated.
Thank you


 

On Sunday, 5 June 2016, 17:39, Ted Yu  wrote:
 

For #1, please find examples on the net, e.g.
http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html

For #2,
import com.databricks.apps.twitter_classifier.getCheckpointDirectory
Cheers
On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar  wrote:

Thank you sir.
At compile time can I do something similar to
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"

I have these
name := "scala"
version := "1.0"
scalaVersion := "2.10.4"
And if I look at the jar file I have:

jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep Check
  1180 Sun Jun 05 10:14:36 BST 2016 com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
  1043 Sun Jun 05 10:14:36 BST 2016 getCheckpointDirectory.class
  1216 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask$class.class
   615 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask.class

Two questions please:
What do I need to put in the libraryDependencies line,
and what do I need to add to the top of the Scala app, like
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import ?
Thanks 



 

On Sunday, 5 June 2016, 15:21, Ted Yu  wrote:
 

 At compilation time, you need to declare the dependence on 
getCheckpointDirectory.
At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass 
the jar.
Cheers
On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar  
wrote:

Hi all,
Appreciate any advice on this. It is about Scala.

I have created a very basic Utilities.scala that contains a test class and
method. I intend to add my own classes and methods as I expand and make
references to these classes and methods in my other apps.

class getCheckpointDirectory {
  def CheckpointDirectory (ProgramName: String) : String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it. It is created as

utilities-assembly-0.1-SNAPSHOT.jar

Now I want to make a call to that method CheckpointDirectory in my app code
myapp.scala to return the value for hdfsDir:

   val ProgramName = this.getClass.getSimpleName.trim
   val getCheckpointDirectory = new getCheckpointDirectory
   val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error as expected:

not found: type getCheckpointDirectory
[error]     val getCheckpointDirectory =  new getCheckpointDirectory
[error]                                       ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a
package for my jar file or add a dependency like the following I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Any advice will be appreciated.

Thanks







   



  

Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set,
and the detailed exception information? It would be better to paste your code
and the exception here if applicable, so other members can help you diagnose
the problem.

Thanks
Yanbo

2016-05-12 2:03 GMT-07:00 AlexModestov :

> Hello,
> I have the same problem... Sometimes I get the error: "Py4JError: Answer
> from Java side is empty"
> Sometimes my code works fine but sometimes not...
> Did you find why it might come? What was the reason?
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/ML-regression-spark-context-dies-without-error-tp22633p26938.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ted Yu
For #1, please find examples on the net
e.g.

http://www.scala-sbt.org/0.13/docs/Scala-Files-Example.html

For #2,

import com.databricks.apps.twitter_classifier.getCheckpointDirectory

Cheers

On Sun, Jun 5, 2016 at 8:36 AM, Ashok Kumar  wrote:

> Thank you sir.
>
> At compile time can I do something similar to
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
>
> I have these
>
> name := "scala"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> And if I look at jar file i have
>
>
> jar tvf utilities-assembly-0.1-SNAPSHOT.jar|grep Check
>   1180 Sun Jun 05 10:14:36 BST 2016
> com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
>   1043 Sun Jun 05 10:14:36 BST 2016 getCheckpointDirectory.class
>   1216 Fri Sep 18 09:12:40 BST 2015
> scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask$class.class
>615 Fri Sep 18 09:12:40 BST 2015
> scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask.class
>
> two questions please
>
> What do I need to put in libraryDependencies line
>
> and what do I need to add to the top of scala app like
>
> import java.io.File
> import org.apache.log4j.Logger
> import org.apache.log4j.Level
> import ?
>
> Thanks
>
>
>
>
>
> On Sunday, 5 June 2016, 15:21, Ted Yu  wrote:
>
>
> At compilation time, you need to declare the dependence
> on getCheckpointDirectory.
>
> At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to
> pass the jar.
>
> Cheers
>
> On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar 
> wrote:
>
> Hi all,
>
> Appreciate any advice on this. It is about scala
>
> I have created a very basic Utilities.scala that contains a test class and
> method. I intend to add my own classes and methods as I expand and make
> references to these classes and methods in my other apps
>
> class getCheckpointDirectory {
>   def CheckpointDirectory (ProgramName: String) : String  = {
>  var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
>  return hdfsDir
>   }
> }
> I have used sbt to create a jar file for it. It is created as a jar file
>
> utilities-assembly-0.1-SNAPSHOT.jar
>
> Now I want to make a call to that method CheckpointDirectory in my app
> code myapp.dcala to return the value for hdfsDir
>
>val ProgramName = this.getClass.getSimpleName.trim
>val getCheckpointDirectory =  new getCheckpointDirectory
>val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)
>
> However, I am getting a compilation error as expected
>
> not found: type getCheckpointDirectory
> [error] val getCheckpointDirectory =  new getCheckpointDirectory
> [error]   ^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
>
> So a basic question, in order for compilation to work do I need to create
> a package for my jar file or add dependency like the following I do in sbt
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>
>
> Any advise will be appreciated.
>
> Thanks
>
>
>
>
>
>
>
>


Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Thank you sir.
At compile time can I do something similar to
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"

I have these
name := "scala"
version := "1.0"
scalaVersion := "2.10.4"
And if I look at the jar file I have:

jar tvf utilities-assembly-0.1-SNAPSHOT.jar | grep Check
  1180 Sun Jun 05 10:14:36 BST 2016 com/databricks/apps/twitter_classifier/getCheckpointDirectory.class
  1043 Sun Jun 05 10:14:36 BST 2016 getCheckpointDirectory.class
  1216 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask$class.class
   615 Fri Sep 18 09:12:40 BST 2015 scala/collection/parallel/ParIterableLike$StrictSplitterCheckTask.class

Two questions please:
What do I need to put in the libraryDependencies line,
and what do I need to add to the top of the Scala app, like
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import ?
Thanks 



 

On Sunday, 5 June 2016, 15:21, Ted Yu  wrote:
 

 At compilation time, you need to declare the dependence on 
getCheckpointDirectory.
At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to pass 
the jar.
Cheers
On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar  
wrote:

Hi all,
Appreciate any advice on this. It is about Scala.

I have created a very basic Utilities.scala that contains a test class and
method. I intend to add my own classes and methods as I expand and make
references to these classes and methods in my other apps.

class getCheckpointDirectory {
  def CheckpointDirectory (ProgramName: String) : String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it. It is created as

utilities-assembly-0.1-SNAPSHOT.jar

Now I want to make a call to that method CheckpointDirectory in my app code
myapp.scala to return the value for hdfsDir:

   val ProgramName = this.getClass.getSimpleName.trim
   val getCheckpointDirectory = new getCheckpointDirectory
   val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error as expected:

not found: type getCheckpointDirectory
[error]     val getCheckpointDirectory =  new getCheckpointDirectory
[error]                                       ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a
package for my jar file or add a dependency like the following I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Any advice will be appreciated.

Thanks







  

Re: Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ted Yu
At compilation time, you need to declare the dependence
on getCheckpointDirectory.

At runtime, you can use '--jars utilities-assembly-0.1-SNAPSHOT.jar' to
pass the jar.

Cheers
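
A sketch of how that can be wired up, assuming the assembly jar is kept
locally rather than published to a repository. Dropping it into the project's
lib/ directory is enough for sbt to put it on the compile classpath; the
explicit setting below is only needed for a non-default location:

// build.sbt (sketch)
unmanagedJars in Compile += Attributed.blank(file("jars/utilities-assembly-0.1-SNAPSHOT.jar"))

And at run time, as above: spark-submit --jars utilities-assembly-0.1-SNAPSHOT.jar ...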

On Sun, Jun 5, 2016 at 3:06 AM, Ashok Kumar 
wrote:

> Hi all,
>
> Appreciate any advice on this. It is about scala
>
> I have created a very basic Utilities.scala that contains a test class and
> method. I intend to add my own classes and methods as I expand and make
> references to these classes and methods in my other apps
>
> class getCheckpointDirectory {
>   def CheckpointDirectory (ProgramName: String) : String  = {
>  var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
>  return hdfsDir
>   }
> }
> I have used sbt to create a jar file for it. It is created as a jar file
>
> utilities-assembly-0.1-SNAPSHOT.jar
>
> Now I want to make a call to that method CheckpointDirectory in my app
> code myapp.dcala to return the value for hdfsDir
>
>val ProgramName = this.getClass.getSimpleName.trim
>val getCheckpointDirectory =  new getCheckpointDirectory
>val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)
>
> However, I am getting a compilation error as expected
>
> not found: type getCheckpointDirectory
> [error] val getCheckpointDirectory =  new getCheckpointDirectory
> [error]   ^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
>
> So a basic question, in order for compilation to work do I need to create
> a package for my jar file or add dependency like the following I do in sbt
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>
>
> Any advise will be appreciated.
>
> Thanks
>
>
>
>
>


Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
Actually it is an interesting question. Spark standalone uses the simple
cluster manager that is included with Spark. However, I am not sure that the
simple cluster manager can work out the whereabouts of the datanodes in a
Hadoop cluster. I start YARN together with HDFS, so I don't have this concern.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 5 June 2016 at 14:09, Mich Talebzadeh  wrote:

> I use YARN as I run Hive on Spark engine in yarn-cluster mode plus other
> stuff. if I turn off YARN half of my applications won't work.  I don't see
> great concern for supporting YARN. However you may have other reasons
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 5 June 2016 at 13:40, Marco Capuccini 
> wrote:
>
>> I meant when running in standalone cluster mode, where Hadoop data nodes
>> run on the same nodes where the Spark workers run. I don’t want to support
>> YARN as well in my infrastructure, and since I already set up a standalone
>> Spark cluster, I was wondering if running only HDFS in the same cluster
>> would be enough.
>>
>> Regards
>> Marco
>>
>> On 05 Jun 2016, at 12:17, Mich Talebzadeh 
>> wrote:
>>
>> Well in standalone mode you are running your spark code on one physical
>> node so the assumption would be that there is HDFS node running on the same
>> host.
>>
>> When you are running Spark in yarn-client mode, then Yarn is part of
>> Hadoop core and Yarn will know about the datanodes from
>> %HADOOP_HOME/etc/Hadoop/slaves
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 5 June 2016 at 10:50, Marco Capuccini 
>> wrote:
>>
>>> Dear all,
>>>
>>> Does Spark uses data locality information from HDFS, when running in
>>> standalone mode? Or is it running on YARN mandatory for such purpose? I
>>> can't find this information in the docs, and on Google I am only finding
>>> contrasting opinion on that.
>>>
>>> Regards
>>> Marco Capuccini
>>>
>>
>>
>>
>


Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
I use YARN as I run Hive on the Spark engine in yarn-cluster mode, plus other
stuff. If I turn off YARN, half of my applications won't work. I don't see a
great concern in supporting YARN. However, you may have other reasons.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 5 June 2016 at 13:40, Marco Capuccini 
wrote:

> I meant when running in standalone cluster mode, where Hadoop data nodes
> run on the same nodes where the Spark workers run. I don’t want to support
> YARN as well in my infrastructure, and since I already set up a standalone
> Spark cluster, I was wondering if running only HDFS in the same cluster
> would be enough.
>
> Regards
> Marco
>
> On 05 Jun 2016, at 12:17, Mich Talebzadeh 
> wrote:
>
> Well in standalone mode you are running your spark code on one physical
> node so the assumption would be that there is HDFS node running on the same
> host.
>
> When you are running Spark in yarn-client mode, then Yarn is part of
> Hadoop core and Yarn will know about the datanodes from
> %HADOOP_HOME/etc/Hadoop/slaves
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 5 June 2016 at 10:50, Marco Capuccini 
> wrote:
>
>> Dear all,
>>
>> Does Spark uses data locality information from HDFS, when running in
>> standalone mode? Or is it running on YARN mandatory for such purpose? I
>> can't find this information in the docs, and on Google I am only finding
>> contrasting opinion on that.
>>
>> Regards
>> Marco Capuccini
>>
>
>
>


Caching table partition after join

2016-06-05 Thread Zalzberg, Idan (Agoda)
Hi,

I have a complicated scenario where I can't seem to explain to spark how to 
handle the query in the best way.
I am using spark from the thrift server so only SQL.

To explain the scenario, let's assume:

Table A:
Key : String
Value : String

Table B:
Key: String
Value2: String
Part : String [Partitioning column]

Now, all my queries look like:
Select A.Value, B.Value2 from A join B on (A.Key = B.Key) where B.Part =
"SOMETHING"

I would like to have a cache of the *output* of this query without knowing the
value "SOMETHING" in advance.
I.e. I would imagine I can create a *lazy* view like

Create view MyView as Select A.Value,B.Value2,B.Part from A join B on (A.Key = 
B.Key)

Users will *never* query the whole view, that would be huge and blow up the 
cache. They would always do something like:

Select * from MyView where Part = "SOMETHINGELSE"

I would expect Spark to be able to push down the partition filter on Part and
then cache only that output from the query, basically keeping the view
partitioned by Part in the same manner as B is.

I cannot pre-materialize the whole view since it will be very big, even in 
columnar storage.

Can I achieve this behavior?

Thanks



This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.


Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
I meant when running in standalone cluster mode, where Hadoop data nodes run on 
the same nodes where the Spark workers run. I don’t want to support YARN as 
well in my infrastructure, and since I already set up a standalone Spark 
cluster, I was wondering if running only HDFS in the same cluster would be 
enough.

Regards
Marco

On 05 Jun 2016, at 12:17, Mich Talebzadeh 
> wrote:

Well in standalone mode you are running your spark code on one physical node so 
the assumption would be that there is HDFS node running on the same host.

When you are running Spark in yarn-client mode, then Yarn is part of Hadoop 
core and Yarn will know about the datanodes from %HADOOP_HOME/etc/Hadoop/slaves

HTH

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 5 June 2016 at 10:50, Marco Capuccini 
> wrote:
Dear all,

Does Spark uses data locality information from HDFS, when running in standalone 
mode? Or is it running on YARN mandatory for such purpose? I can't find this 
information in the docs, and on Google I am only finding contrasting opinion on 
that.

Regards
Marco Capuccini




Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
Well in standalone mode you are running your spark code on one physical
node, so the assumption would be that there is an HDFS node running on the
same host.

When you are running Spark in yarn-client mode, then Yarn is part of Hadoop
core and Yarn will know about the datanodes from
%HADOOP_HOME/etc/Hadoop/slaves

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 5 June 2016 at 10:50, Marco Capuccini 
wrote:

> Dear all,
>
> Does Spark uses data locality information from HDFS, when running in
> standalone mode? Or is it running on YARN mandatory for such purpose? I
> can't find this information in the docs, and on Google I am only finding
> contrasting opinion on that.
>
> Regards
> Marco Capuccini
>


Basic question on using one's own classes in the Scala app

2016-06-05 Thread Ashok Kumar
Hi all,
Appreciate any advice on this. It is about Scala.

I have created a very basic Utilities.scala that contains a test class and
method. I intend to add my own classes and methods as I expand and make
references to these classes and methods in my other apps.

class getCheckpointDirectory {
  def CheckpointDirectory (ProgramName: String) : String = {
    var hdfsDir = "hdfs://host:9000/user/user/checkpoint/"+ProgramName
    return hdfsDir
  }
}

I have used sbt to create a jar file for it. It is created as

utilities-assembly-0.1-SNAPSHOT.jar

Now I want to make a call to that method CheckpointDirectory in my app code
myapp.scala to return the value for hdfsDir:

   val ProgramName = this.getClass.getSimpleName.trim
   val getCheckpointDirectory = new getCheckpointDirectory
   val hdfsDir = getCheckpointDirectory.CheckpointDirectory(ProgramName)

However, I am getting a compilation error as expected:

not found: type getCheckpointDirectory
[error]     val getCheckpointDirectory =  new getCheckpointDirectory
[error]                                       ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

So a basic question: in order for compilation to work, do I need to create a
package for my jar file or add a dependency like the following I do in sbt?

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

Any advice will be appreciated.

Thanks





Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
Dear all,

Does Spark uses data locality information from HDFS, when running in standalone 
mode? Or is it running on YARN mandatory for such purpose? I can't find this 
information in the docs, and on Google I am only finding contrasting opinion on 
that.

Regards
Marco Capuccini


Re: Using data frames to join separate RDDs in spark streaming

2016-06-05 Thread Cyril Scetbon
Problem solved by creating only one RDD.
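
For the archive, the transform-based join referred to in the quoted message
below boils down to something like this sketch (the pair types are assumed):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Join every micro-batch of the stream against a static, pre-loaded RDD.
def joinWithStatic(mgs: DStream[(String, String)],
                   rdd1: RDD[(String, String)]): DStream[(String, (String, String))] =
  mgs.transform(batch => batch.join(rdd1))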
> On Jun 1, 2016, at 14:05, Cyril Scetbon  wrote:
> 
> It seems that to join a DStream with a RDD I can use :
> 
> mgs.transform(rdd => rdd.join(rdd1))
> 
> or
> 
> mgs.foreachRDD(rdd => rdd.join(rdd1))
> 
> But, I can't see why rdd1.toDF("id","aid") really causes SPARK-5063
> 
> 
>> On Jun 1, 2016, at 12:00, Cyril Scetbon  wrote:
>> 
>> Hi guys,
>> 
>> I have a 2 input data streams that I want to join using Dataframes and 
>> unfortunately I get the message produced by 
>> https://issues.apache.org/jira/browse/SPARK-5063 as I can't reference rdd1  
>> in (2) :
>> 
>> (1)
>> val rdd1 = sc.esRDD(es_resource.toLowerCase, query)
>>.map(r => (r._1, r._2))
>> 
>> (2)
>> mgs.map(x => x._1)
>>  .foreachRDD { rdd =>
>>val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
>>import sqlContext.implicits._
>> 
>>val df_aids = rdd.toDF("id")
>> 
>>val df = rdd1.toDF("id", "aid")
>> 
>>df.select(explode(df("aid")).as("aid"), df("id"))
>>   .join(df_aids, $"aid" === df_aids("id"))
>>   .select(df("id"), df_aids("id"))
>>  .
>>  }
>> 
>> Is there a way to still use Dataframes to do it or I need to do everything 
>> using RDDs join only ?
>> And If I need to use only RDDs join, how to do it ? as I have a RDD (rdd1) 
>> and a DStream (mgs) ?
>> 
>> Thanks
>> -- 
>> Cyril SCETBON
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Re: Spark SQL Nested Array of JSON with empty field

2016-06-05 Thread Ewan Leith
The Spark JSON reader is unforgiving of things like missing elements in some
JSON records, or mixed types.

If you want to pass such JSON files through Spark, you're best off doing an
initial parse through the Jackson APIs using a defined schema first; you can
use types like Option[String] where a column is optional, then convert the
validated records back into a new string variable and read that string as a
DataFrame.

Thanks,
Ewan
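
If the only problem is the optional field, a lighter-weight alternative to a
full Jackson pre-pass is to hand Spark the schema instead of letting it infer
one; a sketch, with column names taken from the example below and everything
else assumed:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def readPeople(sqlContext: SQLContext, path: String) = {
  val schema = StructType(Seq(
    StructField("firstname", StringType),
    StructField("middlename", StringType),   // nullable by default, so it may be absent
    StructField("lastname", StringType),
    StructField("address", StructType(Seq(
      StructField("state", StringType),
      StructField("city", StringType))))))

  // Records without "middlename" simply get a null in that column.
  sqlContext.read.schema(schema).json(path)
}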

On 3 Jun 2016 22:03, Jerry Wong  wrote:
Hi,

I met a problem with an empty field in a nested JSON file with Spark SQL. For
instance, there are two lines of a JSON file as follows:

{
"firstname": "Jack",
"lastname": "Nelson",
"address": {
"state": "New York",
"city": "New York"
}
}{
"firstname": "Landy",
"middlename": "Ken",
"lastname": "Yong",
"address": {
"state": "California",
"city": "Los Angles"
}
}

I use Spark SQL to query the files like:
val row = sqlContext.sql("SELECT firstname, middlename, lastname,
address.state, address.city FROM jsontable")
Compilation then gives me the error for line 1: no "middlename".
How do I handle this case in the SQL?

Many thanks in advance!
Jerry