Meetup in Taiwan

2017-06-25 Thread Yang Bryan
Hi,

I'm Bryan, the co-founder of the Taiwan Spark User Group.
We discuss and share information on https://www.facebook.com/groups/spark.tw/,
and we hold a physical meetup twice a month.
Please help us get listed on the official website.

We will also hold a coding competition about Spark; could we print the Spark
logo on the certificate of participation?

If you have any questions or suggestions, please feel free to let me know.
Thank you.

Best Regards,
Bryan


RE: HDP 2.5 - Python - Spark-On-Hbase

2017-06-25 Thread Mahesh Sawaiker
Ayan,
The Logging class was moved to an internal (private) package between Spark 1.6 and
Spark 2.0, so it looks like you are running code built for 1.6 on 2.0. I have ported
code like this before: if you have access to the source, you can recompile it by
replacing references to the Logging class with direct use of the slf4j Logger class;
most of the code tends to be easily portable.
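If you do recompile, the change is usually mechanical. A rough sketch, assuming a class
that previously mixed in org.apache.spark.Logging (the class name and log message below
are illustrative):

import org.slf4j.LoggerFactory

// Before (Spark 1.6): class MyJob extends Serializable with Logging { ... logInfo("...") ... }
// After  (Spark 2.x): hold an slf4j Logger directly instead of mixing in Logging.
class MyJob extends Serializable {
  // @transient + lazy so the logger is recreated on executors instead of being serialized
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  def run(): Unit = {
    log.info("starting job")
  }
}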

Following is the release note for Spark 2.0

Removals, Behavior Changes and Deprecations
Removals
The following features have been removed in Spark 2.0:

  *   Bagel
  *   Support for Hadoop 2.1 and earlier
  *   The ability to configure closure serializer
  *   HTTPBroadcast
  *   TTL-based metadata cleaning
  *   Semi-private class org.apache.spark.Logging. We suggest you use slf4j 
directly.
  *   SparkContext.metricsSystem
Thanks,
Mahesh


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, June 26, 2017 6:26 AM
To: Weiqing Yang
Cc: user
Subject: Re: HDP 2.5 - Python - Spark-On-Hbase

Hi

I am using following:

--packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories 
http://repo.hortonworks.com/content/groups/public/

Is it compatible with Spark 2.X? I would like to use it

Best
Ayan

On Sat, Jun 24, 2017 at 2:09 AM, Weiqing Yang 
> wrote:
Yes.
What SHC version you were using?
If hitting any issues, you can post them in SHC github issues. There are some 
threads about this.

On Fri, Jun 23, 2017 at 5:46 AM, ayan guha 
> wrote:
Hi

Is it possible to use SHC from Hortonworks with pyspark? If so, any working 
code sample available?

Also, I faced an issue while running the samples with Spark 2.0

"Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging"

Any workaround?

Thanks in advance

--
Best Regards,
Ayan Guha




--
Best Regards,
Ayan Guha


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
ok.. for plain sql, I've no idea other than defining a udaf
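For reference, a minimal sketch of what defining a UDAF looks like in the Spark 2.x API
(a plain sum, purely illustrative; the class and registered name are made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MySumUDAF extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }

  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// Register it and it becomes callable from plain SQL:
// spark.udf.register("my_sum", new MySumUDAF)
// spark.sql("SELECT key, my_sum(value) FROM t GROUP BY key")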



On Mon, Jun 26, 2017 at 10:59 AM, jeff saremi 
wrote:

> My specific and immediate need is this: We have a native function wrapped
> in JNI. To increase performance we'd like to avoid calling it record by
> record. mapPartitions() give us the ability to invoke this in bulk. We're
> looking for a similar approach in SQL.
>
>
> --
> *From:* Ryan 
> *Sent:* Sunday, June 25, 2017 7:18:32 PM
> *To:* jeff saremi
> *Cc:* user@spark.apache.org
> *Subject:* Re: What is the equivalent of mapPartitions in SpqrkSQL?
>
> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>


Re: Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
We are also doing transformations; that's the reason for using Spark Streaming.
Does Spark Streaming support tumbling windows? I was thinking I could use a
window operation for writing into HDFS.
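For reference, a window whose length equals its slide interval behaves as a tumbling
window. A rough sketch of that idea, assuming a 1-minute batch interval (the source,
host/port and output path are illustrative; a Kafka DStream works the same way):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("TumblingWindowToHdfs")
val ssc = new StreamingContext(conf, Minutes(1))   // 1-minute batches

val lines = ssc.socketTextStream("host", 9999)

// Tumbling window: window length == slide interval == 30 minutes,
// so each record is written exactly once, every 30 minutes.
lines.window(Minutes(30), Minutes(30)).foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"hdfs:///data/output/${time.milliseconds}")
  }
}

ssc.start()
ssc.awaitTermination()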

Thanks

On Sun, Jun 25, 2017 at 10:23 PM, ayan guha  wrote:

> I would suggest to use Flume, if possible, as it has in built HDFS log
> rolling capabilities
>
> On Mon, Jun 26, 2017 at 1:09 PM, Naveen Madhire 
> wrote:
>
>> Hi,
>>
>> I am using spark streaming with 1 minute duration to read data from kafka
>> topic, apply transformations and persist into HDFS.
>>
>> The application is creating a new directory every 1 minute with many
>> partition files(= nbr of partitions). What parameter should I need to
>> change/configure to persist and create a HDFS directory say *every 30
>> minutes* instead of duration of the spark streaming application?
>>
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> Naveen
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark streaming persist to hdfs question

2017-06-25 Thread ayan guha
I would suggest using Flume, if possible, as it has built-in HDFS log
rolling capabilities.

On Mon, Jun 26, 2017 at 1:09 PM, Naveen Madhire 
wrote:

> Hi,
>
> I am using spark streaming with 1 minute duration to read data from kafka
> topic, apply transformations and persist into HDFS.
>
> The application is creating a new directory every 1 minute with many
> partition files(= nbr of partitions). What parameter should I need to
> change/configure to persist and create a HDFS directory say *every 30
> minutes* instead of duration of the spark streaming application?
>
>
> Any help would be appreciated.
>
> Thanks,
> Naveen
>
>
>


-- 
Best Regards,
Ayan Guha


Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
Hi,

I am using Spark Streaming with a 1-minute batch duration to read data from a
Kafka topic, apply transformations, and persist into HDFS.

The application is creating a new directory every minute, with many partition
files (= number of partitions). What parameter do I need to change/configure to
persist and create an HDFS directory, say, *every 30 minutes* instead of at the
batch duration of the Spark Streaming application?


Any help would be appreciated.

Thanks,
Naveen


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread jeff saremi
My specific and immediate need is this: we have a native function wrapped in
JNI. To increase performance, we'd like to avoid calling it record by record.
mapPartitions() gives us the ability to invoke it in bulk. We're looking for a
similar approach in SQL.
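For what it's worth, one sketch of getting partition-level control while staying on the
Dataset side rather than plain SQL: Dataset.mapPartitions hands over an iterator per
partition, so the JNI wrapper can be invoked once per batch. The Record/Scored classes,
table name and nativeBatch function below are hypothetical stand-ins:

import org.apache.spark.sql.SparkSession

case class Record(id: Long, value: Double)
case class Scored(id: Long, score: Double)

val spark = SparkSession.builder().appName("mapPartitionsSketch").getOrCreate()
import spark.implicits._

// Hypothetical wrapper that scores a whole batch in one native (JNI) call.
def nativeBatch(batch: Seq[Record]): Seq[Scored] =
  batch.map(r => Scored(r.id, r.value * 2))  // placeholder logic

val ds = spark.table("atable").as[Record]

// One native invocation per chunk of a partition instead of one per record.
val scored = ds.mapPartitions { it =>
  it.grouped(10000).flatMap(chunk => nativeBatch(chunk))
}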



From: Ryan 
Sent: Sunday, June 25, 2017 7:18:32 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: What is the equivalent of mapPartitions in SpqrkSQL?

Why would you like to do so? I think there's no need for us to explicitly ask 
for a forEachPartition in spark sql because tungsten is smart enough to figure 
out whether a sql operation could be applied on each partition or there has to 
be a shuffle.

On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:

You can do a map() using a select and functions/UDFs. But how do you process a 
partition using SQL?




Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-25 Thread Ryan
I have to say sorry: I checked the code again, and Broadcast is serializable and
should be usable within lambdas/inner classes. Actually, according to the
javadoc, it should be used in exactly this way, to avoid serializing the large
contained value itself.

So what's wrong with the first approach?
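For reference, a minimal Scala sketch of that direct approach (the Java
ForeachPartitionFunction version captures the Broadcast handle the same way); the table
name and broadcast value are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BroadcastSketch").getOrCreate()

// Only the small Broadcast handle is captured by the closure; .value fetches the
// broadcast content on the executors, so 123 could just as well be a large object.
val bcv = spark.sparkContext.broadcast(123)

val dfSql = spark.sql("select * from atable")
dfSql.foreachPartition { rows =>
  println(bcv.value)
  rows.foreach(_ => ())  // process the partition's rows here
}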

On Sat, Jun 24, 2017 at 4:46 AM, Anton Kravchenko <
kravchenko.anto...@gmail.com> wrote:

> ok, this one is doing what I want
>
> SparkConf conf = new SparkConf()
> .set("spark.sql.warehouse.dir", 
> "hdfs://localhost:9000/user/hive/warehouse")
> .setMaster("local[*]")
> .setAppName("TestApp");
>
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> SparkSession session = SparkSession
> .builder()
> .appName("TestApp").master("local[*]")
> .getOrCreate();
>
> Integer _bcv = 123;
> Broadcast<Integer> bcv = sc.broadcast(_bcv);
>
> WrapBCV.setBCV(bcv); // implemented in WrapBCV.java
>
> df_sql.foreachPartition(new ProcessSinglePartition()); // implemented in
> ProcessSinglePartition.java
>
> Where:
> ProcessSinglePartition.java
>
> public class ProcessSinglePartition implements ForeachPartitionFunction<Row> {
>     public void call(Iterator<Row> it) throws Exception {
>         System.out.println(WrapBCV.getBCV());
>     }
> }
>
> WrapBCV.java
>
> public class WrapBCV {
>     private static Broadcast<Integer> bcv;
>     public static void setBCV(Broadcast<Integer> setbcv) { bcv = setbcv; }
>     public static Integer getBCV() {
>         return bcv.value();
>     }
> }
>
>
> On Fri, Jun 16, 2017 at 3:35 AM, Ryan  wrote:
>
>> I don't think Broadcast itself can be serialized. you can get the value
>> out on the driver side and refer to it in foreach, then the value would be
>> serialized with the lambda expr and sent to workers.
>>
>> On Fri, Jun 16, 2017 at 2:29 AM, Anton Kravchenko <
>> kravchenko.anto...@gmail.com> wrote:
>>
>>> How would one access a broadcast variable from within
>>> ForeachPartitionFunction in the Spark (2.0.1) Java API?
>>>
>>> Integer _bcv = 123;
>>> Broadcast<Integer> bcv = spark.sparkContext().broadcast(_bcv);
>>> Dataset<Row> df_sql = spark.sql("select * from atable");
>>>
>>> df_sql.foreachPartition(new ForeachPartitionFunction<Row>() {
>>>     public void call(Iterator<Row> t) throws Exception {
>>>         System.out.println(bcv.value());
>>>     }}
>>> );
>>>
>>>
>>
>


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Do you mean you'd like to partition the data by a specific key?

If we issue a CLUSTER BY/repartition, the following operation needn't shuffle;
it's effectively the same as a per-partition operation, I think.

Or we could always get the underlying RDD from the Dataset and translate the
SQL operation into a function...

On Mon, Jun 26, 2017 at 10:24 AM, Stephen Boesch  wrote:

> Spark SQL did not support explicit partitioners even before tungsten: and
> often enough this did hurt performance.  Even now Tungsten will not do the
> best job every time: so the question from the OP is still germane.
>
> 2017-06-25 19:18 GMT-07:00 Ryan :
>
>> Why would you like to do so? I think there's no need for us to explicitly
>> ask for a forEachPartition in spark sql because tungsten is smart enough to
>> figure out whether a sql operation could be applied on each partition or
>> there has to be a shuffle.
>>
>> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
>> wrote:
>>
>>> You can do a map() using a select and functions/UDFs. But how do you
>>> process a partition using SQL?
>>>
>>>
>>>
>>
>


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before Tungsten, and
often enough this did hurt performance. Even now Tungsten will not do the
best job every time, so the question from the OP is still germane.

2017-06-25 19:18 GMT-07:00 Ryan :

> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Why would you like to do so? I think there's no need for us to explicitly
ask for a foreachPartition in Spark SQL, because Tungsten is smart enough to
figure out whether a SQL operation can be applied within each partition or
whether there has to be a shuffle.

On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
wrote:

> You can do a map() using a select and functions/UDFs. But how do you
> process a partition using SQL?
>
>
>


Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-25 Thread ayan guha
Hi

I am using the following:

--packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories
http://repo.hortonworks.com/content/groups/public/

Is it compatible with Spark 2.x? I would like to use it.

Best
Ayan

On Sat, Jun 24, 2017 at 2:09 AM, Weiqing Yang 
wrote:

> Yes.
> What SHC version you were using?
> If hitting any issues, you can post them in SHC github issues. There are
> some threads about this.
>
> On Fri, Jun 23, 2017 at 5:46 AM, ayan guha  wrote:
>
>> Hi
>>
>> Is it possible to use SHC from Hortonworks with pyspark? If so, any
>> working code sample available?
>>
>> Also, I faced an issue while running the samples with Spark 2.0
>>
>> "Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging"
>>
>> Any workaround?
>>
>> Thanks in advance
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha


Re: Problem in avg function Spark 1.6.3 using spark-shell

2017-06-25 Thread Riccardo Ferrari
Hi,

It looks like you have already performed an aggregation on the ImageWidth
column. The error itself is quite self-explanatory:

Cannot resolve column name "ImageWidth" among (MainDomainCode,
*avg(length(ImageWidth))*)

The columns available in that DataFrame are MainDomainCode and
avg(length(ImageWidth)), so you should use alias to rename the aggregated
column back:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
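Concretely, a small sketch of that fix, reusing the column names from the snippet below:

import org.apache.spark.sql.functions.avg

val average = secondDf.filter("ImageWidth > 1 and ImageHeight > 1")
  .groupBy("MainDomainCode")
  .agg(
    avg("ImageWidth").alias("ImageWidth"),
    avg("ImageHeight").alias("ImageHeight"),
    avg("ImageArea").alias("ImageArea"))

// Downstream code can now refer to average("ImageWidth") etc. again.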

best,

On Sun, Jun 25, 2017 at 1:19 PM, Eko Susilo 
wrote:

> Hi,
>
> I have a data frame collection called “secondDf” when I tried to perform
> groupBy and then sum of each column it works perfectly. However when I
> tried to calculate average of that column it says the column name is not
> found. The details are as follow
>
> val total = secondDf.filter("ImageWidth > 1 and ImageHeight > 1").
>groupBy("MainDomainCode").
>agg(sum("ImageWidth"),
>sum("ImageHeight"),
>sum("ImageArea”))
>
>
> total.show  will show result as expected, However when I tried to
> calculate avg, the result is script error. Any help to resolve this issue?
>
> Regards,
> Eko
>
>
>   val average = secondDf.filter("ImageWidth > 1 and ImageHeight > 1").
>groupBy("MainDomainCode").
>agg(avg("ImageWidth"),
>avg("ImageHeight"),
>avg("ImageArea"))
>
>
> org.apache.spark.sql.AnalysisException: Cannot resolve column name
> "ImageWidth" among (MainDomainCode, avg(length(ImageWidth)));
> at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.
> apply(DataFrame.scala:152)
> at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.
> apply(DataFrame.scala:152)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
> at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
> at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$
> $iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.
> (:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<
> init>(:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:59)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:61)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:63)
> at $iwC$$iwC$$iwC$$iwC.(:65)
> at $iwC$$iwC$$iwC.(:67)
> at $iwC$$iwC.(:69)
> at $iwC.(:71)
> at (:73)
> at .(:77)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(
> SparkIMain.scala:1065)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(
> SparkIMain.scala:1346)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(
> SparkILoop.scala:857)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(
> SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(
> SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(
> SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(
> SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(
> SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(
> SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(
> SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(
> SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(
> SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$
> repl$SparkILoop$$loop(SparkILoop.scala:670)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$
> apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$
> apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$
> 

Re: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-25 Thread Mich Talebzadeh
This typically works ok for standalone mode with moderate resources

${SPARK_HOME}/bin/spark-submit \
--driver-memory 6G \
--executor-memory 2G \
--num-executors 2 \
--executor-cores 2 \
--master spark://50.140.197.217:7077 \
--conf "spark.scheduler.mode=FAIR" \
--conf "spark.ui.port=5" \


HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 June 2017 at 17:30, Rastogi, Pankaj 
wrote:

> Please make sure that you have enough memory available on the driver node.
> If there is not enough free memory on the driver node, then your
> application won’t start.
>
> Pankaj
>
> From: vaquar khan 
> Date: Saturday, June 10, 2017 at 5:02 PM
> To: Abdulfattah Safa 
> Cc: User 
> Subject: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory >
> Executor Memory
>
> You can add memory in your command; make sure the given memory is available
> on your executors:
>
> ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master spark://207.184.161.138:7077 \
>   --executor-memory 20G \
>   --total-executor-cores 100 \
>   /path/to/examples.jar \
>   1000
>
>
> https://spark.apache.org/docs/1.1.0/submitting-applications.html
> 
>
> Also try to avoid function need memory like collect etc.
>
>
> Regards,
> Vaquar khan
>
>
> On Jun 4, 2017 5:46 AM, "Abdulfattah Safa"  wrote:
>
> I'm working on Spark in standalone cluster mode. I need to increase the
> driver memory as I got an OOM in the driver thread. I found that when
> setting the driver memory to be greater than the executor memory, the
> submitted job is stuck at SUBMITTED and the application never starts.
>
>
>


Re: Question on Spark code

2017-06-25 Thread kant kodali
Impressive! I need to learn more about Scala.

What I mean by stripping away the conditional check in Java is this:

static final boolean isLogInfoEnabled = false;

public void logMessage(String message) {
    if (isLogInfoEnabled) {
        log.info(message);
    }
}

If you look at the bytecode, the dead "if" check will be removed.
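For comparison, a minimal Scala sketch (not from Spark itself) of the by-name parameter
Sean mentioned; the message expression is simply never evaluated unless logging is enabled:

object LazyLogDemo {
  @volatile var infoEnabled: Boolean = false

  // msg is a by-name parameter: the expression is only evaluated where msg is
  // used, i.e. never while infoEnabled is false.
  def logInfo(msg: => String): Unit = {
    if (infoEnabled) println(msg)
  }

  def main(args: Array[String]): Unit = {
    logInfo {
      println("building expensive message")  // never printed while infoEnabled is false
      "expensive " + "message"
    }
  }
}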








On Sun, Jun 25, 2017 at 12:46 PM, Sean Owen  wrote:

> I think it's more precise to say args like any expression are evaluated
> when their value is required. It's just that this special syntax causes
> extra code to be generated that makes it effectively a function passed, not
> value, and one that's lazily evaluated. Look at the bytecode if you're
> curious.
>
> An if conditional is pretty trivial to evaluate here. I don't think that
> guidance is sound. The point is that it's not worth complicating the caller
> code in almost all cases by checking the guard condition manually.
>
> I'm not sure what you're referring to, but no, no compiler can elide these
> conditions. They're based on runtime values that can change at runtime.
>
> scala has an @elidable annotation which you can use to indicate to the
> compiler that the declaration should be entirely omitted when configured to
> elide above a certain detail level. This is how scalac elides assertions if
> you ask it to and you can do it to your own code. But this is something
> different, not what's happening here, and a fairly niche/advanced feature.
>
>
> On Sun, Jun 25, 2017 at 8:25 PM kant kodali  wrote:
>
>> @Sean Got it! I come from Java world so I guess I was wrong in assuming
>> that arguments are evaluated during the method invocation time. How about
>> the conditional checks to see if the log is InfoEnabled or DebugEnabled?
>> For Example,
>>
>> if (log.isInfoEnabled) log.info(msg)
>>
>> I hear we should use guard condition only when the string "msg"
>> construction is expensive otherwise we will be taking a performance hit
>> because of the additional "if" check unless the "log" itself is declared
>> static final and scala compiler will strip away the "if" check and produce
>> efficient byte code. Also log.info does the log.isInfoEnabled check
>> inside the body prior to logging the messages.
>>
>> https://github.com/qos-ch/slf4j/blob/master/slf4j-
>> simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L509
>> https://github.com/qos-ch/slf4j/blob/master/slf4j-
>> simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L599
>>
>> Please correct me if I am wrong.
>>
>>
>>
>>
>> On Sun, Jun 25, 2017 at 3:04 AM, Sean Owen  wrote:
>>
>>> Maybe you are looking for declarations like this. "=> String" means the
>>> arg isn't evaluated until it's used, which is just what you want with log
>>> statements. The message isn't constructed unless it will be logged.
>>>
>>> protected def logInfo(msg: => String) {
>>>
>>>
>>> On Sun, Jun 25, 2017 at 10:28 AM kant kodali  wrote:
>>>
 Hi All,

 I came across this file https://github.com/
 apache/spark/blob/master/core/src/main/scala/org/apache/
 spark/internal/Logging.scala and I am wondering what is the purpose of
 this? Especially it doesn't prevent any string concatenation and also the
 if checks are already done by the library itself right?

 Thanks!


>>


Re: Question on Spark code

2017-06-25 Thread Sean Owen
I think it's more precise to say args like any expression are evaluated
when their value is required. It's just that this special syntax causes
extra code to be generated that makes it effectively a function passed, not
value, and one that's lazily evaluated. Look at the bytecode if you're
curious.

An if conditional is pretty trivial to evaluate here. I don't think that
guidance is sound. The point is that it's not worth complicating the caller
code in almost all cases by checking the guard condition manually.

I'm not sure what you're referring to, but no, no compiler can elide these
conditions. They're based on runtime values that can change at runtime.

scala has an @elidable annotation which you can use to indicate to the
compiler that the declaration should be entirely omitted when configured to
elide above a certain detail level. This is how scalac elides assertions if
you ask it to and you can do it to your own code. But this is something
different, not what's happening here, and a fairly niche/advanced feature.


On Sun, Jun 25, 2017 at 8:25 PM kant kodali  wrote:

> @Sean Got it! I come from Java world so I guess I was wrong in assuming
> that arguments are evaluated during the method invocation time. How about
> the conditional checks to see if the log is InfoEnabled or DebugEnabled?
> For Example,
>
> if (log.isInfoEnabled) log.info(msg)
>
> I hear we should use guard condition only when the string "msg"
> construction is expensive otherwise we will be taking a performance hit
> because of the additional "if" check unless the "log" itself is declared
> static final and scala compiler will strip away the "if" check and produce
> efficient byte code. Also log.info does the log.isInfoEnabled check
> inside the body prior to logging the messages.
>
>
> https://github.com/qos-ch/slf4j/blob/master/slf4j-simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L509
>
> https://github.com/qos-ch/slf4j/blob/master/slf4j-simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L599
>
> Please correct me if I am wrong.
>
>
>
>
> On Sun, Jun 25, 2017 at 3:04 AM, Sean Owen  wrote:
>
>> Maybe you are looking for declarations like this. "=> String" means the
>> arg isn't evaluated until it's used, which is just what you want with log
>> statements. The message isn't constructed unless it will be logged.
>>
>> protected def logInfo(msg: => String) {
>>
>>
>> On Sun, Jun 25, 2017 at 10:28 AM kant kodali  wrote:
>>
>>> Hi All,
>>>
>>> I came across this file
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala
>>> and I am wondering what is the purpose of this? Especially it doesn't
>>> prevent any string concatenation and also the if checks are already done by
>>> the library itself right?
>>>
>>> Thanks!
>>>
>>>
>


How to Fill Sparse Data With the Previous Non-Empty Value in SPARQL Dataset

2017-06-25 Thread Carlo . Allocca
Dear All,

I need to apply a Dataset transformation that replaces null values with the
previous non-null value. As an example, I report the following:

from:

id | col1
---------
 1 | null
 1 | null
 2 | 4
 2 | null
 2 | null
 3 | 5
 3 | null
 3 | null

to:

id | col1
---------
 1 | null
 1 | null
 2 | 4
 2 | 4
 2 | 4
 3 | 5
 3 | 5
 3 | 5

I am using Spark SQL 2 and the Dataset API.

I searched on Google but only found solutions in the context of databases, e.g.
(https://blog.jooq.org/2015/12/17/how-to-fill-sparse-data-with-the-previous-non-empty-value-in-sql/).

Please, any help on how to implement this in Spark? I understood that I should
use window functions and lag, but I cannot put them together.
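For reference, a minimal sketch of the window-based approach, assuming an explicit
ordering column (called ts here, since Datasets have no inherent row order) and an
input DataFrame df with the columns above; last(..., ignoreNulls = true) over a running
window does the fill. Window.unboundedPreceding/currentRow are Spark 2.1+; on 2.0 use
Long.MinValue and 0L:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Running window per id, from the first row up to and including the current row.
val w = Window.partitionBy("id")
  .orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// For each row, take the last non-null col1 seen so far within its id group.
val filled = df.withColumn("col1", last("col1", ignoreNulls = true).over(w))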


Thank you in advance for your help.

Best Regards,
Carlo










Re: Question on Spark code

2017-06-25 Thread kant kodali
@Sean Got it! I come from the Java world, so I guess I was wrong in assuming
that arguments are evaluated at method-invocation time. How about the
conditional checks to see whether the log is info-enabled or debug-enabled?
For example:

if (log.isInfoEnabled) log.info(msg)

I hear we should use a guard condition only when constructing the string "msg"
is expensive; otherwise we take a performance hit because of the additional
"if" check, unless the "log" itself is declared static final and the Scala
compiler strips away the "if" check and produces efficient bytecode. Also,
log.info does the log.isInfoEnabled check inside its body prior to logging the
message.

https://github.com/qos-ch/slf4j/blob/master/slf4j-simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L509
https://github.com/qos-ch/slf4j/blob/master/slf4j-simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L599

Please correct me if I am wrong.




On Sun, Jun 25, 2017 at 3:04 AM, Sean Owen  wrote:

> Maybe you are looking for declarations like this. "=> String" means the
> arg isn't evaluated until it's used, which is just what you want with log
> statements. The message isn't constructed unless it will be logged.
>
> protected def logInfo(msg: => String) {
>
>
> On Sun, Jun 25, 2017 at 10:28 AM kant kodali  wrote:
>
>> Hi All,
>>
>> I came across this file https://github.com/apache/spark/blob/master/core/
>> src/main/scala/org/apache/spark/internal/Logging.scala and I am
>> wondering what is the purpose of this? Especially it doesn't prevent any
>> string concatenation and also the if checks are already done by the library
>> itself right?
>>
>> Thanks!
>>
>>


Re: How does HashPartitioner distribute data in Spark?

2017-06-25 Thread Russell Spitzer
A clearer explanation:

`parallelize` does not apply a partitioner. We can see this with a quick
code example:

scala> val rdd1 = sc.parallelize(Seq(("aa" , 1),("aa",2), ("aa", 3)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at
parallelize at :24

scala> rdd1.partitioner
res0: Option[org.apache.spark.Partitioner] = None

It has no partitioner because parallelize just packs up the collection
into partition metadata without looking at the actual contents of the
collection.

scala> rdd1.foreachPartition(it => println(it.length))
1
0
1
1
0
0
0
0

If we actually shuffle the data using the hash partitioner (using the
repartition command) we get the expected results.

scala> rdd1.repartition(8).foreachPartition(it => println(it.length))
0
0
0
0
0
0
0
3
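For completeness, a small sketch (not part of the original transcript) of what happens
once a HashPartitioner is actually applied via partitionBy, which only works on
key-value RDDs:

import org.apache.spark.HashPartitioner

// Explicitly hash-partition the key-value RDD from the example above.
val byKey = rdd1.partitionBy(new HashPartitioner(8))

// Now a partitioner is present, and since every record has the key "aa",
// they all land in the single partition that "aa".hashCode % 8 points to:
// one partition holds 3 elements, the other seven hold 0.
byKey.foreachPartition(it => println(it.length))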


On Sat, Jun 24, 2017 at 12:22 PM Russell Spitzer 
wrote:

> Neither of your code examples invoke a repartitioning. Add in a
> repartition command.
>
> On Sat, Jun 24, 2017, 11:53 AM Vikash Pareek <
> vikash.par...@infoobjects.com> wrote:
>
>> Hi Vadim,
>>
>> Thank you for your response.
>>
>> I would like to know how partitioner choose the key, If we look at my
>> example then following question arises:
>> 1. In case of rdd1, hash partitioning should calculate hashcode of key
>> (i.e. *"aa"* in this case), so *all records should go to single
>> partition*
>> instead of uniform distribution?
>>  2. In case of rdd2, there is no key value pair so how hash partitoning
>> going to work i.e. *what is the key* to calculate hashcode?
>>
>>
>>
>> Best Regards,
>>
>>
>> Vikash Pareek
>> Team Lead  *InfoObjects Inc.*
>> Big Data Analytics
>>
>> m: +91 8800206898 <+91%2088002%2006898> a: E5, Jhalana Institutionall
>> Area, Jaipur, Rajasthan 302004
>> w: www.linkedin.com/in/pvikash e: vikash.par...@infoobjects.com
>>
>>
>>
>>
>> On Fri, Jun 23, 2017 at 10:38 PM, Vadim Semenov <
>> vadim.seme...@datadoghq.com> wrote:
>>
>>> This is the code that chooses the partition for a key:
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88
>>>
>>> it's basically `math.abs(key.hashCode % numberOfPartitions)`
>>>
>>> On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
>>> vikash.par...@infoobjects.com> wrote:
>>>
 I am trying to understand how spark partitoing works.

 To understand this I have following piece of code on spark 1.6

 def countByPartition1(rdd: RDD[(String, Int)]) = {
 rdd.mapPartitions(iter => Iterator(iter.length))
 }
 def countByPartition2(rdd: RDD[String]) = {
 rdd.mapPartitions(iter => Iterator(iter.length))
 }

 //RDDs Creation
 val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1),
 ("aa",
 1)), 8)
 countByPartition(rdd1).collect()
 >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

 val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
 countByPartition(rdd2).collect()
 >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

 In both the cases data is distributed uniformaly.
 I do have following questions on the basis of above observation:

  1. In case of rdd1, hash partitioning should calculate hashcode of key
 (i.e. "aa" in this case), so all records should go to single partition
 instead of uniform distribution?
  2. In case of rdd2, there is no key value pair so how hash partitoning
 going to work i.e. what is the key to calculate hashcode?

 I have followed @zero323 answer but not getting answer of these.

 https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work




 -

 __Vikash Pareek
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-does-HashPartitioner-distribute-data-in-Spark-tp28785.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>


What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread jeff saremi
You can do a map() using a select and functions/UDFs. But how do you process a 
partition using SQL?



Re: the compile of spark stoped without any hints, would you like help me please?

2017-06-25 Thread Ted Yu
Does adding -X to the mvn command give you more information?

Cheers

On Sun, Jun 25, 2017 at 5:29 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> Today I use new PC to compile SPARK.
> At the beginning, it worked well.
> But it stop at some point.
> the content in consle is :
> 
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-shade-plugin:2.4.3:shade (default) @ spark-parent_2.11 ---
> [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
> jar.
> [INFO] Replacing original artifact with shaded artifact.
> [INFO]
> [INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project Tags 2.1.2-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @
> spark-tags_2.11 ---
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
> spark-tags_2.11 ---
> [INFO] Add Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\main\scala
> [INFO] Add Test Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\test\scala
> [INFO]
> [INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @
> spark-tags_2.11 ---
> [INFO] Dependencies classpath:
> C:\Users\shaof\.m2\repository\org\spark-project\spark\
> unused\1.0.0\unused-1.0.0.jar;C:\Users\shaof\.m2\repository\
> org\scala-lang\scala-library\2.11.8\scala-library-2.11.8.jar
> [INFO]
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
> spark-tags_2.11 ---
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> spark-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\main\resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 2 Scala sources and 6 Java sources to
> E:\spark\fromweb\spark-branch-2.1\common\tags\target\scala-2.11\classes...
> [WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
> [WARNING] 1 warning
> [INFO]
> [INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @
> spark-tags_2.11 ---
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 6 source files to E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\scala-2.11\classes
> [INFO]
> [INFO] --- maven-antrun-plugin:1.8:run (create-tmp-dir) @ spark-tags_2.11
> ---
> [INFO] Executing tasks
>
> main:
> [INFO] Executed tasks
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:testResources
> (default-testResources) @ spark-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\test\resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) @ spark-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 3 Java sources to E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\scala-2.11\test-classes...
> [WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
> [WARNING] 1 warning
> [INFO]
> [INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile)
> @ spark-tags_2.11 ---
> [INFO] Nothing to compile - all classes are up to date
> [INFO]
> [INFO] --- maven-dependency-plugin:2.10:build-classpath
> (generate-test-classpath) @ spark-tags_2.11 ---
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @
> spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (test) @ spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (test) @ spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
> spark-tags_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT-tests.jar
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ spark-tags_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT.jar
> [INFO]
> [INFO] --- 

Re: issue about the windows slice of stream

2017-06-25 Thread 萝卜丝炒饭
Hi all,


Let me add more info about this.
The log showed:
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
the slice time is wrong.


For my test code:

lines.countByValueAndWindow(Seconds(2), Seconds(8)).foreachRDD( s => {   <=== here the
windowDuration is 2 seconds and the slideDuration is 8 seconds.

=== key log begin ===
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window =
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)   <=== here,
the old RDDs slice from 1498383077000 to 1498383084000. That is 8 seconds;
actually it should be 2 seconds.
=== key log end ===
=== code in ReducedWindowedDStream.scala begin ===

  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
    val reduceF = reduceFunc
    val invReduceF = invReduceFunc

    val currentTime = validTime
    val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,
      currentTime)
    val previousWindow = currentWindow - slideDuration

    logDebug("Window time = " + windowDuration)
    logDebug("Slide time = " + slideDuration)
    logDebug("Zero time = " + zeroTime)
    logDebug("Current window = " + currentWindow)
    logDebug("Previous window = " + previousWindow)

    //  _____________________________
    // |  previous window   _________|___________________
    // |___________________|       current window        |  --> Time
    //                     |_____________________________|
    //
    // |________ _________|          |________ _________|
    //          |                             |
    //          V                             V
    //       old RDDs                       new RDDs
    //

    // Get the RDDs of the reduced values in "old time steps"
    val oldRDDs =
      reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
      <=== I think this line is
      "reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime + windowDuration - parent.slideDuration)"
    logDebug("# old RDDs = " + oldRDDs.size)

    // Get the RDDs of the reduced values in "new time steps"
    val newRDDs =
      reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
      <=== this line is
      "reducedStream.slice(previousWindow.endTime + windowDuration - parent.slideDuration, currentWindow.endTime)"
    logDebug("# new RDDs = " + newRDDs.size)

=== code in ReducedWindowedDStream.scala end ===



Thanks
Fei Shao
 
---Original---
From: "??"<1427357...@qq.com>
Date: 2017/6/24 14:51:52
To: "user";"dev";
Subject: issue about the windows slice of stream


Hi all,

I found an issue with the window slicing of DStreams.
My code is:

val ssc = new StreamingContext(conf, Seconds(1))

val content = ssc.socketTextStream("ip", port)
content.countByValueAndWindow(Seconds(2), Seconds(8)).foreachRDD(rdd => rdd.foreach(println))

The key point is that the slide duration is greater than the window duration.
I checked the output: the result printed from foreachRDD was wrong.
I found the stream was being sliced incorrectly.
Can I open a JIRA please?


thanks
Fei Shao

Re: Can we access files on Cluster mode

2017-06-25 Thread sudhir k
Thank you. I guess I have to use a common mount or S3 to access those files.

On Sun, Jun 25, 2017 at 4:42 AM Mich Talebzadeh 
wrote:

> Thanks. In my experience certain distros like Cloudera only support yarn
> client mode so AFAIK the driver stays on the Edge node. Happy to be
> corrected :)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 June 2017 at 10:37, Anastasios Zouzias  wrote:
>
>> Hi Mich,
>>
>> If the driver starts on the edge node with cluster mode, then I don't see
>> the difference between client and cluster deploy mode.
>>
>> In cluster mode, it is the responsibility of the resource manager (yarn,
>> etc) to decide where to run the driver (at least for spark 1.6 this is what
>> I have experienced).
>>
>> Best,
>> Anastasios
>>
>> On Sun, Jun 25, 2017 at 11:14 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Anastasios.
>>>
>>> Are you implying that in Yarn cluster mode even if you submit your Spark
>>> application on an Edge node the driver can start on any node. I was under
>>> the impression that the driver starts from the Edge node? and the executors
>>> can be on any node in the cluster (where Spark agents are running)?
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 25 June 2017 at 09:39, Anastasios Zouzias  wrote:
>>>
 Just to note that in cluster mode the spark driver might run on any
 node of the cluster, hence you need to make sure that the file exists on
 *all* nodes. Push the file on all nodes or use client deploy-mode.

 Best,
 Anastasios


 Am 24.06.2017 23:24 schrieb "Holden Karau" :

> addFile is supposed to not depend on a shared FS unless the semantics
> have changed recently.
>
> On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri 
> wrote:
>
>> Hi Sudhir,
>>
>> I believe you have to use a shared file system that is accessed by all
>> nodes.
>>
>>
>> On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:
>>
>>
>> I am new to Spark and i need some guidance on how to fetch files from
>> --files option on Spark-Submit.
>>
>> I read on some forums that we can fetch the files from
>> Spark.getFiles(fileName) and can use it in our code and all nodes should
>> read it.
>>
>> But i am facing some issue
>>
>> Below is the command i am using
>>
>> spark-submit --deploy-mode cluster --class com.check.Driver --files
>> /home/sql/first.sql test.jar 20170619
>>
>> so when i use SparkFiles.get(first.sql) , i should be able to read
>> the file Path but it is throwing File not Found exception.
>>
>> I tried SpackContext.addFile(/home/sql/first.sql) and then
>> SparkFiles.get(first.sql) but still the same error.
>>
>> Its working on the stand alone mode but not on cluster mode. Any help
>> is appreciated.. Using Spark 2.1.0 and Scala 2.11
>>
>> Thanks.
>>
>>
>> Regards,
>> Sudhir K
>>
>>
>>
>> --
>> Regards,
>> Sudhir K
>>
>>
>> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>

>>>
>>
>>
>> --
>> -- Anastasios Zouzias
>> 
>>
>
> --
Sent from Gmail Mobile


the compile of spark stoped without any hints, would you like help me please?

2017-06-25 Thread 萝卜丝炒饭
Hi all,


Today I used a new PC to compile Spark.
At the beginning, it worked well,
but it stopped at some point.
The content in the console is:

[INFO]
[INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @ spark-parent_2.11 
---
[INFO]
[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ 
spark-parent_2.11 ---
[INFO]
[INFO] --- maven-shade-plugin:2.4.3:shade (default) @ spark-parent_2.11 ---
[INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar.
[INFO] Replacing original artifact with shaded artifact.
[INFO]
[INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @ 
spark-parent_2.11 ---
[INFO]
[INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @ 
spark-parent_2.11 ---
[INFO]
[INFO] 
[INFO] Building Spark Project Tags 2.1.2-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ 
spark-tags_2.11 ---
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @ 
spark-tags_2.11 ---
[INFO] Add Source directory: 
E:\spark\fromweb\spark-branch-2.1\common\tags\src\main\scala
[INFO] Add Test Source directory: 
E:\spark\fromweb\spark-branch-2.1\common\tags\src\test\scala
[INFO]
[INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @ 
spark-tags_2.11 ---
[INFO] Dependencies classpath:
C:\Users\shaof\.m2\repository\org\spark-project\spark\unused\1.0.0\unused-1.0.0.jar;C:\Users\shaof\.m2\repository\org\scala-lang\scala-library\2.11.8\scala-library-2.11.8.jar
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
spark-tags_2.11 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
spark-tags_2.11 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
E:\spark\fromweb\spark-branch-2.1\common\tags\src\main\resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
spark-tags_2.11 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal 
incremental compile
[INFO] Using incremental compilation
[INFO] Compiling 2 Scala sources and 6 Java sources to 
E:\spark\fromweb\spark-branch-2.1\common\tags\target\scala-2.11\classes...
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 1 warning
[INFO]
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ 
spark-tags_2.11 ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 6 source files to 
E:\spark\fromweb\spark-branch-2.1\common\tags\target\scala-2.11\classes
[INFO]
[INFO] --- maven-antrun-plugin:1.8:run (create-tmp-dir) @ spark-tags_2.11 ---
[INFO] Executing tasks


main:
[INFO] Executed tasks
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
spark-tags_2.11 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
E:\spark\fromweb\spark-branch-2.1\common\tags\src\test\resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first) @ 
spark-tags_2.11 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal 
incremental compile
[INFO] Using incremental compilation
[INFO] Compiling 3 Java sources to 
E:\spark\fromweb\spark-branch-2.1\common\tags\target\scala-2.11\test-classes...
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 1 warning
[INFO]
[INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile) @ 
spark-tags_2.11 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-dependency-plugin:2.10:build-classpath 
(generate-test-classpath) @ spark-tags_2.11 ---
[INFO]
[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ spark-tags_2.11 
---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-surefire-plugin:2.19.1:test (test) @ spark-tags_2.11 ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (test) @ spark-tags_2.11 ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @ spark-tags_2.11 
---
[INFO] Building jar: 
E:\spark\fromweb\spark-branch-2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT-tests.jar
[INFO]
[INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ spark-tags_2.11 ---
[INFO] Building jar: 
E:\spark\fromweb\spark-branch-2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT.jar
[INFO]
[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ 
spark-tags_2.11 ---
[INFO]
[INFO] --- maven-shade-plugin:2.4.3:shade (default) @ spark-tags_2.11 ---
[INFO] Excluding org.scala-lang:scala-library:jar:2.11.8 from the shaded jar.
[INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar.
[INFO] Replacing original artifact 

Re: Could you please add a book info on Spark website?

2017-06-25 Thread Md. Rezaul Karim
Thanks, Sean. I will ask them to do so.







Regards,
_
*Md. Rezaul Karim*, BSc, MSc, PhD
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


On 25 June 2017 at 12:39, Sean Owen  wrote:

> Please get Packt to fix their existing PR. It's been open for months
> https://github.com/apache/spark-website/pull/35
>
> On Sun, Jun 25, 2017 at 12:33 PM Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi Sean,
>>
>> Last time, you helped me add a book info (in the books section) on this
>> page https://spark.apache.org/documentation.html.
>>
>> Could you please add another book info. Here's necessary information
>> about the book:
>>
>> *Title*: Scala and Spark for Big Data Analytics
>> *Authors*: Md. Rezaul Karim, Sridhar Alla
>> *Publisher*: Packt Publishing
>> *URL*: https://www.packtpub.com/big-data-and-business-
>> intelligence/scala-and-spark-big-data-analytics
>>
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc, PhD
>> Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> 
>>
>


Re: Could you please add a book info on Spark website?

2017-06-25 Thread Sean Owen
Please get Packt to fix their existing PR. It's been open for months
https://github.com/apache/spark-website/pull/35

On Sun, Jun 25, 2017 at 12:33 PM Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi Sean,
>
> Last time, you helped me add a book info (in the books section) on this
> page https://spark.apache.org/documentation.html.
>
> Could you please add another book info. Here's necessary information about
> the book:
>
> *Title*: Scala and Spark for Big Data Analytics
> *Authors*: Md. Rezaul Karim, Sridhar Alla
> *Publisher*: Packt Publishing
> *URL*:
> https://www.packtpub.com/big-data-and-business-intelligence/scala-and-spark-big-data-analytics
>
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc, PhD
> Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>


Could you please add a book info on Spark website?

2017-06-25 Thread Md. Rezaul Karim
Hi Sean,

Last time, you helped me add a book info (in the books section) on this
page https://spark.apache.org/documentation.html.

Could you please add another book info. Here's necessary information about
the book:

*Title*: Scala and Spark for Big Data Analytics
*Authors*: Md. Rezaul Karim, Sridhar Alla
*Publisher*: Packt Publishing
*URL*:
https://www.packtpub.com/big-data-and-business-intelligence/scala-and-spark-big-data-analytics





Regards,
_
*Md. Rezaul Karim*, BSc, MSc, PhD
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Problem in avg function Spark 1.6.3 using spark-shell

2017-06-25 Thread Eko Susilo
Hi,

I have a DataFrame called "secondDf". When I perform a groupBy and then a sum
of each column, it works perfectly. However, when I try to calculate the
average of a column, it says the column name is not found. The details are as
follows:

val total = secondDf.filter("ImageWidth > 1 and ImageHeight > 1").
  groupBy("MainDomainCode").
  agg(sum("ImageWidth"),
      sum("ImageHeight"),
      sum("ImageArea"))


total.show shows the result as expected. However, when I try to calculate the
avg, I get the error below. Any help resolving this issue?

Regards,
Eko


  val average = secondDf.filter("ImageWidth > 1 and ImageHeight > 1").
    groupBy("MainDomainCode").
    agg(avg("ImageWidth"),
        avg("ImageHeight"),
        avg("ImageArea"))


org.apache.spark.sql.AnalysisException: Cannot resolve column name "ImageWidth" 
among (MainDomainCode, avg(length(ImageWidth)));
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
at $iwC$$iwC$$iwC.<init>(<console>:67)
at $iwC$$iwC.<init>(<console>:69)
at $iwC.<init>(<console>:71)
at <init>(<console>:73)
at .<init>(<console>:77)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at 
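
A minimal sketch (not from the original thread) of one way to sidestep the resolution 
problem, assuming the same secondDf as above. The column list in the exception 
(MainDomainCode, avg(length(ImageWidth))) suggests avg may have been applied to an 
already-aggregated frame, so computing the sums and averages in a single agg over the 
filtered frame, with explicit aliases, keeps every reference against the raw columns:

// Hedged sketch; the alias names are illustrative, not from the thread.
import org.apache.spark.sql.functions.{sum, avg}

val stats = secondDf.
  filter("ImageWidth > 1 and ImageHeight > 1").
  groupBy("MainDomainCode").
  agg(sum("ImageWidth").alias("sumImageWidth"),
      avg("ImageWidth").alias("avgImageWidth"),
      sum("ImageHeight").alias("sumImageHeight"),
      avg("ImageHeight").alias("avgImageHeight"),
      avg("ImageArea").alias("avgImageArea"))

stats.show()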

Re: Question on Spark code

2017-06-25 Thread Sean Owen
Maybe you are looking for declarations like this. "=> String" means the arg
isn't evaluated until it's used, which is just what you want with log
statements. The message isn't constructed unless it will be logged.

protected def logInfo(msg: => String) {
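
A tiny self-contained sketch (not Spark's actual Logging implementation) showing the
effect of that by-name parameter:

object LazyLogDemo {
  var infoEnabled = false

  // msg is a by-name parameter: the expression passed in is evaluated only
  // if and when msg is referenced inside the method body.
  def logInfo(msg: => String): Unit = {
    if (infoEnabled) println(msg)
  }

  def main(args: Array[String]): Unit = {
    // Because infoEnabled is false, the expensive concatenation below
    // is never executed.
    logInfo("state: " + (1 to 1000000).mkString(","))
  }
}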

On Sun, Jun 25, 2017 at 10:28 AM kant kodali  wrote:

> Hi All,
>
> I came across this file
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala
> and I am wondering what the purpose of this is. In particular, it doesn't
> prevent any string concatenation, and the if checks are already done by
> the library itself, right?
>
> Thanks!
>
>


Re: Can we access files on Cluster mode

2017-06-25 Thread Mich Talebzadeh
Thanks. In my experience, certain distros like Cloudera only support YARN
client mode, so AFAIK the driver stays on the Edge node. Happy to be
corrected :)



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 June 2017 at 10:37, Anastasios Zouzias  wrote:

> Hi Mich,
>
> If the driver starts on the edge node with cluster mode, then I don't see
> the difference between client and cluster deploy mode.
>
> In cluster mode, it is the responsibility of the resource manager (yarn,
> etc) to decide where to run the driver (at least for spark 1.6 this is what
> I have experienced).
>
> Best,
> Anastasios
>
> On Sun, Jun 25, 2017 at 11:14 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Anastasios.
>>
>> Are you implying that in YARN cluster mode, even if you submit your Spark
>> application from an Edge node, the driver can start on any node? I was under
>> the impression that the driver starts on the Edge node, and that the executors
>> can be on any node in the cluster (where Spark agents are running)?
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 June 2017 at 09:39, Anastasios Zouzias  wrote:
>>
>>> Just to note that in cluster mode the spark driver might run on any node
>>> of the cluster, hence you need to make sure that the file exists on *all*
>>> nodes. Push the file on all nodes or use client deploy-mode.
>>>
>>> Best,
>>> Anastasios
>>>
>>>
>>> Am 24.06.2017 23:24 schrieb "Holden Karau" :
>>>
 addFile is supposed to not depend on a shared FS unless the semantics
 have changed recently.

 On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri 
 wrote:

> Hi Sudhir,
>
> I believe you have to use a shared file system that is accessed by all
> nodes.
>
>
> On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:
>
>
> I am new to Spark and i need some guidance on how to fetch files from
> --files option on Spark-Submit.
>
> I read on some forums that we can fetch the files from
> Spark.getFiles(fileName) and can use it in our code and all nodes should
> read it.
>
> But i am facing some issue
>
> Below is the command i am using
>
> spark-submit --deploy-mode cluster --class com.check.Driver --files
> /home/sql/first.sql test.jar 20170619
>
> so when i use SparkFiles.get(first.sql) , i should be able to read the
> file Path but it is throwing File not Found exception.
>
> I tried SparkContext.addFile(/home/sql/first.sql) and then
> SparkFiles.get(first.sql) but still the same error.
>
> Its working on the stand alone mode but not on cluster mode. Any help
> is appreciated.. Using Spark 2.1.0 and Scala 2.11
>
> Thanks.
>
>
> Regards,
> Sudhir K
>
>
>
> --
> Regards,
> Sudhir K
>
>
> --
 Cell : 425-233-8271 <(425)%20233-8271>
 Twitter: https://twitter.com/holdenkarau

>>>
>>
>
>
> --
> -- Anastasios Zouzias
> 
>


Re: Question on Spark code

2017-06-25 Thread Herman van Hövell tot Westerflier
I am not getting the question. The Logging trait does exactly what it says
on the box; I don't see what string concatenation has to do with it.

On Sun, Jun 25, 2017 at 11:27 AM, kant kodali  wrote:

> Hi All,
>
> I came across this file https://github.com/apache/spark/blob/master/core/
> src/main/scala/org/apache/spark/internal/Logging.scala and I am wondering
> what the purpose of this is. In particular, it doesn't prevent any string
> concatenation, and the if checks are already done by the library itself,
> right?
>
> Thanks!
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com




Re: Can we access files on Cluster mode

2017-06-25 Thread Anastasios Zouzias
Hi Mich,

If the driver starts on the edge node with cluster mode, then I don't see
the difference between client and cluster deploy mode.

In cluster mode, it is the responsibility of the resource manager (yarn,
etc) to decide where to run the driver (at least for spark 1.6 this is what
I have experienced).

Best,
Anastasios

On Sun, Jun 25, 2017 at 11:14 AM, Mich Talebzadeh  wrote:

> Hi Anastasios.
>
> Are you implying that in YARN cluster mode, even if you submit your Spark
> application from an Edge node, the driver can start on any node? I was under
> the impression that the driver starts on the Edge node, and that the executors
> can be on any node in the cluster (where Spark agents are running)?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 June 2017 at 09:39, Anastasios Zouzias  wrote:
>
>> Just to note that in cluster mode the spark driver might run on any node
>> of the cluster, hence you need to make sure that the file exists on *all*
>> nodes. Push the file on all nodes or use client deploy-mode.
>>
>> Best,
>> Anastasios
>>
>>
>> Am 24.06.2017 23:24 schrieb "Holden Karau" :
>>
>>> addFile is supposed to not depend on a shared FS unless the semantics
>>> have changed recently.
>>>
>>> On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri 
>>> wrote:
>>>
 Hi Sudhir,

 I believe you have to use a shared file system that is accessed by all
 nodes.


 On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:


 I am new to Spark and i need some guidance on how to fetch files from
 --files option on Spark-Submit.

 I read on some forums that we can fetch the files from
 Spark.getFiles(fileName) and can use it in our code and all nodes should
 read it.

 But i am facing some issue

 Below is the command i am using

 spark-submit --deploy-mode cluster --class com.check.Driver --files
 /home/sql/first.sql test.jar 20170619

 so when i use SparkFiles.get(first.sql) , i should be able to read the
 file Path but it is throwing File not Found exception.

 I tried SparkContext.addFile(/home/sql/first.sql) and then
 SparkFiles.get(first.sql) but still the same error.

 Its working on the stand alone mode but not on cluster mode. Any help
 is appreciated.. Using Spark 2.1.0 and Scala 2.11

 Thanks.


 Regards,
 Sudhir K



 --
 Regards,
 Sudhir K


 --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>


-- 
-- Anastasios Zouzias



Question on Spark code

2017-06-25 Thread kant kodali
Hi All,

I came across this file
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala
and I am wondering what the purpose of this is. In particular, it doesn't
prevent any string concatenation, and the if checks are already done by
the library itself, right?

Thanks!


Re: Can we access files on Cluster mode

2017-06-25 Thread Mich Talebzadeh
Hi Anastasios.

Are you implying that in YARN cluster mode, even if you submit your Spark
application from an Edge node, the driver can start on any node? I was under
the impression that the driver starts on the Edge node, and that the executors
can be on any node in the cluster (where Spark agents are running)?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 June 2017 at 09:39, Anastasios Zouzias  wrote:

> Just to note that in cluster mode the spark driver might run on any node
> of the cluster, hence you need to make sure that the file exists on *all*
> nodes. Push the file on all nodes or use client deploy-mode.
>
> Best,
> Anastasios
>
>
> Am 24.06.2017 23:24 schrieb "Holden Karau" :
>
>> addFile is supposed to not depend on a shared FS unless the semantics
>> have changed recently.
>>
>> On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri 
>> wrote:
>>
>>> Hi Sudhir,
>>>
>>> I believe you have to use a shared file system that is accessed by all
>>> nodes.
>>>
>>>
>>> On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:
>>>
>>>
>>> I am new to Spark and i need some guidance on how to fetch files from
>>> --files option on Spark-Submit.
>>>
>>> I read on some forums that we can fetch the files from
>>> Spark.getFiles(fileName) and can use it in our code and all nodes should
>>> read it.
>>>
>>> But i am facing some issue
>>>
>>> Below is the command i am using
>>>
>>> spark-submit --deploy-mode cluster --class com.check.Driver --files
>>> /home/sql/first.sql test.jar 20170619
>>>
>>> so when i use SparkFiles.get(first.sql) , i should be able to read the
>>> file Path but it is throwing File not Found exception.
>>>
>>> I tried SparkContext.addFile(/home/sql/first.sql) and then
>>> SparkFiles.get(first.sql) but still the same error.
>>>
>>> Its working on the stand alone mode but not on cluster mode. Any help is
>>> appreciated.. Using Spark 2.1.0 and Scala 2.11
>>>
>>> Thanks.
>>>
>>>
>>> Regards,
>>> Sudhir K
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Sudhir K
>>>
>>>
>>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>


Re: Can we access files on Cluster mode

2017-06-25 Thread Anastasios Zouzias
Just to note that in cluster mode the spark driver might run on any node of
the cluster, hence you need to make sure that the file exists on *all*
nodes. Push the file on all nodes or use client deploy-mode.

Best,
Anastasios

Am 24.06.2017 23:24 schrieb "Holden Karau" :

> addFile is supposed to not depend on a shared FS unless the semantics have
> changed recently.
>
> On Sat, Jun 24, 2017 at 11:55 AM varma dantuluri 
> wrote:
>
>> Hi Sudhir,
>>
>> I believe you have to use a shared file system that is accessed by all
>> nodes.
>>
>>
>> On Jun 24, 2017, at 1:30 PM, sudhir k  wrote:
>>
>>
>> I am new to Spark and i need some guidance on how to fetch files from
>> --files option on Spark-Submit.
>>
>> I read on some forums that we can fetch the files from
>> Spark.getFiles(fileName) and can use it in our code and all nodes should
>> read it.
>>
>> But i am facing some issue
>>
>> Below is the command i am using
>>
>> spark-submit --deploy-mode cluster --class com.check.Driver --files
>> /home/sql/first.sql test.jar 20170619
>>
>> so when i use SparkFiles.get(first.sql) , i should be able to read the
>> file Path but it is throwing File not Found exception.
>>
>> I tried SparkContext.addFile(/home/sql/first.sql) and then
>> SparkFiles.get(first.sql) but still the same error.
>>
>> Its working on the stand alone mode but not on cluster mode. Any help is
>> appreciated.. Using Spark 2.1.0 and Scala 2.11
>>
>> Thanks.
>>
>>
>> Regards,
>> Sudhir K
>>
>>
>>
>> --
>> Regards,
>> Sudhir K
>>
>>
>> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
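
A hedged sketch of the addFile/SparkFiles.get pattern discussed above; the path
/tmp/first.sql, the app name and the toy RDD are illustrative, not from the thread.
addFile on the driver ships the file to every node, and SparkFiles.get resolves the
local copy by its base name (not by the original path) wherever the task runs. On
YARN, files shipped with spark-submit --files are typically also localized into each
container's working directory, so reading them by bare file name is another option.

// Sketch only; e.g. submitted with:
//   spark-submit --deploy-mode cluster --class AddFileDemo app.jar
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}
import scala.io.Source

object AddFileDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addfile-demo"))
    sc.addFile("/tmp/first.sql")            // path must exist where the driver runs

    val firstLines = sc.parallelize(1 to 4, 4).mapPartitions { _ =>
      // Runs on the executors: resolve the shipped copy by base name only.
      val localPath = SparkFiles.get("first.sql")
      Iterator(Source.fromFile(localPath).getLines().take(1).mkString)
    }.collect()

    firstLines.foreach(println)
    sc.stop()
  }
}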


RDD and DataFrame persistent memory usage

2017-06-25 Thread Ashok Kumar
 Gurus,
I understand that when we create an RDD in Spark it is immutable.
So I have a few questions, please:

   - When an RDD is created, is that just a pointer? Most Spark operations are 
lazy, so nothing is materialised until an action is run against that RDD?
   - When a DataFrame is created from an RDD, does that take additional memory 
for the DataFrame? Again, is it only with an action that both the RDD and the 
DataFrame built from it are materialised?
   - There are some references saying that as you chain operations and create 
new DataFrames, you consume more and more memory without releasing it back?
   - What will happen if I do df.unpersist? I know that it shifts the DataFrame 
from memory (cache) to disk. Will that reduce memory overhead?
   - Is it a good idea to unpersist to reduce memory overhead?


Thanking you
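
A small sketch touching the last two questions, assuming a Spark 1.6-style sqlContext
and an illustrative DataFrame: persist(MEMORY_AND_DISK) is the call that allows cached
partitions to spill to disk, while unpersist() drops the cached blocks (memory and disk)
so the data is recomputed from its lineage if it is used again.

// Hedged sketch; the range DataFrame and the storage level are illustrative.
import org.apache.spark.storage.StorageLevel

val df = sqlContext.range(0L, 1000000L)        // assumes an existing SQLContext

df.persist(StorageLevel.MEMORY_AND_DISK)       // cache in memory, spill to disk if needed
df.count()                                     // an action: materialises df and fills the cache

df.unpersist()                                 // removes the cached blocks and frees that storage;
                                               // df itself remains usable and will be recomputed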