RE: Add jar files on classpath when submitting tasks to Spark

2016-11-02 Thread Jan Botorek
Thank you for the example.
I am able to submit the task when using the --jars parameter as follows:

spark-submit --class com.infor.skyvault.tests.LinearRegressionTest --master local --jars path/to/jar/one;path/to/jar/two C:\_resources\spark-1.0-SNAPSHOT.jar -DtrainDataPath="/path/to/model/data"

But I would like to find out why the setting of the spark.driver.extraClassPath 
attribute in spark-defaults.conf is not applied when submitting the task.
In our scenario, let's assume that all workers (currently only one worker) have 
the spark.driver.extraClassPath attribute set to the same path and that the folder 
on all workers contains the same .jar files.
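
One thing I still want to re-check on our side is the form of the classpath entry itself. If I understand the JVM classpath rules correctly, a bare directory only picks up .class files, and a trailing wildcard is needed to pick up the jars inside it; i.e. something along these lines (the path is illustrative, not our real one, and forward slashes are used deliberately to stay on the safe side of the properties-file parsing):

spark.driver.extraClassPath     C:/path/to/external/jars/*
spark.executor.extraClassPath   C:/path/to/external/jars/*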

Thank you for your help,

Regards,
Jan

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, November 1, 2016 3:22 PM
To: Jan Botorek <jan.boto...@infor.com>
Cc: Vinod Mangipudi <vinod...@gmail.com>; user <user@spark.apache.org>
Subject: Re: Add jar files on classpath when submitting tasks to Spark

If you are using local mode then there is only one JVM. On Linux, mine looks like this:

${SPARK_HOME}/bin/spark-submit \
--packages ${PACKAGES} \
--driver-memory 8G \
--num-executors 1 \
--executor-memory 8G \
--master local[12] \
--conf "${SCHEDULER}" \
--conf "${EXTRAJAVAOPTIONS}" \
--jars ${JARS} \
--class "${FILE_NAME}" \
--conf "${SPARKUIPORT}" \
--conf "${SPARKDRIVERPORT}" \
--conf "${SPARKFILESERVERPORT}" \
--conf "${SPARKBLOCKMANAGERPORT}" \
--conf "${SPARKKRYOSERIALIZERBUFFERMAX}" \
${JAR_FILE}

These parameters are defined below

function default_settings {
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export SCHEDULER="spark.scheduler.mode=FAIR"
export EXTRAJAVAOPTIONS="spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
export JARS="/home/hduser/jars/spark-streaming-kafka-assembly_2.11-1.6.1.jar"
export SPARKUIPORT="spark.ui.port=5"
export SPARKDRIVERPORT="spark.driver.port=54631"
export SPARKFILESERVERPORT="spark.fileserver.port=54731"
export SPARKBLOCKMANAGERPORT="spark.blockManager.port=54832"
export SPARKKRYOSERIALIZERBUFFERMAX="spark.kryoserializer.buffer.max=512"
}

and the other jar files are passed through --jars. Note that ${JAR_FILE} in my 
case is built with Maven or SBT.
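
If in doubt, you can also check from inside the job what actually ended up on the driver. A minimal check for spark-shell or driver code, assuming an active SparkContext called sc:

// what Spark itself resolved (prints None if the property was never set)
println(sc.getConf.getOption("spark.driver.extraClassPath"))
println(sc.getConf.getOption("spark.jars"))
// raw JVM classpath of the driver process
println(System.getProperty("java.class.path"))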

HTH



Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 1 November 2016 at 14:02, Jan Botorek 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Yes, exactly.
My (testing) run script is:
spark-submit --class com.infor.skyvault.tests.LinearRegressionTest --master 
local C:\_resources\spark-1.0-SNAPSHOT.jar -DtrainDataPath="/path/to/model/data"



From: Mich Talebzadeh 
[mailto:mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>]
Sent: Tuesday, November 1, 2016 2:51 PM
To: Jan Botorek <jan.boto...@infor.com<mailto:jan.boto...@infor.com>>
Cc: Vinod Mangipudi <vinod...@gmail.com<mailto:vinod...@gmail.com>>; user 
<user@spark.apache.org<mailto:user@spark.apache.org>>

Subject: Re: Add jar files on classpath when submitting tasks to Spark

Are you submitting your job through spark-submit?


Dr Mich Talebzadeh



On 1 November 2016 at 13:39, Jan Botorek 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Hello,
Unfortunately, this approach doesn't work for job submission in my case. It works 
in the shell, but not when the job is submitted.
I made sure the (only) worker node has the desired directory.

Neither specifying all jars individually as you suggested nor using /path/to/jarfiles/* works.

Could you please verify that with these settings you are able to submit jobs with the corresponding dependencies?

From: Mich Taleb

RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Yes, exactly.
My (testing) run script is:
spark-submit --class com.infor.skyvault.tests.LinearRegressionTest --master 
local C:\_resources\spark-1.0-SNAPSHOT.jar -DtrainDataPath="/path/to/model/data"



From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, November 1, 2016 2:51 PM
To: Jan Botorek <jan.boto...@infor.com>
Cc: Vinod Mangipudi <vinod...@gmail.com>; user <user@spark.apache.org>
Subject: Re: Add jar files on classpath when submitting tasks to Spark

Are you submitting your job through spark-submit?


Dr Mich Talebzadeh



On 1 November 2016 at 13:39, Jan Botorek 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Hello,
Unfortunately, this approach doesn't work for job submission in my case. It works 
in the shell, but not when the job is submitted.
I made sure the (only) worker node has the desired directory.

Neither specifying all jars individually as you suggested nor using /path/to/jarfiles/* works.

Could you please verify that with these settings you are able to submit jobs with the corresponding dependencies?

From: Mich Talebzadeh 
[mailto:mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>]
Sent: Tuesday, November 1, 2016 2:18 PM
To: Vinod Mangipudi <vinod...@gmail.com<mailto:vinod...@gmail.com>>

Cc: user <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Add jar files on classpath when submitting tasks to Spark

you can do that as long as every node has the directory referenced.

For example

spark.driver.extraClassPath   /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar
spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar

this will work as long as all nodes have that directory.

The other alternative is to mount the shared directory as an NFS mount across all 
the nodes so that all the nodes can read from that shared directory.
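
The same two properties can also be passed per job on the command line instead of (or in addition to) spark-defaults.conf. A sketch only; the class name, master URL and application jar below are placeholders:

spark-submit --class com.acme.MyApp --master spark://master:7077 \
  --driver-class-path /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar \
  --conf spark.executor.extraClassPath=/home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar \
  /home/hduser/app/myapp.jar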

HTH






Dr Mich Talebzadeh



On 1 November 2016 at 13:04, Vinod Mangipudi 
<vinod...@gmail.com<mailto:vinod...@gmail.com>> wrote:
unsubscribe

On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Thank you for the reply.
I am aware of the parameters used when submitting the tasks (--jars is working 
for us).

But isn't there any way to specify a location (directory) for jars globally, i.e. in spark-defaults.conf?


From: ayan guha [mailto:guha.a...@gmail.com<mailto:guha.a...@gmail.com>]
Sent: Tuesday, November 1, 2016 1:49 PM
To: Jan Botorek <jan.boto...@infor.com<mailto:jan.boto...@infor.com>>
Cc: user <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Add jar files on classpath when submitting tasks to Spark


There are options to specify external jars, such as --jars and --driver-class-path, 
depending on the Spark version and cluster manager. Please see the Spark documentation 
on configuration and/or run spark-submit --help to see the available options.
On 1 Nov 2016 23:13, "Jan Botorek" 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Hello,
I have a problem trying to make jar files available on the classpath when 
submitting a task to Spark.

In my spark-defaults.conf file I have the configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
With this, all jars in the folder are available in spark-shell.

The problem is that the jars are not on the classpath for the Spark master; more 
precisely, when I submit any job that uses a jar from the external folder, 
java.lang.ClassNotFoundException is thrown.
Moving all external jars into Spark's jars folder solves the problem, but we need 
to keep the external files separate.

Thank you for any help
Best regards,
Jan





RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Hello,
Unfortunately, this approach doesn't work for job submission in my case. It works 
in the shell, but not when the job is submitted.
I made sure the (only) worker node has the desired directory.

Neither specifying all jars individually as you suggested nor using /path/to/jarfiles/* works.

Could you please verify that with these settings you are able to submit jobs with the corresponding dependencies?

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, November 1, 2016 2:18 PM
To: Vinod Mangipudi <vinod...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Add jar files on classpath when submitting tasks to Spark

you can do that as long as every node has the directory referenced.

For example

spark.driver.extraClassPath   /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar
spark.executor.extraClassPath /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar

this will work as long as all nodes have that directory.

The other alternative is to mount the shared directory as an NFS mount across all 
the nodes so that all the nodes can read from that shared directory.

HTH






Dr Mich Talebzadeh



On 1 November 2016 at 13:04, Vinod Mangipudi 
<vinod...@gmail.com<mailto:vinod...@gmail.com>> wrote:
unsubscribe

On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Thank you for the reply.
I am aware of the parameters used when submitting the tasks (--jars is working 
for us).

But isn't there any way to specify a location (directory) for jars globally, i.e. in spark-defaults.conf?


From: ayan guha [mailto:guha.a...@gmail.com<mailto:guha.a...@gmail.com>]
Sent: Tuesday, November 1, 2016 1:49 PM
To: Jan Botorek <jan.boto...@infor.com<mailto:jan.boto...@infor.com>>
Cc: user <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Add jar files on classpath when submitting tasks to Spark


There are options to specify external jars, such as --jars and --driver-class-path, 
depending on the Spark version and cluster manager. Please see the Spark documentation 
on configuration and/or run spark-submit --help to see the available options.
On 1 Nov 2016 23:13, "Jan Botorek" 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Hello,
I have a problem trying to make jar files available on the classpath when 
submitting a task to Spark.

In my spark-defaults.conf file I have the configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
With this, all jars in the folder are available in spark-shell.

The problem is that the jars are not on the classpath for the Spark master; more 
precisely, when I submit any job that uses a jar from the external folder, 
java.lang.ClassNotFoundException is thrown.
Moving all external jars into Spark's jars folder solves the problem, but we need 
to keep the external files separate.

Thank you for any help
Best regards,
Jan




RE: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Thank you for the reply.
I am aware of the parameters used when submitting the tasks (--jars is working 
for us).

But isn't there any way to specify a location (directory) for jars globally, i.e. in spark-defaults.conf?


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, November 1, 2016 1:49 PM
To: Jan Botorek <jan.boto...@infor.com>
Cc: user <user@spark.apache.org>
Subject: Re: Add jar files on classpath when submitting tasks to Spark


There are options to specify external jars, such as --jars and --driver-class-path, 
depending on the Spark version and cluster manager. Please see the Spark documentation 
on configuration and/or run spark-submit --help to see the available options.
On 1 Nov 2016 23:13, "Jan Botorek" 
<jan.boto...@infor.com<mailto:jan.boto...@infor.com>> wrote:
Hello,
I have a problem trying to make jar files available on the classpath when 
submitting a task to Spark.

In my spark-defaults.conf file I have the configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
With this, all jars in the folder are available in spark-shell.

The problem is that the jars are not on the classpath for the Spark master; more 
precisely, when I submit any job that uses a jar from the external folder, 
java.lang.ClassNotFoundException is thrown.
Moving all external jars into Spark's jars folder solves the problem, but we need 
to keep the external files separate.

Thank you for any help
Best regards,
Jan


Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Jan Botorek
Hello,
I have a problem trying to make jar files available on the classpath when 
submitting a task to Spark.

In my spark-defaults.conf file I have the configuration:
spark.driver.extraClassPath = path/to/folder/with/jars
With this, all jars in the folder are available in spark-shell.

The problem is that the jars are not on the classpath for the Spark master; more 
precisely, when I submit any job that uses a jar from the external folder, 
java.lang.ClassNotFoundException is thrown.
Moving all external jars into Spark's jars folder solves the problem, but we need 
to keep the external files separate.

Thank you for any help
Best regards,
Jan


RE: Help needed in parsing JSon with nested structures

2016-10-31 Thread Jan Botorek
Hello,
From my point of view, it would be more efficient and probably more "readable" 
if you just extracted the required data using some JSON parsing library (GSON, 
Jackson), constructed some global object (or pre-processed the data), and then 
began with the Spark operations.
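
For illustration only, a rough sketch of what I mean with Jackson (the field names are taken from your schema below, the file path is a placeholder, and I have not run this against your data):

import com.fasterxml.jackson.databind.ObjectMapper
import scala.collection.JavaConverters._
import scala.collection.mutable

val json = scala.io.Source.fromFile("/local/copy/of/cdm.json").mkString
val root = new ObjectMapper().readTree(json)

// walk businessEntity[*].payGroup[*].reportingPeriod.worker[*].tax[*]
val result = mutable.Map[String, Double]()
for {
  be     <- root.path("businessEntity").elements().asScala
  pg     <- be.path("payGroup").elements().asScala
  worker <- pg.path("reportingPeriod").path("worker").elements().asScala
  tax    <- worker.path("tax").elements().asScala
} {
  // key = code concatenated with "qtdAmount", value = qtdAmount
  result.put(tax.path("code").asText() + "qtdAmount", tax.path("qtdAmount").asDouble())
}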

Jan

From: Kappaganthu, Sivaram (ES) [mailto:sivaram.kappagan...@adp.com]
Sent: Monday, October 31, 2016 11:50 AM
To: user@spark.apache.org
Subject: Help needed in parsing JSon with nested structures

Hello All,



I am processing a complex nested JSON and below is the schema for it.
root
 |-- businessEntity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- payGroup: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- reportingPeriod: struct (nullable = true)
 |    |    |    |    |    |-- worker: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |    |    |    |-- person: struct (nullable = true)
 |    |    |    |    |    |    |    |-- tax: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- code: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- qtdAmount: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- ytdAmount: double (nullable = true)
My requirement is to create a HashMap with the code concatenated with "qtdAmount" as 
the key and the value of qtdAmount as the value, i.e. Map.put(code + "qtdAmount", qtdAmount). 
How can I do this with Spark?
I tried the shell commands below:
import org.apache.spark.sql._
import org.apache.spark.sql.functions.explode
val sqlcontext = new SQLContext(sc)
val cdm = sqlcontext.read.json("/user/edureka/CDM/cdm.json")
val spark = SparkSession.builder().appName("SQL").config("spark.some.config.option", "some-vale").getOrCreate()
cdm.createOrReplaceTempView("CDM")
val sqlDF = spark.sql("SELECT businessEntity[0].payGroup[0] from CDM").show()
val address = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].person.address from CDM as address")
val worker = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker from CDM")
val tax = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].tax from CDM")
val tax = sqlcontext.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].tax from CDM")
val codes = tax.select(explode(tax("code")))
val codes = tax.withColumn("code", explode(tax("tax.code"))).withColumn("qtdAmount", explode(tax("tax.qtdAmount"))).withColumn("ytdAmount", explode(tax("tax.ytdAmount")))


I am trying to get all the codes and qtdAmounts into a map, but I am not getting 
it right. Using multiple explode statements on a single DataFrame produces a 
Cartesian product of the elements.
Could someone please help with how to parse a JSON this complex in Spark?
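
For reference, the kind of result I am after would, as I understand it, come from a single explode over the tax array rather than one explode per field. A sketch only (column paths assumed from the schema above, null handling simplified, untested):

import org.apache.spark.sql.functions.explode

// explode the array of tax structs once, then read code/qtdAmount out of the struct,
// instead of exploding code, qtdAmount and ytdAmount separately
val taxes = cdm.select(
  explode(cdm("businessEntity")(0)("payGroup")(0)("reportingPeriod")("worker")(0)("tax")).as("t"))
val pairs = taxes.select(taxes("t.code").as("code"), taxes("t.qtdAmount").as("qtdAmount"))

// collect into the requested map shape (assumes non-null values)
val m = pairs.collect().map(r => (r.getString(0) + "qtdAmount", r.getDouble(1))).toMap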


Thanks,
Sivaram




RE: No of partitions in a Dataframe

2016-10-27 Thread Jan Botorek
Hello, Nipun
In my opinion, "converting the dataframe to an RDD" wouldn't be a costly operation, 
since DataFrame (Dataset) operations are always executed as RDDs under the hood. 
I don't know which Spark version you are using, but I assume it is 2.0.
I would therefore go for:

dataFrame.rdd.partitions

That returns an Array of partitions (written in Scala).
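
A minimal spark-shell sketch, assuming Spark 2.0 (the DataFrame here is just an example):

val df = spark.range(0, 1000).toDF("id")   // any DataFrame
println(df.rdd.partitions.length)          // number of partitions
println(df.rdd.getNumPartitions)           // equivalent shortcut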

Regards,
Jan

From: Nipun Parasrampuria [mailto:paras...@umn.edu]
Sent: Thursday, October 27, 2016 12:01 AM
To: user@spark.apache.org
Subject: No of partitions in a Dataframe


How do I find the number of partitions in a dataframe without converting the 
dataframe to an RDD (I'm assuming that is a costly operation)?

If there's no way to do so, I wonder why the API doesn't include a method like 
that (an explanation for why such a method would be useless, perhaps).

Thanks!
Nipun