Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible: Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the 
data it creates for manipulation within Spark. 


On Monday, December 7, 2015 1:15 PM, Jia  wrote:
 

Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can 
connect to multiple storage backends. However, because our data is also held in 
memory, I suspect that connecting to Spark directly may be more efficient. But I 
definitely need to look at Tachyon more carefully, in case it has a very 
efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

Maybe looking into something like Tachyon would help. I see some sample C++ 
bindings, but I'm not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing 
list. Yes, we have a distributed C++ application that will store data on each 
node in the cluster, and we hope to leverage Spark to do more fancy analytics 
on those data. But we need high performance; that’s why we want shared memory.
Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

-dev, +user (this is not a question about development of Spark itself, so you’ll 
get more answers on the user mailing list)
First up, let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering, but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system: it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs, using named memory-mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia  wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process.
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and the JVM process.
It seems it may be possible to use named memory-mapped files and JNI to do 
this, but I wonder whether there are any existing efforts or a more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia











  

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
I guess you could write a custom RDD that can read data from a memory-mapped 
file - not really my area of expertise so I’ll leave it to other members of the 
forum to chip in with comments as to whether that makes sense. 
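
A minimal sketch of the kind of read path being discussed - mapping a per-node 
file into memory inside mapPartitions rather than building a full custom RDD. 
The path and record layout (8-byte little-endian doubles) are illustrative 
assumptions, and each task here simply maps the same local file:

import java.nio.ByteOrder
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Hypothetical file that the C++ process keeps on a tmpfs / shared-memory
// mount on every worker node; the fixed-width double layout is made up.
val localPath = "/dev/shm/cpp_output.bin"

val doubles = sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
  .mapPartitions { _ =>
    val channel = FileChannel.open(Paths.get(localPath), StandardOpenOption.READ)
    val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
    buffer.order(ByteOrder.LITTLE_ENDIAN)
    // Pull fixed-width records straight out of the mapped region; the pages
    // are shared with the writing C++ process through the OS page cache.
    Iterator.continually(buffer).takeWhile(_.hasRemaining).map(_.getDouble)
  }

Task placement (making sure each task reads the file on its own node) and the 
lifetime of the mapped buffer would still need real design work, which is where 
a proper custom RDD with preferred locations would come in.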

But if you want ‘fancy analytics’, then won’t the processing time more than 
outweigh the savings from using memory-mapped files? Particularly if your 
analytics involve any kind of aggregation of data across data nodes. Have you 
looked at a Lambda architecture? It could involve Spark but wouldn’t 
necessarily mean you have to go to the trouble of implementing a custom 
memory-mapped file reading feature.
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 7 Dec 2015, at 17:32, Jia  wrote:
> 
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node 
> in the cluster, and we hope to leverage Spark to do more fancy analytics on 
> those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East  > wrote:
> 
>> -dev, +user (this is not a question about development of Spark itself so 
>> you’ll get more answers in the user mailing list)
>> 
>> First up let me say that I don’t really know how this could be done - I’m 
>> sure it would be possible with enough tinkering but it’s not clear what you 
>> are trying to achieve. Spark is a distributed processing system, it has 
>> multiple JVMs running on different machines that each run a small part of 
>> the overall processing. Unless you have some sort of idea to have multiple 
>> C++ processes collocated with the distributed JVMs using named memory mapped 
>> files doesn’t make architectural sense. 
>> ---
>> Robin East
>> Spark GraphX in Action Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 6 Dec 2015, at 20:43, Jia >> > wrote:
>>> 
>>> Dears, for one project, I need to implement something so Spark can read 
>>> data from a C++ process. 
>>> To provide high performance, I really hope to implement this through shared 
>>> memory between the C++ process and Java JVM process.
>>> It seems it may be possible to use named memory-mapped files and JNI to do 
>>> this, but I wonder whether there are any existing efforts or a more 
>>> efficient approach to do this?
>>> Thank you very much!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> 
>> 
> 



Fwd: Oozie SparkAction not able to use spark conf values

2015-12-07 Thread Rajadayalan Perumalsamy
Hi

We are trying to change our existing Oozie workflows to use SparkAction
instead of ShellAction.
We are passing Spark configuration in spark-opts with --conf, but these
values are not accessible in Spark and it throws an error.

Please note that we are able to use SparkAction successfully in yarn-cluster
mode when we do not pass these Spark configurations. I have attached the Oozie
workflow.xml, the Oozie log, and the YARN container Spark log files.

Thanks
Raja
Log Type: stderr
Log Upload Time: Fri Dec 04 10:26:17 -0800 2015
Log Length: 6035
15/12/04 10:26:04 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
15/12/04 10:26:05 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1447700095990_2804_02
15/12/04 10:26:06 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/12/04 10:26:06 INFO spark.SecurityManager: Changing view acls to: dev
15/12/04 10:26:06 INFO spark.SecurityManager: Changing modify acls to: dev
15/12/04 10:26:06 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(dev); users with 
modify permissions: Set(dev)
15/12/04 10:26:06 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
15/12/04 10:26:06 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
15/12/04 10:26:06 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
15/12/04 10:26:06 ERROR yarn.ApplicationMaster: User class threw exception: null
java.lang.ExceptionInInitializerError
at com.toyota.genericmodule.info.Info$.(Info.scala:20)
at com.toyota.genericmodule.info.Info$.(Info.scala)
at com.toyota.genericmodule.info.Info.main(Info.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
Caused by: com.typesafe.config.ConfigException$UnresolvedSubstitution: 
application.conf: 53: Could not resolve substitution to a value: ${inputdb}
at 
com.typesafe.config.impl.ConfigReference.resolveSubstitutions(ConfigReference.java:84)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ConfigConcatenation.resolveSubstitutions(ConfigConcatenation.java:178)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ConfigDelayedMerge.resolveSubstitutions(ConfigDelayedMerge.java:96)
at 
com.typesafe.config.impl.ConfigDelayedMerge.resolveSubstitutions(ConfigDelayedMerge.java:59)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ResolveSource.lookupSubst(ResolveSource.java:62)
at 
com.typesafe.config.impl.ConfigReference.resolveSubstitutions(ConfigReference.java:73)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ConfigConcatenation.resolveSubstitutions(ConfigConcatenation.java:178)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.ResolveSource.lookupSubst(ResolveSource.java:62)
at 
com.typesafe.config.impl.ConfigReference.resolveSubstitutions(ConfigReference.java:73)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
at 
com.typesafe.config.impl.SimpleConfigObject$1.modifyChildMayThrow(SimpleConfigObject.java:340)
at 
com.typesafe.config.impl.SimpleConfigObject.modifyMayThrow(SimpleConfigObject.java:279)
at 
com.typesafe.config.impl.SimpleConfigObject.resolveSubstitutions(SimpleConfigObject.java:320)
at 
com.typesafe.config.impl.SimpleConfigObject.resolveSubstitutions(SimpleConfigObject.java:24)
at 
com.typesafe.config.impl.ResolveSource.resolveCheckingReplacement(ResolveSource.java:110)
at 
com.typesafe.config.impl.ResolveContext.resolve(ResolveContext.java:114)
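
The UnresolvedSubstitution above suggests that ${inputdb} in application.conf 
expects a JVM system property (or another config key) that --conf spark.* 
settings in spark-opts do not supply. One possible workaround - a sketch only, 
assuming the value is needed just by the user code and that the workflow passes 
a spark.* key such as the illustrative spark.myapp.inputdb - is to read it from 
SparkConf and overlay it onto the Typesafe config before resolving:

import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
// Assumes spark-opts contains: --conf spark.myapp.inputdb=somedb (key name is made up)
val inputDb = sparkConf.get("spark.myapp.inputdb")

// Parse application.conf without resolving (ConfigFactory.load() resolves eagerly
// and would throw the same UnresolvedSubstitution), then overlay the value so
// that ${inputdb} can resolve.
val config = ConfigFactory
  .parseString(s"inputdb = $inputDb")
  .withFallback(ConfigFactory.parseResources("application.conf"))
  .resolve()

Another route that may be worth trying is passing the value as a JVM system 
property, e.g. --conf spark.driver.extraJavaOptions=-Dinputdb=somedb, since 
Typesafe Config resolves substitutions against system properties; whether 
Oozie's SparkAction forwards that cleanly in yarn-cluster mode would need to be 
verified.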
   

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
What you're requesting, reading and writing to a C++ process, is not part of 
that requirement.

 


On Monday, December 7, 2015 1:42 PM, Jia  wrote:
 

Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDFs in C++; I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.
Best Regards,
Jia

On Dec 7, 2015, at 12:26 PM, Annabel Melongo  wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible: Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the 
data it creates for manipulation within Spark. 


On Monday, December 7, 2015 1:15 PM, Jia  wrote:
 

Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can 
connect to multiple storage backends. However, because our data is also held in 
memory, I suspect that connecting to Spark directly may be more efficient. But I 
definitely need to look at Tachyon more carefully, in case it has a very 
efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

Maybe looking into something like Tachyon would help. I see some sample C++ 
bindings, but I'm not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing 
list. Yes, we have a distributed C++ application that will store data on each 
node in the cluster, and we hope to leverage Spark to do more fancy analytics 
on those data. But we need high performance; that’s why we want shared memory.
Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

-dev, +user (this is not a question about development of Spark itself, so you’ll 
get more answers on the user mailing list)
First up, let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering, but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system: it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs, using named memory-mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia  wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process.
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and the JVM process.
It seems it may be possible to use named memory-mapped files and JNI to do 
this, but I wonder whether there are any existing efforts or a more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia











   



  

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
This is a very helpful article. Thanks for the help.

Ningjun

From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Monday, December 07, 2015 12:42 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to create dataframe from SQL Server SQL query

Hi Ningjun,

Haven't done this myself, saw your question and was curious about the answer 
and found this article which you might find useful:
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/

According to this article, you can pass your SQL statement in the "dbtable" 
option as a derived table, i.e. something like:

val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql:dbserver",
    "dbtable" -> "(select docid, title, docText from dbo.document where docid between 10 and 1000) tmp"
  )).load()

(Note the trailing alias on the subquery - most databases require one for a 
derived table.)

-sujit

On Mon, Dec 7, 2015 at 8:26 AM, Wang, Ningjun (LNG-NPV) 
> wrote:
How can I create an RDD from a SQL query against a SQL Server database? Here is 
an example of creating a dataframe:

http://spark.apache.org/docs/latest/sql-programming-guide.html#overview


val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
  "dbtable" -> "schema.tablename")).load()

This code creates a dataframe from a table. How can I create a dataframe from a 
query, e.g. “select docid, title, docText from dbo.document where docid between 
10 and 1000”?

Ningjun




Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
Annabel

Spark works very well with data stored in HDFS but is certainly not tied to it. 
Have a look at the wide variety of connectors to things like Cassandra, HBase, 
etc.

Robin

Sent from my iPhone

> On 7 Dec 2015, at 18:50, Annabel Melongo  wrote:
> 
> Jia,
> 
> I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
> What you're requesting, reading and writing to a C++ process, is not part of 
> that requirement.
> 
> 
> 
> 
> 
> On Monday, December 7, 2015 1:42 PM, Jia  wrote:
> 
> 
> Thanks, Annabel, but I may need to clarify that I have no intention to write 
> and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
> data to a C++ process with zero copy.
> 
> Best Regards,
> Jia
>  
> 
> 
>> On Dec 7, 2015, at 12:26 PM, Annabel Melongo  
>> wrote:
>> 
>> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
>> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
>> and R.
>> 
>> The best way to achieve this is to run your application in C++ and use the 
>> data it creates for manipulation within Spark.
>> 
>> 
>> 
>> On Monday, December 7, 2015 1:15 PM, Jia  wrote:
>> 
>> 
>> Thanks, Dewful!
>> 
>> My impression is that Tachyon is a very nice in-memory file system that can 
>> connect to multiple storages.
>> However, because our data is also held in memory, I suspect that connecting 
>> to Spark directly may be more efficient.
>> But definitely I need to look at Tachyon more carefully, in case it has a 
>> very efficient C++ binding mechanism.
>> 
>> Best Regards,
>> Jia
>> 
>>> On Dec 7, 2015, at 11:46 AM, Dewful  wrote:
>>> 
>>> Maybe looking into something like Tachyon would help, I see some sample c++ 
>>> bindings, not sure how much of the current functionality they support...
>>> Hi, Robin, 
>>> Thanks for your reply and thanks for copying my question to user mailing 
>>> list.
>>> Yes, we have a distributed C++ application, that will store data on each 
>>> node in the cluster, and we hope to leverage Spark to do more fancy 
>>> analytics on those data. But we need high performance, that’s why we want 
>>> shared memory.
>>> Suggestions will be highly appreciated!
>>> 
>>> Best Regards,
>>> Jia
>>> 
 On Dec 7, 2015, at 10:54 AM, Robin East  wrote:
 
 -dev, +user (this is not a question about development of Spark itself so 
 you’ll get more answers in the user mailing list)
 
 First up let me say that I don’t really know how this could be done - I’m 
 sure it would be possible with enough tinkering but it’s not clear what 
 you are trying to achieve. Spark is a distributed processing system, it 
 has multiple JVMs running on different machines that each run a small part 
 of the overall processing. Unless you have some sort of idea to have 
 multiple C++ processes collocated with the distributed JVMs using named 
 memory mapped files doesn’t make architectural sense. 
 ---
 Robin East
 Spark GraphX in Action Michael Malak and Robin East
 Manning Publications Co.
 http://www.manning.com/books/spark-graphx-in-action
 
 
 
 
 
> On 6 Dec 2015, at 20:43, Jia  wrote:
> 
> Dears, for one project, I need to implement something so Spark can read 
> data from a C++ process. 
> To provide high performance, I really hope to implement this through 
> shared memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory-mapped files and JNI to 
> do this, but I wonder whether there are any existing efforts or a more 
> efficient approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 
> 
> 
> 
> 


How to build Spark with Ganglia to enable monitoring using Ganglia

2015-12-07 Thread SRK
Hi,

How do I do a Maven build that enables monitoring using Ganglia? What is the
command for that?

Thanks,
Swetha
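
For reference, the Ganglia sink sits behind a separate Maven profile because of 
its LGPL dependency, so a build along these lines should pull it in (the Hadoop 
profile and version are just examples):

mvn -Pspark-ganglia-lgpl -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

The sink then still needs to be enabled in conf/metrics.properties, for example 
(host and port are placeholders):

*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=your-ganglia-host
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds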



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-build-Spark-with-Ganglia-to-enable-monitoring-using-Ganglia-tp25625.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files, and 
it turns out there are a number of duplicate JSON objects in the source data. I 
am trying to find the best way to remove these duplicates before using the 
dataframe.

With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT *…''') 
the application is not able to complete a shuffle stage due to lost executors. 
Is there a more efficient way to remove these duplicate rows? If not, what 
settings can I tweak to help this succeed? I have tried both increasing and 
decreasing the number of default shuffle partitions (to 500 and 100, 
respectively), but neither changes the behavior.
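
For what it's worth, a sketch of two knobs that sometimes help here: 
deduplicating only on the columns that define row identity so the shuffle keys 
stay small, and raising spark.sql.shuffle.partitions so each shuffle task 
handles less data. The column names and paths below are assumptions, not taken 
from the original job:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dedup")
sqlContext = SQLContext(sc)

# More, smaller shuffle tasks are often less likely to lose executors
# than a few very large ones.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

df = sqlContext.read.json("hdfs:///data/events/*.json")  # hypothetical source

# Deduplicate on an assumed identity subset rather than on every column.
deduped = df.dropDuplicates(["id", "event_time"])
deduped.write.parquet("hdfs:///data/events_deduped")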



python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Keith Freeman
I'm not a python expert, so I'm wondering if anybody has a working 
example of a partitioner for the "partitionFunc" argument (default 
"portable_hash") to rdd.partitionBy()?


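Not an authoritative answer, but a small sketch of how the argument is used: 
partitionFunc maps a key to an integer, and partitionBy takes that value modulo 
numPartitions to pick the target partition. The first-letter scheme below is 
purely illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="custom-partitioner")

def first_letter_partitioner(key):
    # Keys sharing a first letter land in the same partition, because
    # partitionBy computes first_letter_partitioner(key) % numPartitions.
    return ord(key[0].lower())

pairs = sc.parallelize([("apple", 1), ("avocado", 2), ("banana", 3), ("cherry", 4)])
partitioned = pairs.partitionBy(26, partitionFunc=first_letter_partitioner)

# Show which keys ended up together.
print(partitioned.glom().map(lambda part: [k for k, _ in part]).collect())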


