Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread ayan guha
On YARN, logs are aggregated from each container to HDFS. You can use the YARN CLI or UI to view them. For Spark, you would have a history server which consolidates the logs On 21 Sep 2016 19:03, "Nisha Menon" wrote: > I looked at the driver logs, that reminded me that I needed
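
For reference, assuming log aggregation is enabled on the cluster (yarn.log-aggregation-enable=true), the aggregated logs of a finished application can be pulled with the YARN CLI, for example:

  yarn logs -applicationId application_1474430954675_0001 > app.log

The application id above is only a placeholder; the real one is printed by spark-submit and shown in the ResourceManager UI. The Spark history server (started with sbin/start-history-server.sh, UI on port 18080 by default) covers the Spark event logs rather than the container stdout/stderr.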

SPARK PERFORMANCE TUNING

2016-09-21 Thread Trinadh Kaja
Hi all, how to increase Spark performance? I am using PySpark. Cluster info: total memory: 600 GB, cores: 96. Command: spark-submit --master yarn-client --executor-memory 10G --num-executors 50 --executor-cores 2 --driver-memory 10g --queue thequeue. Please help on this -- Thanks

Re: Missing output partition file in S3

2016-09-21 Thread Steve Loughran
On 19 Sep 2016, at 18:54, Chen, Kevin > wrote: Hi Steve, Our S3 is on US east. But this issue also occurred when we were using an S3 bucket on US west. We are using S3n. We use Spark standalone deployment. We run the job in EC2. The datasets

How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
I looked at the driver logs, and that reminded me that I needed to look at the executor logs. There the issue was that the Spark executors were not getting a configuration file. I broadcast the file and now the processing happens. Thanks for the suggestion. Currently my issue is that the log file
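
For context, a minimal sketch of one way to ship a side file to every executor (the path and the process() helper are hypothetical):

  import org.apache.spark.SparkFiles

  sc.addFile("/local/path/app.conf")          // distributed to each executor's working directory
  rdd.map { record =>
    val confPath = SparkFiles.get("app.conf") // absolute local path on whichever executor runs the task
    process(record, confPath)                 // process() is a placeholder for the user's logic
  }

Broadcasting the parsed contents with sc.broadcast(...) works equally well when the file is small.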

Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
Well I have already tried that. You are talking about a command similar to this, right? yarn logs -applicationId <application_id> This gives me the processing logs, which contain information about the tasks, RDD blocks etc. What I really need is the output log that gets generated as part of

Re: Spark tasks blockes randomly on standalone cluster

2016-09-21 Thread bogdanbaraila
Does anyone have any ideas on what may be happening? Regards, Bogdan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-tasks-blockes-randomly-on-standalone-cluster-tp27693p27769.html Sent from the Apache Spark User List mailing list archive at

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Yan Facai
Thanks, Peter. It works! Why is a udf needed? On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi wrote: > Hi Yan, I agree, it IS really confusing. Here is the technique for > transforming a column. It is very general because you can make "myConvert" > do whatever

unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found

2016-09-21 Thread muhammet pakyürek
When I run the spark-shell as below: spark-shell --jars '/home/ktuser/spark-cassandra-connector/target/scala-2.11/root_2.11-2.0.0-M3-20-g75719df.jar' --packages datastax:spark-cassandra-connector:2.0.0-s_2.11-M3-20-g75719df --conf spark.cassandra.connection.host=localhost I get the error

Re: Similar Items

2016-09-21 Thread Nick Pentreath
Sorry, the original repo: https://github.com/karlhigley/spark-neighbors On Wed, 21 Sep 2016 at 13:09 Nick Pentreath wrote: > I should also point out another library I had not come across before : > https://github.com/sethah/spark-neighbors > > > On Tue, 20 Sep 2016 at

increase spark performance

2016-09-21 Thread Trinadh Kaja
Hi all, how to increase Spark performance? Cluster info: total memory: 600 GB, cores -- Thanks K.Trinadh Ph-7348826118

Re: Similar Items

2016-09-21 Thread Nick Pentreath
I should also point out another library I had not come across before : https://github.com/sethah/spark-neighbors On Tue, 20 Sep 2016 at 21:03 Kevin Mellott wrote: > Using the Soundcloud implementation of LSH, I was able to process a 22K > product dataset in a mere 65

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Jörn Franke
Do you mind sharing what your software does? What is the input data size? What is the spark version and apis used? How many nodes? What is the input data format? Is compression used? > On 21 Sep 2016, at 13:37, Trinadh Kaja wrote: > > Hi all, > > how to increase spark

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Peter Figliozzi
I'm sure there's another way to do it; I hope someone can show us. I couldn't figure out how to use `map` either. On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) wrote: > Thanks, Peter. > It works! > > Why udf is needed? > > > > > On Wed, Sep 21, 2016 at 12:00 AM, Peter
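
For completeness, a rough sketch of the udf technique in Spark 2.0 Scala (the column names and the comma-separated input format are assumptions):

  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.sql.functions.udf

  // parses "1.0,2.0,3.0" into an ml Vector
  val myConvert = udf { s: String =>
    Vectors.dense(s.split(",").map(_.toDouble))
  }
  val withVec = df.withColumn("features", myConvert(df("vecAsString")))

The udf wrapper is what lets Spark apply an arbitrary Scala function to a Column inside the DataFrame API; a plain map would instead force the work through a typed Dataset.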

Re: OutOfMemory while calculating window functions

2016-09-21 Thread Jeremy Davis
Here is a unit test that will OOM a 10G heap --

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.junit.Test
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import scala.collection.mutable.ArrayBuffer

/** * A

Re: The coming data on Spark Streaming

2016-09-21 Thread pcandido
Anybody? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-coming-data-on-Spark-Streaming-tp27720p27771.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To

spark stream based deduplication

2016-09-21 Thread backtrack5
I want to do a hash-based comparison to find duplicate records. Each record I receive from the stream will have hashid and recordid fields in it. 1. I want to have all the historic records (hashid, recordid --> key, value) in an in-memory RDD 2. When a new record is received in the Spark DStream RDD I want to compare
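
A rough sketch of this kind of stateful dedup with mapWithState (Spark 1.6+; the stream type and names are assumptions):

  import org.apache.spark.streaming.{State, StateSpec}

  // stream: DStream[(String, String)] of (hashid, recordid)
  def dedup(hashid: String, recordid: Option[String],
            state: State[String]): (String, Boolean) = {
    val seenBefore = state.exists()
    if (!seenBefore) recordid.foreach(state.update)
    (hashid, seenBefore)                       // true means the hash was seen before
  }
  val flagged = stream.mapWithState(StateSpec.function(dedup _))

Stateful operations require checkpointing (ssc.checkpoint(...)), and the historic keys then live in the streaming job's state rather than in a separate in-memory RDD.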

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Mich Talebzadeh
LOL I think we should try the crystal ball to answer this question. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Get profile from sbt

2016-09-21 Thread Bedrytski Aliaksandr
Hi Saurabh, you may use BuildInfo[1] sbt plugin to access values defined in build.sbt Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Mon, Sep 19, 2016, at 18:28, Saurabh Malviya (samalviy) wrote: > Hi, > > Is there any way equivalent to profiles in maven in sbt. I want spark > build
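
For illustration, a minimal sbt-buildinfo setup might look like this (the plugin version shown is only indicative):

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-buildinfo" % "0.6.1")

  // build.sbt
  lazy val root = (project in file("."))
    .enablePlugins(BuildInfoPlugin)
    .settings(
      buildInfoKeys := Seq[BuildInfoKey](name, version, scalaVersion),
      buildInfoPackage := "myapp"
    )

This generates a myapp.BuildInfo object whose fields can be read by the Spark application at runtime.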

Apache Spark JavaRDD pipe() need help

2016-09-21 Thread shashikant.kulka...@gmail.com
Hi All, I am trying to use the JavaRDD.pipe() API. I have one object with me from the JavaRDD and not the complete RDD. I mean I am operating on one object inside the RDD. In my object I have some attribute values using which I create one string like "param1 param2 param3 param4". I have one C
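
As far as the pipe() part goes, a rough sketch (shown in Scala; the Java API is equivalent, and the record fields and binary path are assumptions):

  // one whitespace-separated line per element, fed to the external program's stdin
  val lines = records.map(r => s"${r.param1} ${r.param2} ${r.param3} ${r.param4}")
  val piped = lines.pipe("/path/to/external-binary")   // RDD of the program's stdout lines
  piped.collect().foreach(println)

pipe() operates on a whole RDD, so the usual approach is to format each element as one input line rather than piping a single object.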

Re: problems with checkpoint and spark sql

2016-09-21 Thread Dhimant
Hi David, You got any solution for this ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problems-with-checkpoint-and-spark-sql-tp26080p27773.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Bizarre behavior using Datasets/ML on Spark 2.0

2016-09-21 Thread Miles Crawford
Hello folks. I recently migrated my application to Spark 2.0, and everything worked well, except for one function that uses "toDS" and the ML libraries. This stage used to complete in 15 minutes or so on 1.6.2, and now takes almost two hours. The UI shows very strange behavior - completed stages

Re: unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found

2016-09-21 Thread Kevin Mellott
The "unresolved dependency" error is stating that the datastax dependency could not be located in the Maven repository. I believe that this should work if you change that portion of your command to the following. --packages com.datastax.spark:spark-cassandra-connector_2.10:2.0.0-M3 You can

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
This is happening with Sqoop and also when putting data into an HBase table from the command line. Sqoop 1.4.6, Hadoop 2.7.3, Hive 2.0.1. I am still getting this error when using sqoop to get a simple table's data from Oracle. sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username sh -P \

Re: How to use a custom filesystem provider?

2016-09-21 Thread Steve Loughran
On 21 Sep 2016, at 20:10, Jean-Philippe Martin > wrote: The full source for my example is available on github. I'm using maven to depend on

Re: How to use a custom filesystem provider?

2016-09-21 Thread Jean-Philippe Martin
> > There's a bit of confusion setting in here; the FileSystem implementations > spark uses are subclasses of org.apache.hadoop.fs.FileSystem; the nio > class with the same name is different. > grab the google cloud storage connector and put it on your classpath I was using the gs:// filesystem
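
For reference, a sketch of pointing Spark's Hadoop configuration at the GCS connector (assuming the gcs-connector jar is on the driver and executor classpath):

  val spark = org.apache.spark.sql.SparkSession.builder()
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .getOrCreate()
  val lines = spark.sparkContext.textFile("gs://my-bucket/some/path/*.txt")

These are the Hadoop FileSystem bindings referred to above; the java.nio gcloud-java-nio provider is a separate mechanism that sc.textFile() does not consult.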

Spark writing to elasticsearch asynchronously

2016-09-21 Thread Sunita Arvind
Hello Experts, Is there a way to get spark to write to elasticsearch asynchronously? Below are the details http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously regards Sunita

How to use a custom filesystem provider?

2016-09-21 Thread Jean-Philippe Martin
The full source for my example is available on github. I'm using maven to depend on gcloud-java-nio, which provides a Java FileSystem for Google Cloud Storage, via

Equivalent to --files for driver?

2016-09-21 Thread Everett Anderson
Hi, I'm running Spark 1.6.2 on YARN and I often use the cluster deploy mode with spark-submit. While the --files param is useful for getting files onto the cluster in the working directories of the executors, the driver's working directory doesn't get them. Is there some equivalent to --files

Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Michael Segel
I’ve asked this question a couple of times of a friend who didn’t know the answer… so I thought I would try here. Suppose we launch a job on a cluster (YARN) and we have set up the containers to be 3GB in size. What does that 3GB represent? I mean, what happens if we end up using 2-3GB

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
Well, I am left to use Spark for importing data from an RDBMS table to Hadoop. You may ask why, and it is because Spark does it in one process with no errors. With sqoop I am getting this error message, which leaves the RDBMS table data on an HDFS file but stops there. 2016-09-21 21:00:15,084 [myid:] -

Re: Sqoop vs spark jdbc

2016-09-21 Thread Jörn Franke
I think there might be still something messed up with the classpath. It complains in the logs about deprecated jars and deprecated configuration files. > On 21 Sep 2016, at 22:21, Mich Talebzadeh wrote: > > Well I am left to use Spark for importing data from RDBMS

Re: Hbase Connection not seraializible in Spark -> foreachrdd

2016-09-21 Thread Tathagata Das
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Wed, Sep 21, 2016 at 4:26 PM, ayan guha wrote: > Connection object is not serialisable. You need to implement a getorcreate > function which would run on each

Re: Sqoop vs spark jdbc

2016-09-21 Thread Don Drake
We just had this conversation at work today. We have a long sqoop pipeline and I argued to keep it in sqoop since we can take advantage of OraOop (direct mode) for performance and spark can't match that AFAIK. Sqoop also allows us to write directly into parquet format, which then Spark can read

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
I do not know why this happening. Trying to load an Hbase table at command line hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,c1,c2" t2 hdfs://rhes564:9000/tmp/crap.txt Comes back with this error 2016-09-22 00:12:46,576 INFO

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Sean Owen
No, Xmx only controls the maximum size of on-heap allocated memory. The JVM doesn't manage/limit off-heap (how could it? it doesn't know when it can be released). The answer is that YARN will kill the process because it's using more memory than it asked for. A JVM is always going to use a little
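
In practice, on YARN the knob for that headroom in Spark 1.6/2.0 is the executor memory overhead, e.g. something like:

  spark-submit ... \
    --conf spark.executor.memory=3g \
    --conf spark.yarn.executor.memoryOverhead=1024

The overhead (in MB, default roughly max(384, 10% of executor memory)) is added on top of the heap when sizing the YARN container request; the value shown above is only illustrative.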

Re: Hbase Connection not seraializible in Spark -> foreachrdd

2016-09-21 Thread ayan guha
Connection object is not serialisable. You need to implement a getOrCreate function which would run on each executor to create the HBase connection locally. On 22 Sep 2016 08:34, "KhajaAsmath Mohammed" wrote: > Hello Everyone, > > I am running spark application to push data
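
A minimal sketch of that pattern (HBase 1.x client API; the table name and the toPut() helper are placeholders, and the Kerberos login is omitted):

  import org.apache.hadoop.hbase.TableName
  import org.apache.hadoop.hbase.client.ConnectionFactory

  object HBaseConn {
    // initialised lazily once per executor JVM, never shipped from the driver
    lazy val connection = ConnectionFactory.createConnection()
  }

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val table = HBaseConn.connection.getTable(TableName.valueOf("my_table"))
      records.foreach(r => table.put(toPut(r)))   // toPut() builds an HBase Put (placeholder)
      table.close()
    }
  }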

RE: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Arif,Mubaraka
We installed it but the kernel dies. Any clue why? Thanks for the link :) ~muby From: Jakob Odersky [ja...@odersky.com] Sent: Wednesday, September 21, 2016 4:54 PM To: Arif,Mubaraka Cc: User; Toivola,Sami Subject: Re: Has anyone installed the scala

Spark Application Log

2016-09-21 Thread Divya Gehlot
Hi, I have initialised the logging in my Spark app:

/* Initialize Logging */
val log = Logger.getLogger(getClass.getName)
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
log.warn("Some text" + Somemap.size)

When I run my Spark job using spark-submit like

Re: Israel Spark Meetup

2016-09-21 Thread Sean Owen
Done. On Wed, Sep 21, 2016 at 5:53 AM, Romi Kuntsman wrote: > Hello, > Please add a link in Spark Community page > (https://spark.apache.org/community.html) > To Israel Spark Meetup (https://www.meetup.com/israel-spark-users/) > We're an active meetup group, unifying the

How to write multiple outputs in avro format in spark(java)?

2016-09-21 Thread Mahebub Sayyed
Hello, Currently I am writing multiple text files based on keys. Code:

public class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat {
    @Override
    protected String generateFileNameForKeyValue(String key, String value, String name) {
        return
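
One possible alternative sketch for the Avro case, going through the DataFrame writer instead of a custom OutputFormat (assumes the databricks spark-avro package is on the classpath and that the pairs can be expressed as a DataFrame):

  import spark.implicits._
  val df = pairRdd.toDF("key", "value")        // pairRdd: RDD[(String, String)] is an assumption
  df.write
    .format("com.databricks.spark.avro")
    .partitionBy("key")                        // one sub-directory per distinct key
    .save("/output/path")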

Hbase Connection not seraializible in Spark -> foreachrdd

2016-09-21 Thread KhajaAsmath Mohammed
Hello Everyone, I am running a spark application to push data from kafka. I am able to get the hbase kerberos connection successfully outside of the function, before calling foreachrdd on the DStream. The job fails inside foreachrdd stating that the hbaseconnection object is not serialized. Could you please let me know

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Jörn Franke
All off-heap memory is still managed by the JVM process. If you limit the memory of this process then you limit the memory. I think the memory of the JVM process could be limited via the xms/xmx parameter of the JVM. This can be configured via spark options for yarn (be aware that they are

Re: Apache Spark JavaRDD pipe() need help

2016-09-21 Thread Jakob Odersky
Can you provide more details? It's unclear what you're asking On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com wrote: > Hi All, > > I am trying to use the JavaRDD.pipe() API. > > I have one object with me from the JavaRDD

Re: Task Deserialization Error

2016-09-21 Thread Gokula Krishnan D
Hello Sumit - I could see that the SparkConf() specification is not mentioned in your program, but the rest looks good. Output: By the way, I have used the README.md template https://gist.github.com/jxson/1784669 Thanks & Regards, Gokula Krishnan (Gokul) On Tue, Sep 20, 2016 at 2:15 AM,

Re: Task Deserialization Error

2016-09-21 Thread Jakob Odersky
Your app is fine; I think the error has to do with the way IntelliJ launches applications. Is your app forked in a new JVM when you run it? On Wed, Sep 21, 2016 at 2:28 PM, Gokula Krishnan D wrote: > Hello Sumit - > > I could see that SparkConf() specification is not being

Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Arif,Mubaraka
Has anyone installed the Scala kernel for Jupyter notebook? Any blogs or steps to follow are appreciated. thanks, Muby - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Jakob Odersky
One option would be to use Apache Toree. A quick setup guide can be found here https://toree.incubator.apache.org/documentation/user/quick-start On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka wrote: > Has anyone installed the scala kernel for Jupyter notebook. > > > > Any
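
The quick-start boils down to roughly the following (treat this as a sketch; the exact package name and version current at the time may differ):

  pip install toree
  jupyter toree install --spark_home=$SPARK_HOME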

Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Chawla,Sumit
+1 Jakob. Thanks for the link Regards Sumit Chawla On Wed, Sep 21, 2016 at 2:54 PM, Jakob Odersky wrote: > One option would be to use Apache Toree. A quick setup guide can be > found here https://toree.incubator.apache.org/documentation/user/ > quick-start > > On Wed, Sep

Re: Task Deserialization Error

2016-09-21 Thread Chawla,Sumit
Thanks guys. It was a ClassLoader issue. Rather than linking to the SPARK_HOME/assembly/target/scala-2.11/jars/ folder, I was linking the individual jars. Linking to the folder instead solved the issue for me. Regards Sumit Chawla On Wed, Sep 21, 2016 at 2:51 PM, Jakob Odersky