In Spark-scala, how to fill Vectors.dense in DataFrame from CSV?

2016-09-22 Thread Dan Bikle
Hello spark-world, I am new to Spark. I noticed this online example: http://spark.apache.org/docs/latest/ml-pipeline.html I am curious about this syntax: // Prepare training data from a list of (label, features) tuples. val training = spark.createDataFrame(Seq( (1.0,
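For reference, the snippet from the linked page continues roughly like this (a sketch of the documented example, abbreviated to two rows):

    import org.apache.spark.ml.linalg.Vectors

    // Prepare training data from a list of (label, features) tuples.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0))
    )).toDF("label", "features")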

pyspark ML example not working

2016-09-22 Thread jypucca
I installed Spark 2.0.0 and was trying the ML example (IndexToString) on this web page: http://spark.apache.org/docs/latest/ml-features.html#onehotencoder, using a Jupyter notebook (running PySpark) to create a simple dataframe, and I keep getting a long error message (see below). PySpark has
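For comparison, the documented IndexToString usage looks roughly like this in Scala (a sketch based on the linked page; the PySpark API mirrors these calls):

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val df = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a")
    )).toDF("id", "category")

    // StringIndexer encodes the labels; IndexToString reverses the mapping
    // using the metadata written into the indexed column.
    val indexed = new StringIndexer()
      .setInputCol("category").setOutputCol("categoryIndex")
      .fit(df).transform(df)

    new IndexToString()
      .setInputCol("categoryIndex").setOutputCol("originalCategory")
      .transform(indexed)
      .show()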

Re: spark stream on yarn oom

2016-09-22 Thread manasdebashiskar
It appears that the Spark version your program is compiled against is different from the Spark version you are running your code on. ..Manas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-stream-on-yarn-oom-tp27766p27782.html Sent from
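As a concrete illustration, one way to keep the two in lockstep is to pin a single version in the build (an sbt sketch; 1.6.1 is just a placeholder for whatever the cluster runs):

    // build.sbt -- compile against exactly the Spark version deployed on the
    // cluster, marked "provided" so the cluster's own jars are used at runtime.
    val sparkVersion = "1.6.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
    )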

Re: In Spark-scala, how to fill Vectors.dense in DataFrame from CSV?

2016-09-22 Thread Kevin Mellott
You'll want to use the spark-csv package, which is included in Spark 2.0. The repository documentation has some great usage examples. https://github.com/databricks/spark-csv Thanks, Kevin On Thu, Sep 22, 2016 at 8:40 PM, Dan Bikle wrote: > hello spark-world, > > I am new
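To connect that back to the original question, here is a sketch of going from CSV to a (label, features) DataFrame in Spark 2.0, where the CSV reader is built in (the path and column layout are hypothetical):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.functions.col

    // Headerless numeric CSV; columns come back named _c0, _c1, ...
    val raw = spark.read
      .option("inferSchema", "true")
      .csv("/path/to/training.csv")

    // Pack every column except the first (the label) into a feature vector.
    val assembled = new VectorAssembler()
      .setInputCols(raw.columns.tail)
      .setOutputCol("features")
      .transform(raw)

    val training = assembled
      .withColumn("label", col("_c0").cast("double"))
      .select("label", "features")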

Redshift Vs Spark SQL (Thrift)

2016-09-22 Thread ayan guha
Hi, Is there any benchmark or point of view on the pros and cons of AWS Redshift vs Spark SQL through STS? -- Best Regards, Ayan Guha

Re: Spark RDD and Memory

2016-09-22 Thread Aditya
Thanks for the reply. One more question: how does Spark handle data if it does not fit in memory? The answer I got is that it spills the data to disk to handle the memory pressure. Also, in the example below: val textFile = sc.textFile("/user/emp.txt") val textFile1 = sc.textFile("/user/emp1.txt")
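One way to make that behaviour explicit is to pick the storage level yourself (a sketch; MEMORY_AND_DISK spills partitions that do not fit rather than dropping them):

    import org.apache.spark.storage.StorageLevel

    val textFile = sc.textFile("/user/emp.txt")
    // Partitions that fit stay in memory; the rest are written to local disk
    // and read back instead of being recomputed from the source.
    textFile.persist(StorageLevel.MEMORY_AND_DISK)
    textFile.count()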

Fwd: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread sagarcasual .
Also, you mentioned the streaming-kafka-0-10 connector. What connector is this, and do you know the dependency? I did not see it mentioned in the documents. For the current Spark 1.6.1 to Kafka 0.10.0.1 standalone setup, the only dependencies I have are org.apache.spark:spark-core_2.10:1.6.1 compile group:
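For reference, the connector ships as a separate artifact per Kafka API generation (published coordinates, shown here as sbt lines):

    // Spark 1.6.x connector, built against the Kafka 0.8 consumer API:
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"

    // Spark 2.0.x connector for the native Kafka 0.10 consumer API:
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"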

Re: Is executor computing time affected by network latency?

2016-09-22 Thread Peter Figliozzi
It seems to me they must communicate for joins, sorts, grouping, and so forth, where the original data partitioning needs to change. You could repeat your experiment for different code snippets. I'll bet it depends on what you do. On Thu, Sep 22, 2016 at 8:54 AM, gusiri
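A quick way to test that is to time a narrow transformation against one that shuffles (a sketch; the data is arbitrary):

    // Narrow: each partition is computed locally, so no executor-to-executor traffic.
    sc.parallelize(1 to 1000000).map(_ * 2).count()

    // Wide: reduceByKey forces a shuffle, so executors exchange data over the
    // network and added latency shows up in the stage time.
    sc.parallelize(1 to 1000000)
      .map(n => (n % 100, n))
      .reduceByKey(_ + _)
      .count()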

Re: Is executor computing time affected by network latency?

2016-09-22 Thread Soumitra Johri
If your job involves a shuffle then the compute for the entire batch will increase with network latency. What would be interesting is to see how much time each task/job/stage takes. On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi wrote: > It seems to me they must

Re: very high maxresults setting (no collect())

2016-09-22 Thread Adrian Bridgett
Hi Michael, No spark upgrade, we've been changing some of our data pipelines so the data volumes have probably been getting a bit larger. Just in the last few weeks we've seen quite a few jobs needing a larger maxResultSize. Some jobs have gone from "fine with 1GB default" to 3GB.
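For anyone following the thread, the limit in question is spark.driver.maxResultSize, set like any other conf value (a sketch; 3g mirrors the figure above):

    import org.apache.spark.SparkConf

    // Equivalent to passing --conf spark.driver.maxResultSize=3g to spark-submit.
    val conf = new SparkConf()
      .set("spark.driver.maxResultSize", "3g")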

Re: Memory usage by Spark jobs

2016-09-22 Thread Jörn Franke
You should also take into account that Spark has different options to represent data in memory, such as Java serialized objects, Kryo serialized, Tungsten (columnar, optionally compressed), etc. The Tungsten representation depends heavily on the underlying data and sorting, especially if compressed. Then,
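As an illustration of switching the RDD representation, Kryo is opt-in via configuration (a sketch; MyRecord stands in for your own type):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, value: Double)  // hypothetical data type

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids writing the full class name into every record.
      .registerKryoClasses(Array(classOf[MyRecord]))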

Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Olivier Girardot
Hi, when we fetch Spark 2.0.0 as a Maven dependency we automatically end up with Hadoop 2.2 as a transitive dependency. I know multiple profiles are used to generate the different tar.gz bundles that we can download. Is there by any chance a publication of Spark 2.0.0 with a different classifier

Memory usage by Spark jobs

2016-09-22 Thread Hemant Bhanawat
I am working on profiling TPCH queries for Spark 2.0. I see a lot of temporary object creation (sometimes as much as the data size itself), which is justified for the kind of processing Spark does. But, from a production perspective, is there a guideline on how much memory should be allocated for

Re: Apache Spark JavaRDD pipe() need help

2016-09-22 Thread शशिकांत कुलकर्णी
Hello Jakob, Thanks for replying. Here is a short example of what I am trying, taking a Product column family in Cassandra just to explain my requirement. In Driver.java { JavaRDD productsRdd = Get Products from Cassandra;

Open source Spark based projects

2016-09-22 Thread tahirhn
I am planning to write a thesis on certain aspects (e.g. testing, performance optimisation, security) of Apache Spark. I need to study some projects that are based on Apache Spark and are available as open source. If you know any such project (an open-source Spark-based project), please share it

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Sean Owen
There can be just one published version of the Spark artifacts and they have to depend on something, though in truth they'd be binary-compatible with anything 2.2+. So you merely manage the dependency versions up to the desired version in your <dependencyManagement>. On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <
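In sbt terms that amounts to overriding the transitive Hadoop version (a sketch; Maven's dependencyManagement section achieves the same thing):

    // build.sbt -- pin the Hadoop client that Spark pulls in transitively.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "2.0.0" % "provided",
      "org.apache.hadoop" %  "hadoop-client" % "2.6.0"
    )
    dependencyOverrides ++= Set("org.apache.hadoop" % "hadoop-client" % "2.6.0")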

Re: Open source Spark based projects

2016-09-22 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects and maybe related ... https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark On Thu, Sep 22, 2016 at 11:15 AM, tahirhn wrote: > I am planning to write a thesis on certain aspects (i.e testing,

Re: Spark Application Log

2016-09-22 Thread Bedrytski Aliaksandr
Hi Divya, Have you tried the command yarn logs -applicationId application_x_? (where application_x_ is the id of the application and may be found in the output of the spark-submit command or in YARN's web UI) It will collect the logs from all the executors

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
This is probably the best way to manage it. On Thu, Sep 22, 2016 at 6:42 PM, Josh Rosen wrote: > Spark SQL / Tungsten's explicitly-managed off-heap memory will be capped > at spark.memory.offHeap.size bytes. This is purposely specified as an > absolute size rather than

Re: Apache Spark JavaRDD pipe() need help

2016-09-22 Thread Jakob Odersky
Hi Shashikant, I think you are trying to do too much at once in your helper class. Spark's RDD API is functional; it is meant to be used by writing many little transformations that will be distributed across a cluster. Apart from that, `rdd.pipe` seems like a good approach. Here is the relevant
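To make the pipe suggestion concrete, here is a minimal sketch (the external script and record format are hypothetical):

    // Each element of a partition is written to the external process's stdin,
    // one per line; the lines it prints to stdout become the resulting RDD.
    val products = sc.parallelize(Seq("p1,9.99", "p2,4.50"))
    val scored = products.pipe("/usr/local/bin/score-product.sh")
    scored.collect().foreach(println)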

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
I don't think I'd enable swap on a cluster. You'd rather processes fail than grind everything to a halt. You'd buy more memory or optimize memory before trading it for I/O. On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel wrote: > Ok… gotcha… wasn’t sure that YARN just

Re: Spark RDD and Memory

2016-09-22 Thread Mich Talebzadeh
Hi, unpersist works on storage memory, not execution memory. So I do not think you can flush it out of memory if you have not cached it in the first place, using cache or something like the following: s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY) s.unpersist I believe the recent versions

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
Well, off-heap memory will, from an OS perspective, be visible under the JVM process (you see the memory consumption of the JVM process growing when using off-heap memory). There is one exception: if there is another process which has not been started by the JVM and "lives" outside the JVM, but

Re: sqoop Imported and Hbase ImportTsv issue with Fled: No enum constant mapreduce.JobCounter.MB_MILLIS_MAPS

2016-09-22 Thread Mich Talebzadeh
Hi, With the benefit of hindsight, this thread should not have been posted in this forum but somewhere more appropriate, like the Sqoop user group. My motivation was that in all probability one would have got a speedier response in this active forum than by posting it somewhere else. I

Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread sagarcasual .
Hello, I am trying to stream data out of a Kafka cluster (2.11_0.10.0.1) using Spark 1.6.1. I am receiving the following error, and I confirmed that the topic to which I am trying to connect exists with the data. Any idea what could be the cause? kafka.common.UnknownTopicOrPartitionException at

Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-22 Thread andy petrella
Heya, I'd say if you wanna go the Spark and Scala way, do yourself a favour and go for the Spark Notebook (check http://spark-notebook.io/ for a pre-built distro or build your own). HTH On Thu, Sep 22, 2016 at 12:45 AM Arif,Mubaraka

Re: Open source Spark based projects

2016-09-22 Thread Sonal Goyal
https://spark-packages.org/ Thanks, Sonal Nube Technologies On Thu, Sep 22, 2016 at 3:48 PM, Sean Owen wrote: > https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects > and maybe related

Re: Hbase Connection not seraializible in Spark -> foreachrdd

2016-09-22 Thread KhajaAsmath Mohammed
Thanks Das and Ayan. Do you have any references on how to create a connection pool for HBase inside foreachPartition, as mentioned in the guide? In my case, I have to use a Kerberos-secured HBase cluster. On Wed, Sep 21, 2016 at 6:39 PM, Tathagata Das wrote: >
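For reference, the pattern from the guide looks roughly like this with the HBase 1.x client (a sketch; the table, column family, and data are placeholders, and a Kerberos setup would additionally need a UserGroupInformation login on each executor before creating the connection):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))
    rdd.foreachPartition { records =>
      // Created on the executor, so no non-serializable connection object
      // ever crosses from the driver to the workers.
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("my_table"))
      records.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
      conn.close()
    }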

Spark RDD and Memory

2016-09-22 Thread Aditya
Hi, Suppose I have two RDDs val textFile = sc.textFile("/user/emp.txt") val textFile1 = sc.textFile("/user/emp1.txt") Later I perform a join operation on the above two RDDs val join = textFile.join(textFile1) And there are subsequent transformations without including textFile and textFile1 further

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
Thanks for the response Sean. But how does YARN know about the off-heap memory usage? That’s the piece that I’m missing. Thx again, -Mike > On Sep 21, 2016, at 10:09 PM, Sean Owen wrote: > > No, Xmx only controls the maximum size of on-heap allocated memory. > The JVM

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
It's looking at the whole process's memory usage, and doesn't care whether the memory is used by the heap or not within the JVM. Of course, allocating memory off-heap still counts against you at the OS level. On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel wrote: >

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
I would disagree. While you can tune the system not to oversubscribe, I would rather have it hit swap than fail, especially on long-running jobs. If we look at oversubscription on Hadoop clusters which are not running HBase… they survive. It's when you have things like HBase that don't

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-22 Thread Cody Koeninger
Do you have the ability to try using Spark 2.0 with the streaming-kafka-0-10 connector? I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it would be good to rule that out. On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual . wrote: > Hello, > > I am trying to
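For completeness, the 0-10 connector's direct stream looks roughly like this (a sketch following the connector's documented API; broker, group id, and topic are hypothetical):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka010"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("myTopic"), kafkaParams))

    stream.map(_.value).print()
    ssc.start()
    ssc.awaitTermination()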

Re: Equivalent to --files for driver?

2016-09-22 Thread Jacek Laskowski
Hi Everett, I'd bet on --driver-class-path (but didn't check that out myself). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Sep 21, 2016 at 10:17 PM,

sqoop Imported and Hbase ImportTsv issue with Fled: No enum constant mapreduce.JobCounter.MB_MILLIS_MAPS

2016-09-22 Thread Mich Talebzadeh
Hi, I have been seeing errors at the OS level when running Sqoop import or HBase ImportTsv to get data into Hive and HBase respectively. The gist of the error is in the last line. 2016-09-22 10:49:39,472 [myid:] - INFO [main:Job@1356] - Job job_1474535924802_0003 completed successfully 2016-09-22

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Josh Rosen
Spark SQL / Tungsten's explicitly-managed off-heap memory will be capped at spark.memory.offHeap.size bytes. This is purposely specified as an absolute size rather than as a percentage of the heap size in order to allow end users to tune Spark so that its overall memory consumption stays within
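In configuration terms (a sketch; the 2 GiB figure is arbitrary):

    import org.apache.spark.SparkConf

    // Off-heap use must be enabled explicitly, and the size is an absolute
    // byte count rather than a fraction of the heap.
    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", (2L * 1024 * 1024 * 1024).toString)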

Is executor computing time affected by network latency?

2016-09-22 Thread gusiri
Hi, When I increase the network latency among Spark nodes, I see compute time (= executor computing time in the Spark Web UI) also increase. In the attached graph, left = latency 1 ms vs right = latency 500 ms. Is there any communication between worker and driver/master even 'during' executor

Re: Spark RDD and Memory

2016-09-22 Thread Hanumath Rao Maduri
Hello Aditya, After an intermediate action has been applied, you might want to call rdd.unpersist() to let Spark know that this RDD is no longer required. Thanks, -Hanu On Thu, Sep 22, 2016 at 7:54 AM, Aditya wrote: > Hi, > > Suppose I have two RDDs > val
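A sketch of that lifecycle, adapted from the snippet in the question (keyBy is added so the join compiles; the original joins plain text RDDs):

    val emp  = sc.textFile("/user/emp.txt").keyBy(_.split(",")(0)).cache()
    val emp1 = sc.textFile("/user/emp1.txt").keyBy(_.split(",")(0)).cache()

    val joined = emp.join(emp1).cache()
    joined.count()    // materialize the join while the inputs are still cached

    emp.unpersist()   // safe now: later actions read the cached join result
    emp1.unpersist()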

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
Ok… gotcha… wasn't sure whether YARN just looked at the heap size allocation and ignored the off-heap. WRT overall OS memory… this would be one reason why I'd keep a decent amount of swap around (maybe even putting it on a fast device like an M.2 or PCIe flash drive…). > On Sep 22, 2016, at