Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-28 Thread Romi Kuntsman
Did you try to cache a DataFrame with just a single row? Do your rows have any columns with null values? Can you post a code snippet here on how you load/generate the dataframe? Does dataframe.rdd.cache work? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Thu, Oct 29, 2015 at 4:33

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Romi Kuntsman
thrown from PixelObject? Are you running spark with master=local, so it's running inside your IDE and you can see the errors from the driver and worker? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Thu, Oct 29, 2015 at 10:04 AM, Zhang, Jingyu wrote: > Thanks Romi, > &g

Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-01 Thread Romi Kuntsman
(SparkContext.scala:103) at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501) at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005) at org.apache.spark.SparkContext.(SparkContext.scala:543) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61) Th

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Romi Kuntsman
except "spark.master", do you have "spark://" anywhere in your code or config files? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Nov 2, 2015 at 11:27 AM, Balachandar R.A. wrote: > > -- Forwarded message -- > From: "Bala

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
I noticed that toJavaRDD causes a computation on the DataFrame, so is it considered an action, even though logically it's a transformation? On Nov 4, 2015 6:51 PM, "Aliaksei Tsyvunchyk" wrote: > Hello folks, > > Recently I have noticed unexpectedly big network traffic between Driver > Program and

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
uce on dataFrame without causing it to load all data to > driver program ? > > On Nov 4, 2015, at 12:34 PM, Romi Kuntsman wrote: > > I noticed that toJavaRDD causes a computation on the DataFrame, so is it > considered an action, even though logically it's a transformat

Re: JMX with Spark

2015-11-05 Thread Romi Kuntsman
Have you read this? https://spark.apache.org/docs/latest/monitoring.html *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Thu, Nov 5, 2015 at 2:08 PM, Yogesh Vyas wrote: > Hi, > How we can use JMX and JConsole to monitor our Spark applic

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
If they have a problem managing memory, shouldn't there be an OOM? Why does AppClient throw an NPE? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das wrote: > Is that all you have in the executor logs? I suspect some of those jo

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Romi Kuntsman
(may be a network timeout etc) *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Nov 9, 2015 at 6:00 PM, Akhil Das wrote: > Did you find anything regarding the OOM in the executor logs? > > Thanks > Best Regards > > On Mon, Nov 9, 2015 at 8:44 PM, Romi Kun

Re: Shuffle FileNotFound Exception

2015-11-18 Thread Romi Kuntsman
I had many issues with shuffles (but not this one exactly), and what eventually solved it was to repartition the input into more parts. Have you tried that? P.S. not sure if related, but there's a memory leak in the shuffle mechanism https://issues.apache.org/jira/browse/SPARK-11293

Re: Shuffle FileNotFound Exception

2015-11-18 Thread Romi Kuntsman
take executor memory times spark.shuffle.memoryFraction and divide the data so that each partition is less than the above *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Wed, Nov 18, 2015 at 2:09 PM, Tom Arnfeld wrote: > Hi Romi, > > Thanks! Could you give me an indi
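A back-of-the-envelope sketch of the sizing rule above, in plain Java with no Spark dependency (the 0.2 default for spark.shuffle.memoryFraction is the Spark 1.x default; the method names are my own, for illustration):

```java
public class ShufflePartitionSizing {
    // Largest shuffle partition that should fit in the shuffle buffer:
    // executor memory times spark.shuffle.memoryFraction (default 0.2 in Spark 1.x).
    public static long maxPartitionBytes(long executorMemoryBytes, double shuffleMemoryFraction) {
        return (long) (executorMemoryBytes * shuffleMemoryFraction);
    }

    // Number of partitions needed so each partition stays under that limit (ceiling division).
    public static int partitionsFor(long totalShuffleBytes, long maxPartitionBytes) {
        return (int) Math.max(1, (totalShuffleBytes + maxPartitionBytes - 1) / maxPartitionBytes);
    }

    public static void main(String[] args) {
        long limit = maxPartitionBytes(4L * 1024 * 1024 * 1024, 0.2); // 4 GB executor -> ~819 MB
        System.out.println(partitionsFor(100L * 1024 * 1024 * 1024, limit)); // 100 GB of shuffle data
    }
}
```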

diff between apps and waitingApps?

2015-12-01 Thread Romi Kuntsman
a/org/apache/spark/deploy/master/MasterSource.scala What is the meaning of "waitingApps"? And if the only place it's used is in "startExecutorsOnWorkers" where they are filtered as "app.coresLeft > 0", shouldn't that also be filtered in the reported me

Spark Application stuck retrying task failed on Java heap space?

2015-07-21 Thread Romi Kuntsman
Hello, *TL;DR: task crashes with OOM, but application gets stuck in infinite loop retrying the task over and over again instead of failing fast.* Using Spark 1.4.0, standalone, with DataFrames on Java 7. I have an application that does some aggregations. I played around with shuffling settings, w

Re: Timestamp functions for sqlContext

2015-07-21 Thread Romi Kuntsman
Hi Tal, I'm not sure there is currently a built-in function for it, but you can easily define a UDF (user defined function) by extending org.apache.spark.sql.api.java.UDF1, registering it (sparkContext.udf().register(...)), and then use it inside your query. RK. On Tue, Jul 21, 2015 at 7:04 PM
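The body of such a UDF can be plain Java; a minimal sketch of a timestamp-formatting function one could place inside UDF1<Long, String>.call(...) and register via sparkContext.udf().register(...) (the function name and date format here are illustrative assumptions, not from the thread):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimestampUdf {
    // Formats epoch milliseconds as a yyyy-MM-dd string; this is the logic
    // you would wrap in org.apache.spark.sql.api.java.UDF1 before registering it.
    public static String toDateString(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(toDateString(0L)); // 1970-01-01
    }
}
```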

Applications metrics unseparatable from Master metrics?

2015-07-22 Thread Romi Kuntsman
Hi, I tried to enable Master metrics source (to get number of running/waiting applications etc), and connected it to Graphite. However, when these are enabled, application metrics are also sent. Is it possible to separate them, and send only master metrics without applications? I see that Master

Re: Scaling spark cluster for a running application

2015-07-22 Thread Romi Kuntsman
Are you running the Spark cluster in standalone or YARN? In standalone, the application gets the available resources when it starts. With YARN, you can try to turn on the setting *spark.dynamicAllocation.enabled* See https://spark.apache.org/docs/latest/configuration.html On Wed, Jul 22, 2015 at 2
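For reference, the relevant entries in spark-defaults.conf might look like this (values illustrative; note that dynamic allocation also requires the external shuffle service to be enabled):

```
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10
```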

Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
What is the throughput of processing and for how long do you need to remember duplicates? You can take all the events, put them in an RDD, group by the key, and then process each key only once. But if you have a long running application where you want to check that you didn't see the same value befor
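The group-by-key dedup described above, sketched with plain Java collections (keys and event values are hypothetical; in Spark this would be a groupByKey or reduceByKey on an RDD of pairs):

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupByKey {
    // Keep the first event seen for each key -- the same effect as grouping an
    // RDD of (key, event) pairs by key and then processing one record per group.
    public static <K, V> Map<K, V> firstPerKey(List<Map.Entry<K, V>> events) {
        Map<K, V> seen = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : events) {
            seen.putIfAbsent(e.getKey(), e.getValue());
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> events = Arrays.asList(
            new AbstractMap.SimpleEntry<>("user1", "click"),
            new AbstractMap.SimpleEntry<>("user1", "click-duplicate"),
            new AbstractMap.SimpleEntry<>("user2", "view"));
        System.out.println(firstPerKey(events)); // duplicates for user1 are dropped
    }
}
```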

Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
spark RDD not fit for this requirement? > > On Mon, Jul 27, 2015 at 1:08 PM, Romi Kuntsman wrote: > >> What the throughput of processing and for how long do you need to >> remember duplicates? >> >> You can take all the events, put them in an RDD, group by the key

How to overwrite partition when writing Parquet?

2015-08-19 Thread Romi Kuntsman
Hello, I have a DataFrame, with a date column which I want to use as a partition. Each day I want to write the data for the same date in Parquet, and then read a dataframe for a date range. I'm using: myDataframe.write().partitionBy("date").mode(SaveMode.Overwrite).parquet(parquetDir); If I use
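A common workaround for this (my assumption about the intent, not something stated in the thread): write each date's data directly into its own partition directory, so overwriting one date leaves the others intact, then rely on partition discovery when reading. The Hive-style directory name can be built like this:

```java
public class PartitionPath {
    // Hive/Spark partition-discovery layout: <base>/<column>=<value>
    public static String partitionDir(String baseDir, String column, String value) {
        return baseDir.endsWith("/") ? baseDir + column + "=" + value
                                     : baseDir + "/" + column + "=" + value;
    }

    public static void main(String[] args) {
        System.out.println(partitionDir("/data/parquet", "date", "2015-08-19"));
        // /data/parquet/date=2015-08-19
    }
}
```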

Re: Issues with S3 paths that contain colons

2015-08-19 Thread Romi Kuntsman
I had the exact same issue, and overcame it by overriding NativeS3FileSystem with my own class, where I replaced the implementation of globStatus. It's a hack but it works. Then I set the hadoop config fs.myschema.impl to my class name, and accessed the files through myschema:// instead of s3n://
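Once the custom FileSystem class is registered under fs.myschema.impl, the only change on the caller's side is rewriting the scheme of each path; a trivial helper for that (the scheme name "myschema" is the placeholder from the message above):

```java
public class SchemeSwap {
    // Rewrites an s3n:// URI to the custom scheme whose FileSystem implementation
    // (registered via the hadoop config key fs.<scheme>.impl) overrides globStatus
    // so that colons in the key no longer break path parsing.
    public static String toCustomScheme(String s3nPath, String customScheme) {
        if (!s3nPath.startsWith("s3n://")) {
            throw new IllegalArgumentException("expected an s3n:// path: " + s3nPath);
        }
        return customScheme + "://" + s3nPath.substring("s3n://".length());
    }

    public static void main(String[] args) {
        System.out.println(toCustomScheme("s3n://mybucket/dir/file:with:colons", "myschema"));
        // myschema://mybucket/dir/file:with:colons
    }
}
```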

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-19 Thread Romi Kuntsman
If you create a PairRDD from the DataFrame, using dataFrame.toRDD().mapToPair(), then you can call partitionBy(someCustomPartitioner) which will partition the RDD by the key (of the pair). Then the operations on it (like joining with another RDD) will consider this partitioning. I'm not sure that D
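The partitioning logic itself is small; a standalone sketch of what getPartition does (in Spark you would extend org.apache.spark.Partitioner and pass the instance to partitionBy -- this class deliberately has no Spark dependency):

```java
public class HashPartitionerSketch {
    private final int numPartitions;

    public HashPartitionerSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Same contract as Partitioner.getPartition: map a key to a partition
    // index in [0, numPartitions), deterministically, handling negative hashes.
    public int getPartition(Object key) {
        int h = key == null ? 0 : key.hashCode() % numPartitions;
        return h < 0 ? h + numPartitions : h;
    }

    public static void main(String[] args) {
        HashPartitionerSketch p = new HashPartitionerSketch(8);
        // The same join key always lands in the same partition, which is what
        // lets a subsequent join avoid reshuffling co-partitioned data.
        System.out.println(p.getPartition("some-join-key"));
    }
}
```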

Re: How to overwrite partition when writing Parquet?

2015-08-20 Thread Romi Kuntsman
tory of a particular partition > help? For directory structure, check this out... > > > http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery > > > On Wed, Aug 19, 2015 at 8:18 PM, Romi Kuntsman wrote: > >> Hello, >> >> I have a Da

How to remove worker node but let it finish first?

2015-08-23 Thread Romi Kuntsman
Hi, I have a spark standalone cluster with 100s of applications per day, and it changes size (more or less workers) at various hours. The driver runs on a separate machine outside the spark cluster. When a job is running and its worker is killed (because at that hour the number of workers is redu

Re: Exception when S3 path contains colons

2015-08-25 Thread Romi Kuntsman
Hello, We had the same problem. I've written a blog post with the detailed explanation and workaround: http://labs.totango.com/spark-read-file-with-colon/ Greetings, Romi K. On Tue, Aug 25, 2015 at 2:47 PM Gourav Sengupta wrote: > I am not quite sure about this but should the notation not be

Re: How to remove worker node but let it finish first?

2015-08-29 Thread Romi Kuntsman
://mesos.apache.org/documentation/latest/app-framework-development-guide/ > > Thanks > Best Regards > > On Mon, Aug 24, 2015 at 12:11 PM, Romi Kuntsman wrote: > >> Hi, >> I have a spark standalone cluster with 100s of applications per day, and >> it changes size (more or

How to determine the value for spark.sql.shuffle.partitions?

2015-09-01 Thread Romi Kuntsman
Hi all, The number of partitions greatly affects the speed and efficiency of calculation, in my case in DataFrames/SparkSQL on Spark 1.4.0. Too few partitions with large data cause OOM exceptions. Too many partitions on small data cause a delay due to overhead. How do you programmatically determin
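One heuristic that addresses the question above (an assumption of mine, not an official formula): aim for a fixed target partition size, keep at least a couple of partitions per core, and cap the total; the result can then be applied with sqlContext.setConf("spark.sql.shuffle.partitions", ...):

```java
public class ShufflePartitionHeuristic {
    // Aim for ~128 MB per partition, never fewer than 2 * totalCores
    // (so every core stays busy), never more than 4096 (scheduling overhead).
    public static int suggestPartitions(long inputBytes, int totalCores) {
        long targetBytes = 128L * 1024 * 1024;
        long byData = (inputBytes + targetBytes - 1) / targetBytes; // ceiling division
        long floor = 2L * totalCores;
        return (int) Math.min(4096, Math.max(floor, byData));
    }

    public static void main(String[] args) {
        System.out.println(suggestPartitions(10L * 1024 * 1024 * 1024, 8)); // 10 GB on 8 cores
    }
}
```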

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Romi Kuntsman
RDD is a set of data rows (in your case numbers), there is no meaning for the order of the items. What exactly are you trying to accomplish? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 2015 at 2:29 PM, Zhiliang Zhu wrote: > Dear , > > I have took lots o

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
sparkContext is available on the driver, not on executors. To read from Cassandra, you can use something like this: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 2015 at 2:27 PM

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
foreach is something that runs on the driver, not the workers. if you want to perform some function on each record from cassandra, you need to do cassandraRdd.map(func), which will run distributed on the spark workers *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
, Sep 21, 2015 at 5:31 PM Cody Koeninger wrote: > That isn't accurate, I think you're confused about foreach. > > Look at > > > http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd > > > On Mon, Sep

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Romi Kuntsman
Hi, If I understand correctly: rdd1 contains keys (of type StringDate) rdd2 contains keys and values and rdd3 contains all the keys, and the values from rdd2? I think you should make rdd1 and rdd2 PairRDD, and then use outer join. Does that make sense? On Mon, Sep 21, 2015 at 8:37 PM Zhiliang Zhu
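The outer-join semantics suggested above, sketched with plain Java maps (in Spark this would be JavaPairRDD.leftOuterJoin, which yields an Optional for the missing side; keys and values here are made up):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class LeftOuterJoinSketch {
    // For every key in 'left' (rdd1's keys), attach the value from 'right'
    // (rdd2) when present, or null when absent -- the shape described for rdd3.
    public static Map<String, String> leftOuterJoin(Set<String> left, Map<String, String> right) {
        Map<String, String> joined = new LinkedHashMap<>();
        for (String key : left) {
            joined.put(key, right.get(key)); // null when the key has no match
        }
        return joined;
    }

    public static void main(String[] args) {
        Set<String> keys = new LinkedHashSet<>(Arrays.asList("2015-09-01", "2015-09-02"));
        Map<String, String> values = new HashMap<>();
        values.put("2015-09-01", "42");
        System.out.println(leftOuterJoin(keys, values)); // second key joins to null
    }
}
```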

Israel Spark Meetup

2016-09-20 Thread Romi Kuntsman
Hello, Please add a link in Spark Community page ( https://spark.apache.org/community.html) To Israel Spark Meetup (https://www.meetup.com/israel-spark-users/) We're an active meetup group, unifying the local Spark user community, and having regular meetups. Thanks! Romi K.

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-19 Thread Romi Kuntsman
I have recently encountered a similar problem with Guava version collision with Hadoop. Isn't it more correct to upgrade Hadoop to use the latest Guava? Why are they staying in version 11, does anyone know? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Wed, Jan 7, 2015 at

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-19 Thread Romi Kuntsman
Actually there is already someone on Hadoop-Common-Dev taking care of removing the old Guava dependency http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201501.mbox/browser https://issues.apache.org/jira/browse/HADOOP-11470 *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com

Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Romi Kuntsman
master. Using Spark 1.1.0. What if a master server is restarted, should worker retry to register on it? Greetings, -- *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com ​Join the Customer Success Manifesto <http://youtu.be/XvFi2Wh6wgU>

Spark job resource allocation best practices

2014-11-03 Thread Romi Kuntsman
y jobs run together, and together lets them use all the available resources? - How do you divide resources between applications on your usecase? P.S. I started reading about Mesos but couldn't figure out if/how it could solve the described issue. Thanks! *Romi Kuntsman*, *Big Data

Re: Spark job resource allocation best practices

2014-11-03 Thread Romi Kuntsman
ssible to divide the resources between them, according to how many are trying to run at the same time? So for example if I have 12 cores - if one job is scheduled, it will get 12 cores, but if 3 are scheduled, then each one will get 4 cores and then will all start. Thanks! *Romi Kuntsman*, *Big

Re: Dynamically switching Nr of allocated core

2014-11-03 Thread Romi Kuntsman
tart app A - it needs just 2 cores (as you said it will get even when there are 12 available), but gets nothing 4 - Until I stop app B, app A is stuck waiting, instead of app B freeing 2 cores and dropping to 10 cores. *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Nov 3,

Re: Spark job resource allocation best practices

2014-11-04 Thread Romi Kuntsman
How can I configure Mesos allocation policy to share resources between all current Spark applications? I can't seem to find it in the architecture docs. *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Tue, Nov 4, 2014 at 9:11 AM, Akhil Das wrote: > Yes. i believe Meso

Re: Spark job resource allocation best practices

2014-11-04 Thread Romi Kuntsman
I have a single Spark cluster, not multiple frameworks and not multiple versions. Is it relevant for my use-case? Where can I find information about exactly how to make Mesos tell Spark how many resources of the cluster to use? (instead of the default take-all) *Romi Kuntsman*, *Big Data Engineer

Re: Spark job resource allocation best practices

2014-11-04 Thread Romi Kuntsman
Let's say that I run Spark on Mesos in fine-grained mode, and I have 12 cores and 64GB memory. I run application A on Spark, and some time after that (but before A finished) application B. How many CPUs will each of them get? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com O

Snappy temp files not cleaned up

2014-11-05 Thread Romi Kuntsman
? Missing feature? How do you deal with build-up of temp files? Thanks, *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com

Task time measurement

2014-11-11 Thread Romi Kuntsman
Hello, Currently in Spark standalone console, I can only see how long the entire job took. How can I know how long it was in WAITING and how long in RUNNING, and also when running, how much each of the jobs inside took? Thanks, *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com

ExternalAppendOnlyMap: Thread spilling in-memory map of to disk many times slowly

2014-11-24 Thread Romi Kuntsman
mory map of 12 MB to disk (36 times so far) 14/11/24 13:13:45 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 11 MB to disk (37 times so far) 14/11/24 13:13:56 INFO FileOutputCommitter: Saved output of task 'attempt_201411241250__m_00_90' to s3n://mybucket/mydir/outp

ExternalAppendOnlyMap: Thread spilling in-memory map of to disk many times slowly

2014-11-26 Thread Romi Kuntsman
mory map of 12 MB to disk (36 times so far) 14/11/24 13:13:45 INFO ExternalAppendOnlyMap: Thread 64 spilling in-memory map of 11 MB to disk (37 times so far) 14/11/24 13:13:56 INFO FileOutputCommitter: Saved output of task 'attempt_201411241250__m_00_90' to s3n://mybucket/mydir/outp

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Romi Kuntsman
About version compatibility and upgrade path - can the Java application dependencies and the Spark server be upgraded separately (i.e. will 1.1.0 library work with 1.1.1 server, and vice versa), or do they need to be upgraded together? Thanks! *Romi Kuntsman*, *Big Data Engineer* http