Spark 1.0.2 failover doesn't port running application context to new master

2015-03-17 Thread Nirav Patel
We have a spark 1.0.2 cluster with 3 nodes under an HA setup using ZooKeeper. We have a long-running, self-contained spark service that serves on-demand requests. I tried a failover test by killing the spark master to see if our application gets ported over to the new master. Looks like killing master doesn't

JavaSparkContext - jarOfClass or jarOfObject don't work

2015-03-11 Thread Nirav Patel
Hi, I am trying to run my spark service against a cluster. As it turns out I have to call setJars and set my application jar in there. If I do it using a physical path like the following it works: `conf.setJars(new String[]{/path/to/jar/Sample.jar});` but if I try to use JavaSparkContext (or SparkContext)
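
A minimal sketch of the intent here (Scala; SampleService is a placeholder for the class shipped inside Sample.jar, and jarOfClass returns an empty result when the class is not loaded from a plain jar):

  import org.apache.spark.{SparkConf, SparkContext}

  // Resolve the jar containing the application class instead of hard-coding a path.
  val conf = new SparkConf()
    .setAppName("sample-service")
    .setJars(SparkContext.jarOfClass(classOf[SampleService]).toSeq)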

Creating RDD from Iterable from groupByKey results

2015-06-15 Thread Nirav Patel
I am trying to create new RDDs based on a given PairRDD. I have a PairRDD with few keys, but each key has a large number (about 100k) of values. I want to somehow repartition and make each `Iterable[v]` into an RDD[v] so that I can further apply map, reduce, sortBy etc. effectively on those values. I am sensing

Spark executor jvm classloader not able to load nested jars

2015-11-02 Thread Nirav Patel
Hi, I have a maven-based mixed scala/java application that can submit spark jobs. My application jar "myapp.jar" has some nested jars inside a lib folder. It's a fat jar created using the spring-boot-maven plugin, which nests other jars inside the lib folder of the parent jar. I prefer not to create a shaded flat jar

Spark 1.3.1 - Does SparkContext in multi-threaded env require SparkEnv.set(env) anymore

2015-12-10 Thread Nirav Patel
As the subject says, do we still need to use the static env in every thread that accesses sparkContext? I read some ref here. http://qnalist.com/questions/4956211/is-spark-context-in-local-mode-thread-safe

Spark dynamic allocation - efficiently request new resource

2016-06-07 Thread Nirav Patel
Hi, does current or future (2.0) spark dynamic allocation have the capability to request a container with varying resource requirements based on various factors? A few factors I can think of: based on the stage and the data it's processing, it could ask for either more CPUs or more memory, i.e. a new executor can have

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
n't be particularly easy. > > On Wed, May 25, 2016 at 5:28 PM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> It's great that spark scheduler does optimized DAG processing and only >> does lazy eval when some action is performed or shuffle dependency is >

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
obs, > Stages, TaskSets and Tasks -- and when you start talking about Datasets and > Spark SQL, you then needing to start talking about tracking and mapping > concepts like Plans, Schemas and Queries. It would introduce significant > new complexity. > > On Wed, May 25, 2016 at 6:59 PM, N

Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
It's great that the spark scheduler does optimized DAG processing and only does lazy eval when some action is performed or a shuffle dependency is encountered. Sometimes it goes even further past the shuffle dep before executing anything, e.g. if there are map steps after the shuffle then it doesn't stop at the shuffle

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-30 Thread Nirav Patel
exception. On Sun, May 29, 2016 at 11:26 PM, sjk <shijinkui...@163.com> wrote: > org.apache.hadoop.hbase.client.{Mutation, Put} > org.apache.hadoop.hbase.io.ImmutableBytesWritable > > if u used mutation, register the above class too > > On May 30, 2016, at 08:11, Nirav Pate

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
rsion are you using ? > > Thanks > > On Sun, May 29, 2016 at 4:26 PM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> Hi, >> >> I am getting following Kryo deserialization error when trying to buklload >> Cached RDD into Hbase. It works

Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
Hi, I am getting the following Kryo deserialization error when trying to bulk load a cached RDD into Hbase. It works if I don't cache the RDD. I cache it with MEMORY_ONLY_SER. Here's the code snippet: hbaseRdd.values.foreachPartition{ itr => val hConf = HBaseConfiguration.create()
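
Following the suggestion in the reply above, one likely fix is registering the HBase write types with Kryo up front; a sketch (class list taken from the reply, not verified against a specific HBase version):

  import org.apache.hadoop.hbase.client.{Mutation, Put}
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(
      classOf[Put], classOf[Mutation], classOf[ImmutableBytesWritable]))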

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
29, 2016, at 4:58 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > > I pasted code snipped for that method. > > here's full def: > > def writeRddToHBase2(hbaseRdd: RDD[(ImmutableBytesWritable, Put)], > tableName: String) { > > > hbase

Re: FullOuterJoin on Spark

2016-06-22 Thread Nirav Patel
Can your domain list fit in the memory of one executor? If so you can use a broadcast join. You can always narrow down to an inner join and derive the rest from the original set if memory is an issue there. If you are just concerned about shuffle memory, then to reduce the amount of shuffle you can do the following: 1)
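
A sketch of the broadcast-and-filter idea described above (domainRdd and factRdd are hypothetical names for the small and large sides):

  // Ship the small domain key set to every executor, then narrow the large RDD
  // locally; the large side is never shuffled.
  val domainKeys = sc.broadcast(domainRdd.map(_._1).collect().toSet)
  val inner = factRdd.filter { case (k, _) => domainKeys.value.contains(k) }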

Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-21 Thread Nirav Patel
I have an RDD[String, MyObj] which is the result of a Join + Map operation. It has no partitioner info. I run reduceByKey without passing any Partitioner or partition counts. I observed that the output aggregation result for a given key is incorrect sometimes, like 1 out of 5 times. It looks like reduce

Re: spark job automatically killed without rhyme or reason

2016-06-22 Thread Nirav Patel
Spark is a memory hogger and suicidal if you have a job processing a bigger dataset. However Databricks claims that spark > 1.6 has optimizations related to memory footprint as well as processing. They are only available if you use DataFrame or Dataset. If you are using RDD you have to do a lot of

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
incorrect result, did you observe any error (on > workers) ? > > Cheers > > On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> I have an RDD[String, MyObj] which is a result of Join + Map operation. >> It has no partiti

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
? On Wed, Jun 22, 2016 at 11:52 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > Hi, > > I do not see any indication of errors or executor getting killed in spark > UI - jobs, stages, event timelines. No task failures. I also don't see any > errors in executor logs. > > Than

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Nirav Patel
e > maximum available limit. So the other options are > > 1) Separate the driver from master, i.e., run them on two separate nodes > 2) Increase the RAM capacity on the driver/master node. > > Regards, > Raghava. > > > On Wed, Jun 22, 2016 at 7:05 PM, Nirav Patel <

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Nirav Patel
Yes, the driver keeps a fair amount of metadata to manage scheduling across all your executors. I assume with 64 nodes you have more executors as well. A simple way to test is to increase driver memory. On Wed, Jun 22, 2016 at 10:10 AM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > It is an

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-23 Thread Nirav Patel
So it was indeed my merge function. I created a new result object for every merge and it's working now. Thanks On Wed, Jun 22, 2016 at 3:46 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > PS. In my reduceByKey operation I have two mutable object. What I do is > merge mutable2 i
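
For reference, the shape of the fix described here: the reduceByKey merge function should return a fresh result rather than mutating either input (MyObj.combine is a placeholder for the poster's merge logic):

  val reduced = pairRdd.reduceByKey { (a, b) =>
    // Build a brand-new object for every merge; never modify a or b in place,
    // since the same instances may be reused across merges within a partition.
    MyObj.combine(a, b)
  }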

Re: Too many open files, why changing ulimit not effecting?

2016-02-05 Thread Nirav Patel
For CentOS there's also /etc/security/limits.d/90-nproc.conf that may need modifications. Services that you expect to use the new limits need to be restarted. The simplest thing to do is to reboot the machine. On Fri, Feb 5, 2016 at 3:59 AM, Ted Yu wrote: > bq. and *"session

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
hy they have these optimizations in catalyst. RDD is > simply no longer the focus. > On Feb 2, 2016 7:17 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote: > >> so latest optimizations done on spark 1.4 and 1.5 releases are mostly >> from project Tungsten. Docs says it usues sun

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
because "beauty is in the eye of the > beholder" LOL > Regarding the comment on error prone, can you say why you think it is the > case? Relative to what other ways? > > Best Regards, > > Jerry > > > On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactl

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
cutor.memory would be the > max memory that your executor might have. But the memory that you get is > less then that. I don’t clearly remember but i think its either memory/2 or > memory/4. But I may be wrong as I have been out of spark for months. > > On Feb 3, 2016, at 2:58 PM, Nirav Pa

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
with Spark > <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/> > > > > *From:* Nirav Patel [mailto:npa...@xactlycorp.com] > *Sent:* Wednesday, February 3, 2016 11:31 AM > *To:* Stefan Panayotov > *Cc:* Jim Green; Ted Yu; Jakob Oders

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Nirav Patel
b 3, 2016, at 1:02 PM, Marcelo Vanzin <van...@cloudera.com> wrote: > > Yes, but you don't necessarily need to use dynamic allocation (just enable > the external shuffle service). > > On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >>

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Nirav Patel
tely not all implementations are >> available. for example i would like to use joins where one side is >> streaming (and the other cached). this seems to be available for DataFrame >> but not for RDD. >> >> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <npa...@

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
About the OP: how many cores do you assign per executor? Maybe reducing that number will give a bigger portion of executor memory to each task being executed on that executor. Others please comment if that makes sense. On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel <npa...@xactlycorp.com> wrote: &g

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
wrote: > There is also (deprecated) spark.storage.unrollFraction to consider > > On Wed, Feb 3, 2016 at 2:21 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > >> What I meant is executor.cores and task.cpus can dictate how many >> parallel tasks will run on given exe

Re: Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-30 Thread Nirav Patel
1.3.1 artifact / dependency leaked into your > app ? > > Cheers > > On Thu, Jan 28, 2016 at 7:36 PM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> Hi, we were using spark 1.3.1 and launching our spark jobs on yarn-client >> mode programmatically via cr

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
Hi Stefan, welcome to the OOM - heap space club. I have been struggling with similar errors (OOM and the yarn executor being killed) and failing jobs or jobs sent into retry loops. I bet the same job would run perfectly fine with fewer resources as a Hadoop MapReduce program. I have tested it for my program

Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Nirav Patel
Hi, I have a spark job running in yarn-client mode. At some point during the Join stage, an executor (container) runs out of memory and yarn kills it. Due to this the entire job restarts, and it keeps doing so on every failure. What is the best way to checkpoint? I see there's a checkpoint api and other

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
has types too! On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > I haven't gone through much details of spark catalyst optimizer and > tungston project but we have been advised by databricks support to use > DataFrame to resolve issues with OOM error that

Spark 1.5.2 - are new Project Tungsten optimizations available on RDD as well?

2016-02-02 Thread Nirav Patel
Hi, I read the release notes and a few slideshares on the latest optimizations done in the spark 1.4 and 1.5 releases, part of which are optimizations from project Tungsten. The docs say it uses sun.misc.Unsafe to convert the physical rdd structure into a byte array before shuffle for optimized GC and memory. My

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
I don't understand why one thinks an RDD of case objects doesn't have types (schema). If spark can convert an RDD to a DataFrame, that means it understood the schema. So from that point on, why does one have to use SQL features to do further processing? If all spark needs for optimizations is a schema then what

Re: Spark 1.5.2 Yarn Application Master - resiliencey

2016-02-03 Thread Nirav Patel
But a simple way to improve things is to install the > Spark shuffle service on the YARN nodes, so that even if an executor > crashes, its shuffle output is still available to other executors. > > On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel <npa...@xactlycorp.com> > wrote:

Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Nirav Patel
Hi, we were using spark 1.3.1 and launching our spark jobs in yarn-client mode programmatically by creating SparkConf and SparkContext objects manually. It was inspired by the spark self-contained application example here:

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Nirav Patel
lem might be due to some race > conditions in exit period. The way you mentioned is still valid, this > problem only occurs when stopping the application. > > Thanks > Saisai > > On Fri, Jan 29, 2016 at 10:22 AM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >>

Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-28 Thread Nirav Patel
Hi, we were using spark 1.3.1 and launching our spark jobs in yarn-client mode programmatically by creating SparkConf and SparkContext objects manually. It was inspired by the spark self-contained application example here:
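
For context, the pre-1.5 pattern being described looked roughly like this (paths and app name are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("yarn-client")            // Spark 1.x yarn-client master string
    .setAppName("my-app")
    .setJars(Seq("/path/to/myapp.jar"))  // placeholder path
  val sc = new SparkContext(conf)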

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-24 Thread Nirav Patel
com> wrote: > > On 17 Feb 2016, at 01:29, Nirav Patel <npa...@xactlycorp.com> wrote: > > I think you are not getting my question . I know how to tune executor > memory settings and parallelism . That's not an issue. It's a specific > question about what happens when

How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-01-23 Thread Nirav Patel
The problem is I have an RDD of about 10M rows and it keeps growing. Every time we want to query and compute on a subset of the data we have to use filter and then some aggregation. I know filter goes through each partition and every row of the RDD, which may not be efficient at all. Spark

Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
Hi, perhaps I should write a blog about why spark is focusing more on making it easier to write spark jobs while hiding underlying performance optimization details from seasoned spark users. It's one thing to provide such an abstract framework that does the optimization for you so you don't have to

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
if you want > to write your own optimizations based on your own knowledge of the data > types and semantics that are hiding in your raw RDDs, there's no reason > that you can't do that. > > On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com> > wrote:

Re: Regarding Off-heap memory

2016-01-26 Thread Nirav Patel
From my experience with spark 1.3.1 you can also set spark.executor.memoryOverhead to about 7-10% of your spark.executor.memory, the total of which will be requested for a Yarn container. On Tue, Jan 26, 2016 at 4:20 AM, Xiaoyu Ma wrote: > Hi all, > I saw spark 1.6 has
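
A sketch of that rule of thumb using the Spark 1.x property name (spark.yarn.executor.memoryOverhead, value in MB); the numbers are only illustrative:

  val conf = new SparkConf()
    .set("spark.executor.memory", "15g")
    // roughly 10% of executor memory; YARN container size = executor memory + overhead
    .set("spark.yarn.executor.memoryOverhead", "1536")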

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-16 Thread Nirav Patel
them > out. Just increase the executor memory. Also considering increasing the > parallelism ie the number of partitions. > > Regards > Sab > >> On 11-Feb-2016 5:46 am, "Nirav Patel" <npa...@xactlycorp.com> wrote: >> In Yarn we have following

Re: Spark executor Memory profiling

2016-02-20 Thread Nirav Patel
yarn.executor.memoryOverhead","4000") > > conf = conf.set("spark.executor.cores","4").set("spark.executor.memory", > "15G").set("spark.executor.instances","6") > > Is it also possible to use reduceBy in

Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-10 Thread Nirav Patel
In Yarn we have the following settings enabled so that jobs can use virtual memory to get capacity beyond physical memory, of course. yarn.nodemanager.vmem-check-enabled false yarn.nodemanager.pmem-check-enabled false The vmem to pmem ratio is 2:1. However spark

Spark executor Memory profiling

2016-02-10 Thread Nirav Patel
We have been trying to solve a memory issue with a spark job that processes 150GB of data (on disk). It does a groupBy operation; some of the executors will receive somewhere around 2-4M scala case objects to work with. We are using the following spark config: "executorInstances": "15",

Re: Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
d be > "OufOfMemoryError: Direct Buffer Memory" or something else. > > On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > >> Hi, >> >> We are using spark 1.5.2 or yarn. We have a spark application utilizing >> about 15

Compress individual RDD

2016-03-15 Thread Nirav Patel
Hi, I see that there's the following spark config to compress an RDD. My guess is it will compress all RDDs of a given SparkContext, right? If so, is there a way to instruct the spark context to only compress some rdds and leave others uncompressed? Thanks spark.rdd.compress false Whether to compress

Re: Compress individual RDD

2016-03-15 Thread Nirav Patel
compress only rdds with serialization enabled in the persistence > mode. So you could skip _SER modes for your other rdds. Not perfect but > something. > On 15-Mar-2016 4:33 pm, "Nirav Patel" <npa...@xactlycorp.com> wrote: > >> Hi, >> >> I see that there
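
A sketch of the workaround suggested in this reply: spark.rdd.compress only affects serialized storage levels, so persist the RDDs you want compressed with a _SER level and keep the others deserialized (bigRdd and hotRdd are placeholders):

  import org.apache.spark.storage.StorageLevel

  // assumes spark.rdd.compress=true was set on the SparkConf before the context was created
  bigRdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized, hence compressed
  hotRdd.persist(StorageLevel.MEMORY_ONLY)      // deserialized, left uncompressed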

Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-01 Thread Nirav Patel
Hi, I have a spark job that runs on yarn and keeps failing at the line where I do: val hConf = HBaseConfiguration.create hConf.setInt("hbase.client.scanner.caching", 1) hConf.setBoolean("hbase.cluster.distributed", true) new PairRDDFunctions(hbaseRdd).saveAsHadoopDataset(jobConfig)
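
A plausible completion of that snippet, assuming the old-API HBase TableOutputFormat (the table name is a placeholder):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.mapred.TableOutputFormat
  import org.apache.hadoop.mapred.JobConf
  import org.apache.spark.rdd.PairRDDFunctions

  val hConf = HBaseConfiguration.create()
  hConf.setInt("hbase.client.scanner.caching", 1)
  hConf.setBoolean("hbase.cluster.distributed", true)
  val jobConfig = new JobConf(hConf)
  jobConfig.setOutputFormat(classOf[TableOutputFormat])
  jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "my_table")  // placeholder table name
  new PairRDDFunctions(hbaseRdd).saveAsHadoopDataset(jobConfig)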

Re: Spark executor Memory profiling

2016-03-01 Thread Nirav Patel
/>, > cheatsheet for tuning spark <http://techsuppdiva.github.io/spark1.6.html> > . > > Hope this helps, keep the community posted what resolved your issue if it > does. > > Thanks. > > Kuchekar, Nilesh > > On Sat, Feb 20, 2016 at 11:29 AM, Nirav Patel <npa...@xa

Re: Spark executor killed without apparent reason

2016-03-03 Thread Nirav Patel
as I manually killed application at some point after too many executors were getting killed. " ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM" Thanks On Wed, Mar 2, 2016 at 8:22 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > I think that was due to manually

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
So why does 'saveAsHadoopDataset' incur so much memory pressure? Should I try to reduce the hbase caching value? On Wed, Mar 2, 2016 at 7:51 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > Hi, > > I have a spark jobs that runs on yarn and keeps failing at line where i do :

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
gt;> How wide are the rows in hbase table ? >> >> Thanks >> >> On Mar 3, 2016, at 1:01 AM, Nirav Patel <npa...@xactlycorp.com> wrote: >> >> so why does 'saveAsHadoopDataset' incurs so much memory pressure? Should >> I try to reduce hbase caching v

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
ndeed was high. > You can use binary search to get to a reasonable value for caching. > > Thanks > > On Thu, Mar 3, 2016 at 7:52 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > >> Hi Ted, >> >> I'd say about 70th percentile keys have 2 columns each having

Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
Hi, we are using spark 1.5.2 on yarn. We have a spark application utilizing about 15GB executor memory and 1500 overhead. However, at a certain stage we notice higher GC time spent (almost the same as task time). These executors are bound to get killed at some point. However, the nodemanager or resource

spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-03-30 Thread Nirav Patel
Hi, I am trying to use the filterByRange feature of spark's OrderedRDDFunctions in the hope that it will speed up filtering by scanning only the required partitions. I have created a Paired RDD with a RangePartitioner in one scala class, and in another class I am trying to access this RDD and do the following: In

Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-04-02 Thread Nirav Patel
mally use is to zipWithIndex() and then use the filter >> operation. Filter is an O(m) operation where m is the size of your >> partition, not an O(N) operation. >> >> -Ilya Ganelin >> >> On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com> >

Multiple lookups; consolidate result and run further aggregations

2016-04-02 Thread Nirav Patel
I will start with a question: is the spark lookup function on a pair rdd a driver action, i.e. is the result returned to the driver? I have a list of keys on the driver side and I want to perform multiple parallel lookups on a pair rdd, which returns Seq[V]; consolidate the results; and perform further

Re: spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-04-02 Thread Nirav Patel
DFunctions[K, V, (K, V)](rdd).sortByKey() > > See core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala > > On Wed, Mar 30, 2016 at 5:20 AM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> Hi, I am trying to use filterByRange feature of spark OrderedRD
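
The workaround quoted in this reply, spelled out as a sketch (MyKey/MyData are the types from the post above; assumes an implicit Ordering[MyKey] is in scope):

  import org.apache.spark.rdd.OrderedRDDFunctions

  // pairRdd: RDD[(MyKey, MyData)] is the range-partitioned RDD from the post above.
  val ordered = new OrderedRDDFunctions[MyKey, MyData, (MyKey, MyData)](pairRdd)
  val slice = ordered.filterByRange(lowKey, highKey)  // with a RangePartitioner, only covering partitions are scanned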

SparkDriver throwing java.lang.OutOfMemoryError: Java heap space

2016-04-04 Thread Nirav Patel
Hi, we are using spark 1.5.2 and recently hit this issue after our dataset grew from 140GB to 160GB. The error is thrown during shuffle fetch on the reduce side, which should all happen on executors, and executors should report it! However it gets reported only on the driver. SparkContext gets shut down

GroupByKey Iterable[T] - Memory usage

2016-04-25 Thread Nirav Patel
Hi, is the Iterable coming out of GroupByKey loaded fully into the memory of the reducer task or can it also be on disk? Also, is there a way to evacuate it from memory once the reducer is done iterating it and wants to use the memory for something else? Thanks

Re: aggregateByKey - external combine function

2016-04-29 Thread Nirav Patel
Any thoughts? I can explain more about the problem, but basically the shuffle data doesn't seem to fit in reducer memory (32GB) and I am looking for ways to process it on disk+memory. Thanks On Thu, Apr 28, 2016 at 10:07 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > Hi, > > I t

aggregateByKey - external combine function

2016-04-28 Thread Nirav Patel
Hi, I tried to convert a groupByKey operation to aggregateByKey in the hope of avoiding memory and high GC issues when dealing with 200GB of data. I needed to create a Collection of resulting key-value pairs which represent all combinations of a given key. My merge function definition is as follows: private
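
For comparison, the general shape of aggregateByKey with a non-mutating merge (the value type and combination logic here are simplified placeholders, not the poster's):

  // Collect all values per key into an immutable List without using groupByKey.
  val zero = List.empty[String]
  val combined = pairRdd.aggregateByKey(zero)(
    (acc, v) => v :: acc,  // seqOp: fold one value into the per-partition accumulator
    (a, b)  => a ::: b     // combOp: merge accumulators from different partitions
  )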

Spark UI metrics - Task execution time and number of records processed

2016-05-18 Thread Nirav Patel
Hi, I have been noticing that for shuffled tasks (groupBy, Join) the reducer tasks are not evenly loaded. Most of them (90%) finish super fast, but there are some outliers that take much longer, as you can see from the "Max" value in the following metric. The metric is from a Join operation done on two RDDs. I

API to study key cardinality and distribution and other important statistics about data at certain stage

2016-05-13 Thread Nirav Patel
Hi, the problem is every time a job fails or performs poorly at certain stages you need to study your data distribution just before THAT stage. An overall look at the input data set doesn't help very much if you have so many transformations going on in the DAG. I always end up writing complicated typed code to run

How to take executor memory dump

2016-05-11 Thread Nirav Patel
Hi, I am hitting OutOfMemoryError issues with spark executors. It happens mainly during shuffle. Executors get killed with OutOfMemoryError. I have tried setting spark.executor.extraJavaOptions to take a memory dump but it's not happening. spark.executor.extraJavaOptions = "-XX:+UseCompressedOops
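
One commonly used set of HotSpot flags for this (the dump path is a placeholder, must exist on every NodeManager host, and has to be set at submit time since executor JVM options cannot be changed afterwards):

  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
      "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark_executor_dumps")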

Spark 1.5.2 Shuffle Blocks - running out of memory

2016-05-03 Thread Nirav Patel
Hi, my spark application is getting killed abruptly during a groupBy operation where a shuffle happens. All shuffles happen with PROCESS_LOCAL locality. I see the following in the driver logs. Shouldn't these logs be in the executors? Anyhow it looks like a ByteBuffer is running out of memory. What would be a workaround

Re: Spark 1.5.2 Shuffle Blocks - running out of memory

2016-05-06 Thread Nirav Patel
Is this a limit of spark shuffle block currently? On Tue, May 3, 2016 at 11:18 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > Hi, > > My spark application getting killed abruptly during a groupBy operation > where shuffle happens. All shuffle happens with PROCESS_LOCAL

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
https://spark.apache.org/docs/latest/running-on-yarn.html > > In cluster mode, spark.yarn.am.memory is not effective. > > For Spark 2.0, akka is moved out of the picture. > FYI > >> On Sat, May 7, 2016 at 8:24 PM, Nirav Patel <npa...@xactlycorp.com> wrote: >>

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
ing, consider increasing spark.driver.memory > > Cheers > > On Sun, May 8, 2016 at 9:14 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > >> Yes, I am using yarn client mode hence I specified am settings too. >> What you mean akka is moved out of picture? I am using s

How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Hi, I thought I was using the kryo serializer for shuffle. I could verify from the spark UI Environment tab that spark.serializer = org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator = com.myapp.spark.jobs.conf.SparkSerializerRegistrator. But when I see the following error in the driver logs it
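
One way to confirm Kryo is actually handling the shuffle data is spark.kryo.registrationRequired, which makes Kryo throw for any class that reaches it unregistered; a sketch reusing the registrator named above:

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.myapp.spark.jobs.conf.SparkSerializerRegistrator")
    .set("spark.kryo.registrationRequired", "true")  // throw instead of writing unregistered classes with full class names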

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
happen in executor JVM NOT in driver JVM. Thanks On Sat, May 7, 2016 at 11:58 AM, Ted Yu <yuzhih...@gmail.com> wrote: > bq. at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129) > > It was Akka which uses JavaSerializer > > Cheers > > On Sat, May 7, 20

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Jvm overhead based on num of executors , stages and tasks in your app. > Do you know your driver heap size and application structure ( num of stages > and tasks ) > > Ashish > > On Saturday, May 7, 2016, Nirav Patel <npa...@xactlycorp.com> wrote: > >> Right but this l

DataframeWriter - How to change filename extension

2017-02-22 Thread Nirav Patel
Hi, I am writing a Dataframe as TSV using DataframeWriter as follows: myDF.write.mode("overwrite").option("sep","\t").csv("/out/path") The problem is all part files have a .csv extension instead of .tsv, e.g.: part-r-00012-f9f06712-1648-4eb6-985b-8a9c79267eef.csv All the records are stored in TSV
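
DataFrameWriter derives the extension from the format name, so a common workaround is renaming the part files afterwards through the Hadoop FileSystem API; a sketch (untested, assumes a Spark 2.x SparkSession named spark):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path("/out/path"))
    .map(_.getPath)
    .filter(_.getName.endsWith(".csv"))
    .foreach { p =>
      fs.rename(p, new Path(p.getParent, p.getName.stripSuffix(".csv") + ".tsv"))
    }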

How PolynomialExpansion works

2016-09-16 Thread Nirav Patel
The doc says: Take a 2-variable feature vector as an example: (x, y); if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y). I know the polynomial expansion (x+y)^2 = x^2 + 2xy + y^2 but I can't relate it to the above. Thanks

MulticlassClassificationEvaluator how weighted precision and weighted recall calculated

2016-10-03 Thread Nirav Patel
For example with 3 classes would it be: weightedPrecision = (TP1 * w1 + TP2 * w2 + TP3 * w3) / ((TP1 * w1 + TP2 * w2 + TP3 * w3) + (FP1 * w1 + FP2 * w2 + FP3 * w3)), where TP1, TP2, ... are the TP for each class and w1, w2, ... are the weights for each class based on their distribution in the sample data? And similarly for

Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
Hi, I built the zeppelin 0.6 branch with spark 2.0 using the following mvn command: mvn clean package -Pbuild-distr -Pmapr41 -Pyarn -Pspark-2.0 -Pscala-2.11 -DskipTests The build was successful. I only have the following set in zeppelin-conf.sh: export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/ export

Re: Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
FYI, it works when I use the MapR-configured Spark 2.0, i.e. export SPARK_HOME=/opt/mapr/spark/spark-2.0.0-bin-without-hadoop Thanks Nirav On Mon, Sep 26, 2016 at 3:45 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > Hi, > > I built zeppeling 0.6 branch using spark 2.0 usin

Spark ML - Naive Bayes - how to select Threshold values

2016-11-07 Thread Nirav Patel
A few questions about the `thresholds` parameter. This is what the doc says: "Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p

spark ml - ngram - how to preserve single word (1-gram)

2016-11-08 Thread Nirav Patel
Is it possible to preserve single tokens while using the n-gram feature transformer? E.g. Array("Hi", "I", "heard", "about", "Spark") becomes Array("Hi", "i", "heard", "about", "Spark", "Hi i", "I heard", "heard about", "about Spark"). Currently if I want to do it I will have to manually transform

Spark SQL UDF - passing map as a UDF parameter

2016-11-14 Thread Nirav Patel
I am trying to use the following API from functions to convert a map into a column so I can pass it to a UDF: map(cols: Column*): Column

Re: Spark SQL UDF - passing map as a UDF parameter

2016-11-15 Thread Nirav Patel
rdd = sc.makeRDD(1 to 3).map(i => (i, 0)) > map(rdd.collect.flatMap(x => x._1 :: x._2 :: Nil).map(lit _): _*) > > // maropu > > On Tue, Nov 15, 2016 at 9:33 AM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> I am trying to use following API from Functions to

Is IDF model reusable

2016-11-01 Thread Nirav Patel
I am using the IDF estimator/model (TF-IDF) to convert text features into vectors. Currently, I fit the IDF model on all sample data and then transform it. I read somewhere that I should split my data into training and test sets before fitting the IDF model; fit IDF only on the training data and then use the same
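
A sketch of the split-before-fit discipline being asked about (column names and split ratios are placeholders):

  import org.apache.spark.ml.feature.IDF

  val Array(train, test) = rawFeaturesDF.randomSplit(Array(0.8, 0.2), seed = 42L)
  val idfModel = new IDF()
    .setInputCol("rawFeatures")
    .setOutputCol("features")
    .fit(train)                              // fit on the training split only
  val trainVec = idfModel.transform(train)
  val testVec  = idfModel.transform(test)    // reuse the same fitted model for test and new data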

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
del before passing it through the next step. > > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > &g

Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
FYI, I do reuse the IDF model when making predictions against new unlabeled data, but not between training and test data while training a model. On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > I am using IDF estimator/model (TF-IDF) to convert text features into

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
t; overfit your model. > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 1 Nov 2016, at 10:15,

Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-01 Thread Nirav Patel
I am running a classification model. With a normal training-test split I can check model accuracy and F1 score using MulticlassClassificationEvaluator. How can I do this with the CrossValidation approach? AFAIK, you fit the entire sample data in CrossValidator as you don't want to leave out any observation

Spark ML - Is it rule of thumb that all Estimators should only be Fit on Training data

2016-11-02 Thread Nirav Patel
It is very clear that for ML algorithms (classification, regression) the Estimator only fits on training data, but it's not very clear for other estimators, IDF for example. IDF is a feature transformation model, but having an IDF estimator and transformer makes it a little confusing as to what

Re: Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-02 Thread Nirav Patel
before you use > CrossValidator, in order to get an unbiased estimate of the best model's > performance. > > On Tue, Nov 1, 2016 at 12:10 PM Nirav Patel <npa...@xactlycorp.com> wrote: > >> I am running classification model. with normal training-test split I can

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
little elaborate setting you > may want to automate model evaluations, but that's a different story. > > Not sure if I could explain properly, please feel free to comment. > On 1 Nov 2016 22:54, "Nirav Patel" <npa...@xactlycorp.com> wrote: > >> Yes, I do apply NaiveBayes

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
aying same thing so thats a good thing :) > > On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> Hi Ayan, >> >> "classification algorithm will for sure need to Fit against new dataset >> to produce new model" I said th

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
f "model evaluation" work flow > typically in lower frequency than Re-Training process. > > On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com> wrote: > >> Hi Ayan, >> After deployment, we might re-train it every month. That is whole >&g

Dynamic scheduling not respecting spark.executor.cores

2017-01-03 Thread Nirav Patel
When enabling dynamic scheduling I see that all executors are using only 1 core even if I set "spark.executor.cores" to 6. If dynamic scheduling is disabled then each executor will have 6 cores. I have tested this against spark 1.5. I wonder if this is the same behavior with 2.x as well.

Re: Dynamic Allocation not respecting spark.executor.cores

2017-01-04 Thread Nirav Patel
If this is not expected behavior then it should be logged as an issue. On Tue, Jan 3, 2017 at 2:51 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > When enabling dynamic scheduling I see that all executors are using only 1 > core even if I specify "spark.executor.cores&

Re: DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-14 Thread Nirav Patel
/jira/browse/SPARK-18579 > ). > > > > 2017-03-14 9:20 GMT+09:00 Nirav Patel <npa...@xactlycorp.com>: > >> Hi, >> >> Is there a document for each datasource (csv, tsv, parquet, json, avro) >> with available options ? I need to find one for csv to

DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-13 Thread Nirav Patel
Hi, is there a document for each datasource (csv, tsv, parquet, json, avro) with the available options? I need to find the csv one for "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace". Thanks
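
For reference, these options are passed per call on the DataFrame reader/writer; a sketch (availability on the writer side depends on the Spark version, per the SPARK-18579 reference in the reply above):

  val df = spark.read
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("sep", "\t")
    .csv("/in/path")  // placeholder path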
