Using TestHiveContext/HiveContext in unit tests

2015-12-11 Thread Sahil Sareen
I'm trying to do this in unit tests: val sConf = new SparkConf() .setAppName("RandomAppName") .setMaster("local") val sc = new SparkContext(sConf) val sqlContext = new TestHiveContext(sc) // tried new HiveContext(sc) as well But I get this: [scalatest] Exception
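For readability, a minimal sketch of the setup being described, reconstructed from the flattened snippet above (the final query is a placeholder added for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.test.TestHiveContext

val sConf = new SparkConf()
  .setAppName("RandomAppName")
  .setMaster("local")
val sc = new SparkContext(sConf)
val sqlContext = new TestHiveContext(sc)   // new HiveContext(sc) was reported to fail the same way

// any trivial statement against the context reproduces the setup in question
sqlContext.sql("SELECT 1").collect()
```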

Re: Spark Java.lang.NullPointerException

2015-12-11 Thread Steve Loughran
> On 11 Dec 2015, at 05:14, michael_han wrote: > > Hi Sarala, > I found the reason: it's because when Spark runs it still needs Hadoop > support. I think it's a bug in Spark and it still isn't fixed ;) > It's related to how the hadoop filesystem apis are used to access

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list Ok, looks like when the KCL version was updated in https://github.com/apache/spark/pull/8957, the AWS SDK version was not, probably leading to a dependency conflict, though as Burak mentions it's hard to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally and on my

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Integer" for a long time.

2015-12-11 Thread Bonsen
Thank you. I found the problem: my package is "test", but I wrote "package org.apache.spark.examples", and IDEA had imported spark-examples-1.5.2-hadoop2.6.0.jar, so I could run it, and that caused lots of problems. Now, I

coGroup problem /spark streaming

2015-12-11 Thread Vieru, Mihail
Hi, we have the following use case for stream processing. We have incoming events of the following form: (itemID, orderID, orderStatus, timestamp). There are several itemIDs for each orderID, each with its own timestamp. The orderStatus can be "created" and "sent". The first incoming itemID
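As a rough sketch of how the two event streams could be keyed by orderID and cogrouped (the event case class and stream handling are assumptions, not code from the original message):

```scala
import org.apache.spark.streaming.dstream.DStream

case class Event(itemID: String, orderID: String, orderStatus: String, timestamp: Long)

// key the incoming events by orderID, split by status
def pairByOrder(events: DStream[Event], status: String): DStream[(String, Event)] =
  events.filter(_.orderStatus == status).map(e => (e.orderID, e))

def cogroupCreatedAndSent(events: DStream[Event]): DStream[(String, (Iterable[Event], Iterable[Event]))] = {
  val created = pairByOrder(events, "created")
  val sent    = pairByOrder(events, "sent")
  created.cogroup(sent)   // per batch: all "created" and "sent" items seen for each orderID
}
```

Note that cogroup only pairs events arriving in the same batch; correlating events across batches would need a stateful operation such as updateStateByKey.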

spark metrics in graphite missing for some executors

2015-12-11 Thread rok
I'm using graphite/grafana to collect and visualize metrics from my spark jobs. It appears that not all executors report all the metrics -- for example, even jvm heap data is missing from some. Is there an obvious reason why this happens? Are metrics somehow held back? Often, an executor's

What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Tom Seddon
I have a job that is running into intermittent errors with [SparkDriver] java.lang.OutOfMemoryError: Java heap space. Before I was getting this error I was getting errors saying the result size exceeded spark.driver.maxResultSize. This does not make any sense to me, as there are no actions in

Re: Mesos scheduler obeying limit of tasks / executor

2015-12-11 Thread Iulian Dragoș
Hi Charles, I am not sure I totally understand your issues, but the spark.task.cpus limit is imposed at a higher level, for all cluster managers. The code is in TaskSchedulerImpl
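For reference, a sketch of the job-level setting being discussed (the value is illustrative):

```scala
import org.apache.spark.SparkConf

// spark.task.cpus is applied per application, not per task or per stage
val conf = new SparkConf()
  .setAppName("cpu-heavy-job")
  .set("spark.task.cpus", "2")   // each task reserves 2 cores from its executor
```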

Creation of RDD in foreachAsync is failing

2015-12-11 Thread Madabhattula Rajesh Kumar
Hi, I have the query below. Please help me solve this. I have 2 ids. I want to join these ids to a table. This table contains some blob data, so I cannot join these 2 ids to this table in one step. I'm planning to join this table in chunks. For example, in each step I will join 5000
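As a rough illustration of the chunked-join idea described above (a sketch only; the id type, column name, and DataFrame handling are assumptions):

```scala
import org.apache.spark.sql.DataFrame

// `ids` is the full list of ids to look up; `blobTable` is the DataFrame backed by the blob table
def joinInChunks(ids: Seq[Long], blobTable: DataFrame, chunkSize: Int = 5000): Seq[DataFrame] =
  ids.grouped(chunkSize).map { chunk =>
    blobTable.filter(blobTable("id").isin(chunk: _*))   // one lookup per chunk of 5000 ids
  }.toSeq
```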

Compiling ERROR for Spark MetricsSystem

2015-12-11 Thread Haijia Zhou
Hi, I'm trying to register a custom Spark metrics source with the Spark metrics system using the following code: val source = new CustomMetricSource()

Re: default parallelism and mesos executors

2015-12-11 Thread Iulian Dragoș
On Wed, Dec 9, 2015 at 4:29 PM, Adrian Bridgett wrote: > (resending, text only as first post on 2nd never seemed to make it) > > Using parallelize() on a dataset I'm only seeing two tasks rather than the > number of cores in the Mesos cluster. This is with spark 1.5.1 and

Spark Submit - java.lang.IllegalArgumentException: requirement failed

2015-12-11 Thread Afshartous, Nick
Hi, I'm trying to run a streaming job on a single node EMR 4.1/Spark 1.5 cluster. It's throwing an IllegalArgumentException right away on submit. Attaching full output from console. Thanks for any insights. -- Nick 15/12/11 16:44:43 WARN util.NativeCodeLoader: Unable to load

spark-submit problems with --packages and --deploy-mode cluster

2015-12-11 Thread Greg Hill
I'm using Spark 1.5.0 with the standalone scheduler, and for the life of me I can't figure out why this isn't working. I have an application that works fine with --deploy-mode client that I'm trying to get to run in cluster mode so I can use --supervise. I ran into a few issues with my

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
Yes, it's against master: https://github.com/apache/spark/pull/10256 I'll push the KCL version bump after my local tests finish. On Fri, Dec 11, 2015 at 10:42 AM Nick Pentreath wrote: > Is that PR against master branch? > > S3 read comes from Hadoop / jet3t afaik > >

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Brian London
That's good news. I've got a PR in to bump the SDK version to 1.10.40 and the KCL to 1.6.1, which I'm running tests on locally now. Is the AWS SDK not used for reading/writing from S3, or do we get that for free from the Hadoop dependencies? On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath

Re: Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-11 Thread Roberto Coluccio
Thanks for your advice, Steve. I'm mainly talking about application logs. To be more clear, just for instance think about the "//hadoop/userlogs/application_blablabla/container_blablabla/stderr_or_stdout". So YARN's applications containers logs, stored (at least for EMR's hadoop 2.4) on DataNodes

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against master branch? S3 read comes from Hadoop / jet3t afaik — Sent from Mailbox On Fri, Dec 11, 2015 at 5:38 PM, Brian London wrote: > That's good news I've got a PR in to up the SDK version to 1.10.40 and the > KCL to 1.6.1 which I'm running tests

Re: Window function in Spark SQL

2015-12-11 Thread Michael Armbrust
Can you change permissions on that directory so that hive can write to it? We start up a mini version of hive so that we can use some of its functionality. On Fri, Dec 11, 2015 at 12:47 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > In 1.5.x whenever I try to create a HiveContext

Re: Questions on Kerberos usage with YARN and JDBC

2015-12-11 Thread Todd Simmer
hey Mike, Are these part of an Active Directory domain? If so, are they pointed at the AD domain controllers that host the Kerberos server? Windows AD creates SRV records in DNS to help Windows clients find the Kerberos server for their domain. If you look you can see if you have a kdc record in

imposed dynamic resource allocation

2015-12-11 Thread Antony Mayi
Hi, using spark 1.5.2 on yarn (client mode), I was trying to use dynamic resource allocation, but it seems that once it is enabled by the first app, any following application is managed that way even when it is explicitly disabled. example: 1) yarn configured with
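For reference, a sketch of the explicit per-application opt-out being attempted (illustrative values; the report above is that this opt-out appears to be ignored):

```scala
import org.apache.spark.SparkConf

// explicit per-application opt-out from dynamic allocation,
// with a fixed number of executors requested instead
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "4")
```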

Spark REST API shows Error 503 Service Unavailable

2015-12-11 Thread prateek arora
Hi I am trying to access Spark Using REST API but got below error : Command : curl http://:18088/api/v1/applications Response: Error 503 Service Unavailable HTTP ERROR 503 Problem accessing /api/v1/applications. Reason: Service Unavailable Caused by:

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Harsh J
General note: /root is a protected local directory, meaning that if your program runs as a non-root user, it will never be able to access the file. On Sat, Dec 12, 2015 at 12:21 AM Zhan Zhang wrote: > As Sean mentioned, you cannot referring to the local file in

How to display column names in spark-sql output

2015-12-11 Thread Ashwin Shankar
Hi, When we run spark-sql, is there a way to get column names/headers with the result? -- Thanks, Ashwin

Re: How to display column names in spark-sql output

2015-12-11 Thread Ashwin Sai Shankar
Never mind, it's: set hive.cli.print.header=true. Thanks! On Fri, Dec 11, 2015 at 5:16 PM, Ashwin Shankar wrote: > Hi, > When we run spark-sql, is there a way to get column names/headers with the > result? > > -- > Thanks, > Ashwin > > >

Re: SparkML. RandomForest predict performance for small dataset.

2015-12-11 Thread Yanbo Liang
I think you are looking for the ability to predict on a single instance. It's a feature under development; please refer to SPARK-10413. 2015-12-10 4:37 GMT+08:00 Eugene Morozov : > Hello, > > I'm using RandomForest pipeline (ml package). Everything is working fine >

RE: Re: Spark assembly in Maven repo?

2015-12-11 Thread Xiaoyong Zhu
Yes, so our scenario is to treat the spark assembly as an “SDK” so users can develop Spark applications easily without downloading them. In this case which way do you guys think might be good? Xiaoyong From: fightf...@163.com [mailto:fightf...@163.com] Sent: Friday, December 11, 2015 12:08 AM

Re: Inverse of the matrix

2015-12-11 Thread Zhiliang Zhu
Use SVD matrix decomposition; Spark has the library for it: http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#singular-value-decomposition-svd On Thursday, December 10, 2015 7:33 PM, Arunkumar Pillai wrote: Hi I need to find the inverse
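As a hedged sketch of the idea (not code from the thread): the (pseudo-)inverse can be assembled from MLlib's SVD as A+ = V * S^-1 * U^T. One way to keep the result distributed is to return its transpose, (A+)^T = U * (V * S^-1)^T:

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Pseudo-inverse via SVD: A+ = V * S^-1 * U^T.
// Returned as a RowMatrix holding (A+)^T = U * (V * S^-1)^T, to stay distributed.
def pseudoInverseTransposed(mat: RowMatrix, rCond: Double = 1e-9): RowMatrix = {
  val k = mat.numCols().toInt
  val svd = mat.computeSVD(k, computeU = true)
  // invert the singular values, zeroing out those below the tolerance
  val sInv = Vectors.dense(svd.s.toArray.map(v => if (v > rCond) 1.0 / v else 0.0))
  val vSInv = svd.V.multiply(DenseMatrix.diag(sInv))   // local n x n matrix: V * S^-1
  svd.U.multiply(vSInv.transpose)                       // distributed: U * (V * S^-1)^T
}
```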

Re: Classpath problem trying to use DataFrames

2015-12-11 Thread Harsh J
Do you have all your hive jars listed in the classpath.txt / SPARK_DIST_CLASSPATH env., specifically the hive-exec jar? Is the location of that jar also the same on all the distributed hosts? Passing an explicit executor classpath string may also help overcome this (replace HIVE_BASE_DIR to the
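A sketch of what the explicit classpath setting might look like; HIVE_BASE_DIR is a placeholder for the directory holding the hive jars (hive-exec in particular) at the same path on every host:

```scala
import org.apache.spark.SparkConf

// HIVE_BASE_DIR is a placeholder: the directory containing the hive jars
// must exist at the same location on the driver and every executor host
val conf = new SparkConf()
  .set("spark.executor.extraClassPath", "HIVE_BASE_DIR/lib/*")
  .set("spark.driver.extraClassPath", "HIVE_BASE_DIR/lib/*")
```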

Classpath problem trying to use DataFrames

2015-12-11 Thread Christopher Brady
I'm trying to run a basic "Hello world" type example using DataFrames with Hive in yarn-client mode. My code is: JavaSparkContext sc = new JavaSparkContext("yarn-client", "Test app"); HiveContext sqlContext = new HiveContext(sc.sc()); sqlContext.sql("SELECT * FROM my_table").count(); The

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Michael Armbrust
The way that we do this is to have a single context with a server in front that multiplexes jobs that use that shared context. Even if you aren't sharing data this is going to give you the best fine grained sharing of the resources that the context is managing. On Fri, Dec 11, 2015 at 10:55 AM,

Re: Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
I noticed that it is configurable at the job level via spark.task.cpus. Is there any way to support it at the task level? Thanks. Zhan Zhang On Dec 11, 2015, at 10:46 AM, Zhan Zhang wrote: > Hi Folks, > > Is it possible to assign multiple cores per task, and how? Suppose we have some >

Re: Using TestHiveContext/HiveContext in unit tests

2015-12-11 Thread Michael Armbrust
Just use TestHive. It's a global singleton that you can share across test cases. It has a reset function if you want to clear out the state at the beginning of a test. On Fri, Dec 11, 2015 at 2:06 AM, Sahil Sareen wrote: > I'm trying to do this in unit tests: > > val
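A minimal sketch of the suggested approach (the test framework and query are placeholders, not from the thread):

```scala
import org.apache.spark.sql.hive.test.TestHive
import org.scalatest.FunSuite

class MyQuerySuite extends FunSuite {
  test("runs against the shared TestHive context") {
    TestHive.reset()                       // clear any state left over from other suites
    val result = TestHive.sql("SELECT 1 AS one").collect()
    assert(result.head.getInt(0) === 1)
  }
}
```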

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Mike Wright
Thanks for the insight! ___ Mike Wright Principal Architect, Software Engineering S&P Capital IQ and SNL 434-951-7816 p 434-244-4466 f 540-470-0119 m mwri...@snl.com On Fri, Dec 11, 2015 at 2:38 PM, Michael Armbrust wrote: > The way that we do

how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Hi, I have a problem accessing a local file, with this example: sc.textFile("file:///root/2008.csv").count() with error: File file:/root/2008.csv does not exist. The file clearly exists, since if I mistype the file name to a non-existing one, it shows: Error: Input path does not

Re: Compiling ERROR for Spark MetricsSystem

2015-12-11 Thread Jean-Baptiste Onofré
Hi, can you check the scala version you use? 2.10 or 2.11? Thanks, Regards JB On 12/11/2015 05:47 PM, Haijia Zhou wrote: Hi, I'm trying to register a

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
This issue is due to file permission issue. You need to execute spark operations using root command only. Regards, Vijay Gharge On Fri, Dec 11, 2015 at 11:20 PM, Vijay Gharge wrote: > One more question. Are you also running spark commands using root user ? >

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Sean Owen
Hm, are you referencing a local file from your remote workers? That won't work as the file only exists in one machine (I presume). On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao wrote: > Hi, > > > > I have problem accessing local file, with such example: > > > >

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Here you go, thanks. -rw-r--r-- 1 root root 658M Dec 9 2014 /root/2008.csv From: Vijay Gharge [mailto:vijay.gha...@gmail.com] Sent: Friday, December 11, 2015 12:31 PM To: Lin, Hao Cc: user@spark.apache.org Subject: Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
Please ignore typo. I meant root "permissions" Regards, Vijay Gharge On Fri, Dec 11, 2015 at 11:30 PM, Vijay Gharge wrote: > This issue is due to file permission issue. You need to execute spark > operations using root command only. > > > > Regards, > Vijay Gharge > >

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
Can you provide output of "ls -lh /root/2008.csv" ? On Friday 11 December 2015, Lin, Hao wrote: > Hi, > > > > I have problem accessing local file, with such example: > > > > sc.textFile("file:///root/2008.csv").count() > > > > with error: File file:/root/2008.csv does not

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
One more question. Are you also running spark commands using root user ? Meanwhile am trying to simulate this locally. On Friday 11 December 2015, Lin, Hao wrote: > Here you go, thanks. > > > > -rw-r--r-- 1 root root 658M Dec 9 2014 /root/2008.csv > > > > *From:* Vijay

Window function in Spark SQL

2015-12-11 Thread Sourav Mazumder
Hi, Spark SQL documentation says that it complies with Hive 1.2.1 APIs and supports Window functions. I'm using Spark 1.5.0. However, when I try to execute something like below I get an error val lol5 = sqlContext.sql("select ky, lead(ky, 5, 0) over (order by ky rows 5 following) from lolt")

Re: Spark Submit - java.lang.IllegalArgumentException: requirement failed

2015-12-11 Thread Jean-Baptiste Onofré
Hi Nick, the localizedPath must not be null, which is why the requirement fails. In the SparkConf used by spark-submit (by default conf/spark-defaults.conf), do you have all properties defined, especially spark.yarn.keytab? Thanks, Regards JB On 12/11/2015 05:49 PM, Afshartous, Nick
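For completeness, a hedged sketch of supplying the keytab-related properties programmatically (paths and principal are placeholders; they are only relevant if the cluster actually uses Kerberos):

```scala
import org.apache.spark.SparkConf

// placeholder values; equivalent entries can live in spark-defaults.conf
val conf = new SparkConf()
  .set("spark.yarn.keytab", "/etc/security/keytabs/myuser.keytab")
  .set("spark.yarn.principal", "myuser@EXAMPLE.COM")
```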

Re: Window function in Spark SQL

2015-12-11 Thread Ross.Cramblit
Hey Sourav, Window functions require using a HiveContext rather than the default SQLContext. See here: http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext HiveContext provides all the same functionality of SQLContext, as well as extra features like Window
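A short sketch of the same kind of query run through a HiveContext (table and column names follow the original post; the window specification is simplified to a plain ORDER BY):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

// window functions in Spark 1.5 need a HiveContext, not the plain SQLContext
def leadByKy(sc: SparkContext): DataFrame = {
  val hiveContext = new HiveContext(sc)
  hiveContext.sql(
    "SELECT ky, lead(ky, 5, 0) OVER (ORDER BY ky) AS ky_lead FROM lolt")
}
```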

Re: DataFrame creation delay?

2015-12-11 Thread Isabelle Phan
Hi Harsh, Thanks a lot for your reply. I added a predicate to my query to select a single partition in the table, and tested with both "spark.sql.hive.metastorePartitionPruning" setting on and off, and there is no difference in DataFrame creation time. Yes, Michael's proposed workaround works.

Re: Re: Spark assembly in Maven repo?

2015-12-11 Thread fightf...@163.com
Agree with you that an assembly jar is not good to publish. However, what he really needs is to fetch an updatable maven jar file. fightf...@163.com From: Mark Hamstra Date: 2015-12-11 15:34 To: fightf...@163.com CC: Xiaoyong Zhu; Jeff Zhang; user; Zhaomin Xu; Joe Zhang (SDE) Subject: Re: RE:

Re: Spark Submit - java.lang.IllegalArgumentException: requirement failed

2015-12-11 Thread Afshartous, Nick
Thanks JB. I'm submitting from the AWS Spark master node, the spark-defaults.conf is pre-deployed by Amazon (attached) and there is no setting for spark.yarn.keytab. Is there any doc for setting this up if it is required in this scenario? Also, if deploy-mode is switched from cluster to client

Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Zhan Zhang
I think you are fetching too many results to the driver. Typically, it is not recommended to collect much data to the driver. But if you have to, you can increase the driver memory when submitting the job. Thanks. Zhan Zhang On Dec 11, 2015, at 6:14 AM, Tom Seddon
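For reference, a sketch of the two settings involved (values are illustrative); collecting less data to the driver is usually the better fix:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "8g")            // only takes effect if set before the driver JVM starts
  .set("spark.driver.maxResultSize", "2g")     // cap on the total size of serialized task results
```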

Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Eugen Cepoi
Do you have a large number of tasks? This can happen if you have a large number of tasks and a small driver, or if you use accumulators of list-like data structures. 2015-12-11 11:17 GMT-08:00 Zhan Zhang : > I think you are fetching too many results to the driver.

Re: Mesos scheduler obeying limit of tasks / executor

2015-12-11 Thread Charles Allen
That answers it thanks! On Fri, Dec 11, 2015 at 6:37 AM Iulian Dragoș wrote: > Hi Charles, > > I am not sure I totally understand your issues, but the spark.task.cpus > limit is imposed at a higher level, for all cluster managers. The code is > in TaskSchedulerImpl >

Fwd: Window function in Spark SQL

2015-12-11 Thread Sourav Mazumder
In 1.5.x, whenever I try to create a HiveContext from a SparkContext I get the following error. Please note that I'm not running any Hadoop/Hive server in my cluster; I'm only running Spark. I never faced a HiveContext creation problem like this in 1.4.x. Is it now a requirement in 1.5.x that

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Yes to your question. I have spun up a cluster, login to the master as a root user, run spark-shell, and reference the local file of the master machine. From: Vijay Gharge [mailto:vijay.gha...@gmail.com] Sent: Friday, December 11, 2015 12:50 PM To: Lin, Hao Cc: user@spark.apache.org Subject: Re:

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
I logged into master of my cluster and referenced the local file of the master node machine. And yes that file only resides on master node, not on any of the remote workers. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Friday, December 11, 2015 1:00 PM To:

Questions on Kerberos usage with YARN and JDBC

2015-12-11 Thread Mike Wright
As part of our implementation, we are utilizing a full "Kerberized" cluster built on the Hortonworks suite. We're using Job Server as the front end to initiate short-run jobs directly from our client-facing product suite. 1) We believe we have configured the job server to start with the

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Integer" for a long time.

2015-12-11 Thread Jakob Odersky
It looks like you have an issue with your classpath; I think it is because you add a jar containing Spark twice: first, you have a dependency on Spark somewhere in your build tool (this allows you to compile and run your application), and second, you re-add Spark here >

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Mike Wright
Somewhat related - what's the correct implementation when you have a single cluster to support multiple jobs that are unrelated and NOT sharing data? I was directed to look into supporting "multiple contexts" via the job server, and it was explained to me that multiple contexts per JVM are not really supported.

Re: Performance does not increase as the number of workers increasing in cluster mode

2015-12-11 Thread Zhan Zhang
Not sure about your data and model size. But intuitively, there is a tradeoff between parallelism and network overhead. With the same data set and model, there is an optimum cluster size (performance may degrade at some point as the cluster size increases). You may want to test larger data

cluster mode uses port 6066 Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-11 Thread Andy Davidson
Hi Andrew You are correct I am using cluster mode. Many thanks Andy From: Andrew Or Date: Thursday, December 10, 2015 at 6:31 PM To: Andrew Davidson Cc: Jakob Odersky , "user @spark"

HDFS

2015-12-11 Thread shahid ashraf
Hi folks, I am using a standalone cluster of 50 servers on AWS. I loaded data onto HDFS. Why am I getting a Locality Level of ANY for data on HDFS? I have 900+ partitions. -- with Regards Shahid Ashraf

Multi-core support per task in Spark

2015-12-11 Thread Zhan Zhang
Hi Folks, Is it possible to assign multiple cores per task, and how? Suppose we have a scenario in which some tasks do really heavy processing on each record and require multi-threading, and we want to avoid similar tasks being assigned to the same executors/hosts. If it is not supported, does it

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Zhan Zhang
As Sean mentioned, you cannot refer to a local file on your remote machines (executors). One workaround is to copy the file to all machines, within the same directory. Thanks. Zhan Zhang On Dec 11, 2015, at 10:26 AM, Lin, Hao > wrote: of the
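Two hedged sketches consistent with the advice in this thread: either make the file available at the same local path on every node, or, if the file is small enough to fit in driver memory, read it on the driver and parallelize it:

```scala
import scala.io.Source

// Option 1: the file exists at the same local path on the driver and on every worker
val rdd1 = sc.textFile("file:///root/2008.csv")

// Option 2 (only for files that comfortably fit in driver memory):
// read the file on the driver, then distribute its lines as an RDD
val lines = Source.fromFile("/root/2008.csv").getLines().toSeq
val rdd2 = sc.parallelize(lines)
```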

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Integer" for a long time.

2015-12-11 Thread Jakob Odersky
Btw, Spark 1.5 comes with support for hadoop 2.2 by default On 11 December 2015 at 03:08, Bonsen wrote: > Thank you,and I find the problem is my package is test,but I write package > org.apache.spark.examples ,and IDEA had imported the > spark-examples-1.5.2-hadoop2.6.0.jar