Reply: RE: HiBench build fail

2015-07-08 Thread luohui20001
Hi Ted and Grace, Retried with Spark 1.4.0, still failed with the same phenomenon. Here is a log, FYI. What other details may help? BTW, is it a necessary step to run the HiBench test for my Spark cluster? I also tried to skip building HiBench and execute bin/run-all.sh, but also got

Day of year

2015-07-08 Thread Ravisankar Mani
Hi everyone, I can't get 'day of year' when using spark sql query. Can you help any way to achieve day of year? Regards, Ravi
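
Spark 1.3/1.4 has no built-in dayofyear function in Spark SQL (it arrived in later releases), so one workaround is to register a small Scala UDF and call it from SQL. A minimal sketch, assuming a Spark 1.3/1.4 sqlContext in the shell; the table name "events" and column "ts" are illustrative only:

    import java.sql.Timestamp
    import java.util.Calendar

    // Register a UDF that returns the day of year for a timestamp column.
    sqlContext.udf.register("day_of_year", (ts: Timestamp) => {
      val cal = Calendar.getInstance()
      cal.setTime(ts)
      cal.get(Calendar.DAY_OF_YEAR)
    })

    // "events" and its timestamp column "ts" are illustrative names only.
    sqlContext.sql("SELECT day_of_year(ts) FROM events").show()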

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
This is most likely due to the internal implementation of ALS in MLib. Probably for each parallel unit of execution (partition in Spark terms) the implementation allocates and uses a RAM buffer where it keeps interim results during the ALS iterations If we assume that the size of that

Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
Hi Vinod, Yes, if you want to use a Scala or Python function you need the block of code. Only Hive UDFs are available permanently. Thanks, Vishnu On Wed, Jul 8, 2015 at 5:17 PM, vinod kumar vinodsachin...@gmail.com wrote: Thanks Vishnu, When I restart the service the UDF is not accessible

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-08 Thread Daniel Haviv
Hi, Just updating back that setting spark.driver.extraClassPath worked. Thanks, Daniel On Fri, Jul 3, 2015 at 5:35 PM, Ted Yu yuzhih...@gmail.com wrote: Alternatively, setting spark.driver.extraClassPath should work. Cheers On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran
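
For reference, spark.driver.extraClassPath can be passed on the thrift server start command or set persistently in spark-defaults.conf; a sketch with an illustrative jar path (not from the original thread):

    # when starting the thrift server (or any spark-submit-launched app)
    sbin/start-thriftserver.sh --conf spark.driver.extraClassPath=/path/to/extra.jar

    # or persistently, in conf/spark-defaults.conf
    spark.driver.extraClassPath  /path/to/extra.jar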

Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Aniruddh Sharma
Hi, I am new to Spark. I have done following tests and I am confused in conclusions. I have 2 queries. Following is the detail of test Test 1) Used 11 Node Cluster where each machine has 64 GB RAM and 4 physical cores. I ran a ALS algorithm using MilLib on 1.6 GB data set. I ran 10 executors

Re: UDF in spark

2015-07-08 Thread vinod kumar
Thanks Vishnu, When I restart the service the UDF is not accessible by my query. I need to run the mentioned block again to use the UDF. Is there any way to maintain the UDF in sqlContext permanently? Thanks, Vinod On Wed, Jul 8, 2015 at 7:16 AM, VISHNU SUBRAMANIAN johnfedrickena...@gmail.com

Re: UDF in spark

2015-07-08 Thread vinod kumar
Thank you for the quick response Vishnu, I have the following doubts too. 1. Is there any way to upload files to HDFS programmatically using the C# language? 2. Is there any way to automatically load a Scala block of code (for the UDF) when I start the Spark service? -Vinod On Wed, Jul 8, 2015 at 7:57 AM,
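
For the second question, one common approach is to keep the UDF registrations in a script and preload it whenever the shell starts, since spark-shell in 1.x forwards extra options to the underlying Scala REPL; udfs.scala below is an illustrative file name, not something from this thread. (For the first question, the WebHDFS REST API is one language-neutral option for uploading files to HDFS from C#.)

    # udfs.scala contains the sqlContext.udf.register(...) calls
    bin/spark-shell -i udfs.scala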

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Are you sure you have actually increased the RAM (how exactly did you do that, and does it show in the Spark UI)? Also use the Spark UI and the driver console to check the RAM allocated for each RDD and RDD partition in each of the scenarios. Re b) the general rule is num of partitions = 2 x
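
A sketch of that rule of thumb in code (the executor counts and the RDD name are illustrative, not from the thread):

    // Rule of thumb from the reply above: number of partitions ~= 2 x total executor cores.
    val numExecutors = 10        // illustrative
    val coresPerExecutor = 4     // illustrative
    val targetPartitions = 2 * numExecutors * coresPerExecutor
    val repartitioned = ratingsRdd.repartition(targetPartitions)   // ratingsRdd is hypothetical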

Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Terry Hole
Hi All, I'd like to use the hive context in spark shell. I need to recreate the hive meta database in the same location, so I want to close the derby connection previously created in the spark shell. Is there any way to do this? I tried this, but it does not work:

Announcement of the webinar in the newsletter and on the site

2015-07-08 Thread Oleh Rozvadovskyy
Hi there, My name is Oleh Rozvadovskyy. I represent CyberVision Inc., the IoT company and the developer of Kaa IoT platform, which is open-source middleware for smart devices and servers. In a 2 weeks period we're going to run a webinar *IoT data ingestion in Spark Streaming using Kaa on Thu,

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Also try to increase the number of partitions gradually – not in one big jump from 20 to 100 but adding e.g. 10 at a time – and see whether there is a correlation with adding more RAM to the executors. From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Wednesday, July 8, 2015 1:26 PM To:

Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
Hi, sqlContext.udf.register("udfname", functionname _) example: def square(x:Int):Int = { x * x } register the udf as below: sqlContext.udf.register("square", square _) Thanks, Vishnu On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar vinodsachin...@gmail.com wrote: Hi Everyone, I am new to spark. May I
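
Putting the reply together with the original question (calling the UDF from a SQL query), a minimal sketch; the table name "numbers" and column "value" are illustrative only:

    def square(x: Int): Int = x * x

    // Register the Scala function so it can be called from SQL.
    sqlContext.udf.register("square", square _)

    // "numbers" would be a DataFrame registered elsewhere as a temp table.
    sqlContext.sql("SELECT square(value) FROM numbers").show()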

Problem in Understanding concept of Physical Cores

2015-07-08 Thread Aniruddh Sharma
Hi I am new to Spark. Following is the problem that I am facing Test 1) I ran a VM on CDH distribution with only 1 core allocated to it and I ran simple Streaming example in spark-shell with sending data on port and trying to read it. With 1 core allocated to this nothing happens in my

Reply: Reply: RE: HiBench build fail

2015-07-08 Thread luohui20001
should I add dependencies for spark-core_2.10, spark-yarn_2.10, spark-streaming_2.10, org.apache.spark:spark-mllib_2.10, :spark-hive_2.10, :spark-graphx_2.10 in pom.xml? If yes, there are 7 pom.xml files in HiBench, listed below; which one should I modify? [root@spark-study HiBench-master]# find ./ -name

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
My apologies for double posting but I missed the web links that i followed which are 1 http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/, 2 http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/, 3

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Ashish Dutt
Thanks you Akhil for the link Sincerely, Ashish Dutt PhD Candidate Department of Information Systems University of Malaya, Lembah Pantai, 50603 Kuala Lumpur, Malaysia On Wed, Jul 8, 2015 at 3:43 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Have a look

Re: spark - redshift !!!

2015-07-08 Thread spark user
Hi, I am looking at how to load data into Redshift. Thanks  On Wednesday, July 8, 2015 12:47 AM, shahab shahab.mok...@gmail.com wrote: Hi, I did some experiments with loading data from s3 into spark. I loaded data from s3 using sc.textFile(). Have a look at the following code

Re: How to submit streaming application and exit

2015-07-08 Thread Bin Wang
Thanks. Actually I've found the way. I'm using spark-submit to submit the job to a YARN cluster with --master yarn-cluster (so the spark-submit process is not the driver). So I can set spark.yarn.submit.waitAppCompletion to false so that the process will exit after the job is submitted. ayan
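
A sketch of that submit command with the property mentioned (the class and jar names are illustrative; the property is available in the Spark 1.4 era):

    spark-submit \
      --master yarn-cluster \
      --conf spark.yarn.submit.waitAppCompletion=false \
      --class com.example.MyStreamingJob \
      my-streaming-app.jar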

Using different users with spark thriftserver

2015-07-08 Thread Zalzberg, Idan (Agoda)
Hi, We are using spark thrift server as a hive replacement. One of the things we have with hive, is that different users can connect with their own usernames/passwords and get appropriate permissions. So on the same server, one user may have a query that will have permissions to run, while the

SnappyCompressionCodec on the master

2015-07-08 Thread nizang
hi, I'm running spark standalone cluster (1.4.0). I have some applications running with scheduler every hour. I found that on one of the executions, the job got to be FINISHED after very few seconds (instead of ~5 minutes), and in the logs on the master, I can see the following exception:

Re: is it possible to disable -XX:OnOutOfMemoryError=kill %p for the executors?

2015-07-08 Thread Konstantinos Kougios
seems you're correct: 2015-07-07 17:21:27,245 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=38506,containerID=container_1436262805092_0022_01_03] is running beyond virtual memory limits. Current usage: 4.3 GB of 4.5 GB
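
The usual remedies for a container exceeding YARN's virtual-memory limit are to give the executor container more non-heap headroom, or to relax the YARN check itself; the values below are illustrative, not taken from this thread:

    # more headroom beyond the executor JVM heap (MB), passed to spark-submit
    --conf spark.yarn.executor.memoryOverhead=1024

    # or, in yarn-site.xml, disable the virtual-memory check
    yarn.nodemanager.vmem-check-enabled = false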

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
Hello Sooraj, I see you are using ipython notebook. Can you tell me are you on Windows OS or Linux based OS? I am using Windows 7 and I am new to Spark. I am trying to connect ipython with my local cluster based on CDH5.4. I followed these tutorials here but they are written on linux environment

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
Have a look http://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala, create two threads and call thread1.start(), thread2.start() Thanks Best Regards On Wed, Jul 8, 2015 at 1:06 PM, Ashish Dutt ashish.du...@gmail.com wrote: Thanks for your reply Akhil. How do you
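
A minimal sketch of that threading approach on the Scala side (file paths and table names are illustrative); jobs submitted from separate threads can be scheduled concurrently if the cluster has spare capacity:

    val t1 = new Thread(new Runnable {
      def run(): Unit = {
        val table1 = sqlContext.jsonFile("/data/table1.json")   // illustrative path
        table1.registerTempTable("table1")
      }
    })
    val t2 = new Thread(new Runnable {
      def run(): Unit = {
        val table2 = sqlContext.jsonFile("/data/table2.json")   // illustrative path
        table2.registerTempTable("table2")
      }
    })
    t1.start(); t2.start()
    t1.join(); t2.join()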

Word2Vec distributed?

2015-07-08 Thread Carsten Schnober
Hi, I've been experimenting with the Spark Word2Vec implementation in the MLLib package. It seems to me that only the preparatory steps are actually performed in a distributed way, i.e. stages 0-2 that prepare the data. In stage 3 (mapPartitionsWithIndex at Word2Vec.scala:312), only one node seems

How to upgrade Spark version in CDH 5.4

2015-07-08 Thread Ashish Dutt
Hi, I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4. I checked the documentation here

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread sooraj
That turned out to be a silly data type mistake. At one point in the iterative call, I was passing an integer value for the parameter 'alpha' of the ALS train API, which was expecting a Double. So, py4j in fact complained that it cannot take a method that takes an integer value for that parameter.

Getting started with spark-scala development in eclipse.

2015-07-08 Thread Prateek .
Hi, I am a beginner to Scala and Spark. I am trying to set up an Eclipse environment to develop a Spark program in Scala, then take its jar for spark-submit. How shall I start? To start, my tasks include setting up Eclipse for Scala and Spark, getting dependencies resolved, building the project using

Re: Getting started with spark-scala development in eclipse.

2015-07-08 Thread Ashish Dutt
Hello Prateek, I started with getting the pre built binaries so as to skip the hassle of building them from scratch. I am not familiar with scala so can't comment on it. I have documented my experiences on my blog www.edumine.wordpress.com Perhaps it might be useful to you. On 08-Jul-2015 9:39

RE: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Sun, Rui
Hi, Evgeny, I reported a JIRA issue for your problem: https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see how it will be solved. Ray -Original Message- From: Evgeny Sinelnikov [mailto:esinelni...@griddynamics.com] Sent: Monday, July 6, 2015 7:27 PM To:

Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
Hi, We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two days I have been trying to connect my laptop to the server using the spark master ip:port but it's been unsuccessful. The server contains data that needs to be cleaned and analysed. The cluster and the nodes are on linux

RE: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Evo Eftimov
That was a) fuzzy, b) insufficient – one can certainly use foreach (only) on DStream RDDs – it works, as an empirical observation. As another empirical observation: foreachPartition results in having one instance of the lambda/closure per partition when e.g. publishing to output systems

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDD's. I'm using JavaDStream to stream data (from Kafka). Then I'm

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher
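
A sketch of that wrapper-singleton pattern on the Scala side (HeavyClient and send() are hypothetical stand-ins for a non-serializable resource; dstream stands for the stream from the thread):

    object ClientHolder {
      // Initialized lazily, once per executor JVM, on first use on that worker.
      lazy val client = new HeavyClient()          // HeavyClient is hypothetical
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = ClientHolder.client           // reused across partitions and batches
        records.foreach(r => client.send(r))       // send() is hypothetical
      }
    }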

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Sean Owen
These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. On Wed, Jul 8, 2015, 2:43 PM dgoldenberg dgoldenberg...@gmail.com wrote: Is there a set of best practices for when to use foreachPartition vs. foreachRDD?

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using one

Re: Getting started with spark-scala development in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins: http://scala-ide.org/download/current.html Define your project in Maven with Scala plugins configured (you should be able to find documentation online) and import as an existing Maven project. The source code should be in

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Srikanth
Your tableLoad() APIs are not actions. File will be read fully only when an action is performed. If the action is something like table1.join(table2), then I think both files will be read in parallel. Can you try that and look at the execution plan or in 1.4 this is shown in Spark UI. Srikanth On

Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey, I have quite a few jobs appearing in the web-ui with the description run at ThreadPoolExecutor.java:1142. Are these generated by SparkSQL internally? There are so many that they cause a RejectedExecutionException when the thread-pool runs out of space for them. RejectedExecutionException

Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
Hello. I have an issue with CustomKryoRegistrator, which causes a ClassNotFound on the Worker. The issue is resolved if I call SparkConf.setJars with the path to the same jar I run. It is a workaround, but it requires specifying the same jar file twice. The first time I use it to actually run the job, and
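
A sketch of the workaround described (the registrator class and jar path are illustrative; the Scala API method is setJars):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.CustomKryoRegistrator")  // illustrative class
      // Ship the application jar explicitly so the registrator class is on
      // the executors' classpath (the workaround from this thread).
      .setJars(Seq("/path/to/my-app.jar"))                                 // illustrative path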

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Terry Hole
I am using spark 1.4.1rc1 with default hive settings. Thanks - Terry Hi All, I'd like to use the hive context in spark shell. I need to recreate the hive meta database in the same location, so I want to close the derby connection previously created in the spark shell. Is there any way to do this?

Re: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Evgeny Sinelnikov
Thank you, Ray, but it is already created and almost fixed: https://issues.apache.org/jira/browse/SPARK-8840 On Wed, Jul 8, 2015 at 4:04 PM, Sun, Rui rui@intel.com wrote: Hi, Evgeny, I reported a JIRA issue for your problem: https://issues.apache.org/jira/browse/SPARK-8897. You can

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Sean. are you asking about foreach vs foreachPartition? that's quite different. foreachPartition does not give more parallelism but lets you operate on a whole batch of data at once, which is nice if you need to allocate some expensive resource to do the processing This is basically what
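
A sketch of that pattern (createConnection and send are hypothetical; the same idea appears in the design-patterns section of the streaming guide linked later in this thread):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // One connection per partition instead of one per record.
        val connection = createConnection()                                // hypothetical factory
        partitionOfRecords.foreach(record => connection.send(record))      // hypothetical sink
        connection.close()
      }
    }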

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Brandon White
The point of running them in parallel would be faster creation of the tables. Has anybody been able to efficiently parallelize something like this in Spark? On Jul 8, 2015 12:29 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Whats the point of creating them in parallel? You can multi-thread it

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Sean Owen
@Evo There is no foreachRDD operation on RDDs; it is a method of DStream. It gives each RDD in the stream. RDD has a foreach, and foreachPartition. These give elements of an RDD. What do you mean it 'works' to call foreachRDD on an RDD? @Dmitry are you asking about foreach vs foreachPartition?

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The good boy comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger c...@koeninger.org wrote: Sean already answered your question. foreachRDD and foreachPartition are completely different, there's nothing fuzzy or insufficient

Re: Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
What I don't seem to get is how my code ends up on the Worker node. My understanding was that the jar file which I use to start the job should automatically be copied to the Worker nodes and added to the classpath. That seems not to be the case. But if my jar is not copied to the Worker nodes, then how

Re: (de)serialize DStream

2015-07-08 Thread Shixiong Zhu
DStream must be Serializable, it's metadata checkpointing. But you can use KryoSerializer for data checkpointing. The data checkpointing uses RDD.checkpoint which can be set by spark.serializer. Best Regards, Shixiong Zhu 2015-07-08 3:43 GMT+08:00 Chen Song chen.song...@gmail.com: In Spark

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread ayan guha
Do you have a benchmark to say running these two statements as it is will be slower than what you suggest? On 9 Jul 2015 01:06, Brandon White bwwintheho...@gmail.com wrote: The point of running them in parallel would be faster creation of the tables. Has anybody been able to efficiently

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Ashish Dutt
Thanks for your reply Akhil. How do you multithread it? Sincerely, Ashish Dutt On Wed, Jul 8, 2015 at 3:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What's the point of creating them in parallel? You can multi-thread it and run it in parallel though. Thanks Best Regards On Wed, Jul 8,

Re: spark - redshift !!!

2015-07-08 Thread shahab
Sorry, I misunderstood. best, /Shahab On Wed, Jul 8, 2015 at 9:52 AM, spark user spark_u...@yahoo.com wrote: Hi 'I am looking how to load data in redshift . Thanks On Wednesday, July 8, 2015 12:47 AM, shahab shahab.mok...@gmail.com wrote: Hi, I did some experiment with loading

UDF in spark

2015-07-08 Thread vinod kumar
Hi Everyone, I am new to spark. May I know how to define and use a User Defined Function in Spark SQL? I want to use the defined UDF in sql queries. My Environment: Windows 8, spark 1.3.1. Warm Regards, Vinod

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread sooraj
Hi Ashish, I am running ipython notebook server on one of the nodes of the cluster (HDP). Setting it up was quite straightforward, and I guess I followed the same references that you linked to. Then I access the notebook remotely from my development PC. Never tried to connect a local ipython (on

spark benchmarking

2015-07-08 Thread MrAsanjar .
Hi all, What is the most common used tool/product to benchmark spark job?

Re: spark benchmarking

2015-07-08 Thread Stephen Boesch
One option is the databricks/spark-perf project https://github.com/databricks/spark-perf 2015-07-08 11:23 GMT-07:00 MrAsanjar . afsan...@gmail.com: Hi all, What is the most common used tool/product to benchmark spark job?

Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread Feynman Liang
A RDD[Double] is an abstraction for a large collection of doubles, possibly distributed across multiple nodes. The DoubleRDDFunctions are there for performing mean and variance calculations across this distributed dataset. In contrast, a Vector is not distributed and fits on your local machine.
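
A short sketch of the distinction (values illustrative): DoubleRDDFunctions operate on a distributed RDD[Double], whereas a local mllib Vector lives on the driver and can simply be converted to an array, or parallelized into an RDD if those functions are what you want:

    import org.apache.spark.mllib.linalg.Vectors

    // Distributed: mean/variance over an RDD[Double] via DoubleRDDFunctions.
    val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
    println(s"mean=${rdd.mean()} variance=${rdd.variance()}")

    // Local: an mllib Vector fits on the driver; parallelize it to reuse the RDD API.
    val v = Vectors.dense(1.0, 2.0, 3.0, 4.0)
    println(sc.parallelize(v.toArray).stats())   // StatCounter: count, mean, stdev, ...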

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Feynman Liang
Take a look at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse On Wed, Jul 8, 2015 at 7:47 AM, Daniel Siegmann daniel.siegm...@teamaol.com wrote: To set up Eclipse for Spark you should install the Scala IDE plugins:

Re: Restarting Spark Streaming Application with new code

2015-07-08 Thread Vinoth Chandar
Thanks for the clarification, Cody! On Mon, Jul 6, 2015 at 6:44 AM, Cody Koeninger c...@koeninger.org wrote: You shouldn't rely on being able to restart from a checkpoint after changing code, regardless of whether the change was explicitly related to serialization. If you are relying on

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Julian, I recently built a Python+Spark application to do search relevance analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on EC2 (so I don't use the PySpark shell, hopefully thats what you are looking for). Can't share the code, but the basic approach is covered in

Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
I just asked this question at the streaming webinar that just ended, but the speakers didn't answered so throwing here: AFAIK checkpoints are the only recommended method for running Spark streaming without data loss. But it involves serializing the entire dstream graph, which prohibits any logic

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
You can use DStream.transform for some stuff. Transform takes a RDD = RDD function that allow arbitrary RDD operations to be done on RDDs of a DStream. This function gets evaluated on the driver on every batch interval. If you are smart about writing the function, it can do different stuff at
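
A sketch of that idea (lines is an illustrative DStream[String]; the blacklist variable stands in for whatever driver-side state gets updated between batches):

    // Driver-side state that can be updated while the streaming job runs.
    var blacklist: Set[String] = Set("bad-user-1")            // illustrative

    val cleaned = lines.transform { rdd =>
      // This block is evaluated on the driver at every batch interval,
      // so it picks up the current value of blacklist each time.
      val current = blacklist
      rdd.filter(record => !current.contains(record))
    }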

spark core/streaming doubts

2015-07-08 Thread Shushant Arora
1. Is the creation of a read-only singleton object in each map function the same as a broadcast object, since the singleton never gets garbage collected unless the executor gets shut down? The aim is to avoid creation of a complex object at each batch interval of a spark streaming app. 2. Why does JavaStreamingContext's sc()

Re: Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
Hi TD, you answered a wrong question. If you read the subject, mine was specifically about checkpointing. I'll elaborate The checkpoint, which is a serialized DStream DAG, contains all the metadata and *logic*, like the function passed to e.g. DStream.transform() This is serialized as a

Re: Reading Avro files from Streaming

2015-07-08 Thread harris
Resolved that compilation issue using AvroKey and AvroKeyInputFormat. val avroDs = ssc.fileStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)

Create RDD from output of unix command

2015-07-08 Thread foobar
What's the best practice for creating an RDD from some external unix command output? I assume that if the output size is large (say millions of lines), creating the RDD from an array of all lines is not a good idea? Thanks!
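
Two common approaches, sketched with an illustrative command: capture the output on the driver and parallelize it, or, if the command should run against each partition's data on the workers, use RDD.pipe:

    import scala.sys.process._

    // Option 1: run the command on the driver and parallelize its lines.
    // For very large output, writing to a file/HDFS and using sc.textFile scales better.
    val lines = Seq("ls", "-l").!!.split("\n").toSeq
    val rdd = sc.parallelize(lines)

    // Option 2: stream each partition's elements through an external command.
    val piped = sc.parallelize(1 to 1000, 4).pipe("grep 5")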

Re: PySpark without PySpark

2015-07-08 Thread Davies Liu
Great post, thanks for sharing with us! On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal sujitatgt...@gmail.com wrote: Hi Julian, I recently built a Python+Spark application to do search relevance analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on EC2 (so I don't use the

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
Hey Jong, No I did answer the right question. What I explained did not change the JVM classes (that is the function is the same) but it still ensures that computation is different (the filters get updated with time). So you can checkpoint this and recover from it. This is ONE possible way to do

Re: [SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-08 Thread Michael Armbrust
Here's a related JIRA: https://issues.apache.org/jira/browse/SPARK-7819 https://issues.apache.org/jira/browse/SPARK-7819 Typically you can work around this by making sure that the classes are shared across the isolation boundary, as discussed in the comments. On Tue, Jul 7, 2015 at 3:29 AM, Sea

Re: spark core/streaming doubts

2015-07-08 Thread Tathagata Das
Responses inline. On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Does creation of read only singleton object in each map function is same as broadcast object as singleton never gets garbage collected unless executor gets shutdown ? Aim is to avoid creation

Re: Create RDD from output of unix command

2015-07-08 Thread Richard Marscher
As a distributed data processing engine, Spark should be fine with millions of lines. It's built with the idea of massive data sets in mind. Do you have more details on how you anticipate the output of a unix command interacting with a running Spark application? Do you expect Spark to be

Re: Real-time data visualization with Zeppelin

2015-07-08 Thread Brandon White
Can you use a cron job to update it every X minutes? On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all – I’m just wondering if anyone has had success integrating Spark Streaming with Zeppelin and actually dynamically updating the data in near real-time.

Re: Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
The error is JVM has not responded after 10 seconds. On 08-Jul-2015 10:54 PM, ayan guha guha.a...@gmail.com wrote: What's the error you are getting? On 9 Jul 2015 00:01, Ashish Dutt ashish.du...@gmail.com wrote: Hi, We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two

RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Getting an exception when writing an RDD to local disk using the following call: saveAsTextFile(file:home/someuser/dir2/testupload/20150708/). The dir (/home/someuser/dir2/testupload/) was created before running the job. The error message is misleading. org.apache.spark.SparkException: Job aborted
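
For reference, a local-filesystem URI needs the file:/// scheme with three slashes, and on a cluster each executor writes its own part-files to its own local disk, so the path must be writable on every worker node; a sketch:

    // Three slashes: the file:// scheme followed by an absolute path.
    // rdd here is the RDD being saved in the job above.
    rdd.saveAsTextFile("file:///home/someuser/dir2/testupload/20150708/")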

Job completed successfully without processing anything

2015-07-08 Thread ๏̯͡๏
My job completed in 40 seconds, which is not correct as there is no output. I see Exception in thread main akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@10.115.86.24:54737/), Path(/user/OutputCommitCoordinator)] at

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Richard Marscher
Ah, I see this is streaming. I haven't any practical experience with that side of Spark. But the foreachPartition idea is a good approach. I've used that pattern extensively, even though not for singletons, but just to create non-serializable objects like API and DB clients on the executor side. I

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Feynman Liang
I was thinking the same thing! Try sc.setLogLevel(ERROR) On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson lat...@microsoft.com wrote: “WARN Executor: Told to re-register on heartbeat” is logged repeatedly in the spark shell, which is very distracting and corrupts the display of whatever set

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Andrew Or
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only affects local mode. I've filed a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-8911 2015-07-08 14:20 GMT-07:00 Lincoln Atkinson lat...@microsoft.com: Brilliant! Thanks. *From:* Feynman Liang

Remote spark-submit not working with YARN

2015-07-08 Thread jegordon
I'm trying to submit a spark job from a different server outside of my Spark Cluster (running spark 1.4.0, hadoop 2.4.0 and YARN) using the spark-submit script : spark/bin/spark-submit --master yarn-client --executor-memory 4G myjobScript.py The thing is that my application never passes from the

Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
WARN Executor: Told to re-register on heartbeat is logged repeatedly in the spark shell, which is very distracting and corrupts the display of whatever set of commands I'm currently typing out. Is there an option to disable the logging of this message? Thanks, -Lincoln

RE: Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
Brilliant! Thanks. From: Feynman Liang [mailto:fli...@databricks.com] Sent: Wednesday, July 08, 2015 2:15 PM To: Lincoln Atkinson Cc: user@spark.apache.org Subject: Re: Disable heartbeat messages in REPL I was thinking the same thing! Try sc.setLogLevel(ERROR) On Wed, Jul 8, 2015 at 2:01 PM,

PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread sooraj
Hi, I am using MLlib collaborative filtering API on an implicit preference data set. From a pySpark notebook, I am iteratively creating the matrix factorization model with the aim of measuring the RMSE for each combination of parameters for this API like the rank, lambda and alpha. After the code

Re: unable to bring up cluster with ec2 script

2015-07-08 Thread Akhil Das
It's showing connection refused; for some reason it was not able to connect to the machine – either it's the machine's start-up time or it's the security group. Thanks Best Regards On Wed, Jul 8, 2015 at 2:04 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I'm following the tutorial

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
What's the point of creating them in parallel? You can multi-thread it and run it in parallel though. Thanks Best Regards On Wed, Jul 8, 2015 at 5:34 AM, Brandon White bwwintheho...@gmail.com wrote: Say I have a spark job that looks like the following: def loadTable1() { val table1 =

Real-time data visualization with Zeppelin

2015-07-08 Thread Ganelin, Ilya
Hi all – I’m just wondering if anyone has had success integrating Spark Streaming with Zeppelin and actually dynamically updating the data in near real-time. From my investigation, it seems that Zeppelin will only allow you to display a snapshot of data, not a continuously updating table. Has

DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread ashishdutt
Hi, I get the error, DLL load failed: %1 is not a valid win32 application whenever I invoke pyspark. Attached is the screenshot of the same. Is there any way I can get rid of it. Still being new to PySpark and have had, a not so pleasant experience so far most probably because I am on a windows

Re: Spark query

2015-07-08 Thread Harish Butani
try the spark-datetime package: https://github.com/SparklineData/spark-datetime Follow this example https://github.com/SparklineData/spark-datetime#a-basic-example to get the different attributes of a DateTime. On Wed, Jul 8, 2015 at 9:11 PM, prosp4300 prosp4...@163.com wrote: As mentioned in

Re: Using Hive UDF in spark

2015-07-08 Thread ayan guha
You are most likely confused because you are using the UDF using HiveContext. In your case, you are using Spark UDF, not Hive UDF. For a naive scenario, I can use spark UDFs without any hive installation in my cluster. sqlContext.udf.register is for UDF in spark. Hive UDFs are stored in Hive and

Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread canan chen
following function saveAsTextFile(file:home/someuser/dir2/testupload/20150708/) The dir (/home/someuser/dir2/testupload/) was created before running the job. The error message is misleading. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4

DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread Ashish Dutt
Hi, I get the error, DLL load failed: %1 is not a valid win32 application whenever I invoke pyspark. Attached is the screenshot of the same. Is there any way I can get rid of it. Still being new to PySpark and have had, a not so pleasant experience so far most probably because I am on a windows

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Ashish, Nice post. Agreed, kudos to the author of the post, Benjamin Benfort of District Labs. Following your post, I get this problem; Again, not my post. I did try setting up IPython with the Spark profile for the edX Intro to Spark course (because I didn't want to use the Vagrant

Spark query

2015-07-08 Thread Ravisankar Mani
Hi everyone, I can't get 'day of year' when using spark query. Can you help any way to achieve day of year? Regards, Ravi

Re:Spark query

2015-07-08 Thread prosp4300
As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs; please take a look at the built-in UDFs of Hive below – getting day of year should be as simple as in an existing RDBMS https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions At 2015-07-09

Re: Writing data to hbase using Sparkstreaming

2015-07-08 Thread Ted Yu
bq. return new Tuple2ImmutableBytesWritable, Put(new ImmutableBytesWritable(), put); I don't think Put is serializable. FYI On Fri, Jun 12, 2015 at 6:40 AM, Vamshi Krishna vamshi2...@gmail.com wrote: Hi I am trying to write data that is

Re: Spark query

2015-07-08 Thread Brandon White
Convert the column to a column of java Timestamps. Then you can do the following import java.sql.Timestamp import java.util.Calendar def date_trunc(timestamp:Timestamp, timeField:String) = { timeField match { case hour = val cal = Calendar.getInstance()

Re: Spark program throws NIO Buffer over flow error (TDigest - Ted Dunning lib)

2015-07-08 Thread Ted Yu
Doesn't seem to be Spark problem, assuming TDigest comes from mahout. Cheers On Wed, Jul 8, 2015 at 7:49 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Same exception with different values of compression (10,100) var digest: TDigest = TDigest.createAvlTreeDigest(100) On Wed, Jul 8, 2015 at

SparkR dataFrame read.df fails to read from aws s3

2015-07-08 Thread Ben Spark
I have Spark 1.4 deployed on AWS EMR, but the SparkR read.df method cannot load data from aws s3. 1) read.df error message read.df(sqlContext, "s3://some-bucket/some.json", "json") 15/07/09 04:07:01 ERROR r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed

Re: PySpark without PySpark

2015-07-08 Thread Ashish Dutt
Hi Sujit, Thanks for your response. So I opened a new notebook using the command ipython notebook --profile spark and tried the sequence of commands. I am getting errors. Attached is the screenshot of the same. Also I am attaching the 00-pyspark-setup.py for your reference. Looks like I have

Re: Problem in Understanding concept of Physical Cores

2015-07-08 Thread Tathagata Das
There are several levels of indirection going on here, let me clarify. In the local mode, Spark runs tasks (which includes receivers) using the number of threads defined in the master (either local, or local[2], or local[*]). local or local[1] = single thread, so only one task at a time local[2]
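
The practical consequence for the streaming test in the original question: with local (a single thread) the one task slot is taken by the receiver and no batches are ever processed, so a receiver-based stream in local mode needs at least two threads. A minimal sketch mirroring the standard network word count example:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: one thread for the socket receiver, at least one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()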

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Tathagata Das
This is also discussed in the programming guide. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Sean. are you asking about foreach vs

What does RDD lineage refer to ?

2015-07-08 Thread canan chen
Lots of places refer to RDD lineage; I'd like to know what it refers to exactly. My understanding is that it means the RDD dependencies and the intermediate MapOutput info in the MapOutputTracker. Correct me if I am wrong. Thanks
