Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
>> Why are you checkpointing the direct kafka stream? It serves no purpose. Could you elaborate on what you mean? Our goal is fault tolerance. If a consumer is killed or stopped midstream, we want to resume where we left off the next time the consumer is restarted. How would that be "not serving a

RE: Spark ANN

2015-09-08 Thread Ulanov, Alexander
That is an option too. Implementing convolutions with FFTs should be considered as well http://arxiv.org/pdf/1312.5851.pdf. From: Feynman Liang [mailto:fli...@databricks.com] Sent: Tuesday, September 08, 2015 12:07 PM To: Ulanov, Alexander Cc: Ruslan Dautkhanov; Nick Pentreath; user; na...@yandex

Re: Partitions with zero records & variable task times

2015-09-08 Thread mark
As I understand things (maybe naively), my input data are stored in equal sized blocks in HDFS, and each block represents a partition within Spark when read from HDFS, therefore each block should hold roughly the same number of records. So something is missing in my understanding - what can cause

Re: Support of other languages?

2015-09-08 Thread Nagaraj Chandrashekar
Hi Rahul, I may not have the answer you are looking for, but my thoughts are given below. I have worked with HP Vertica and R via UDFs (User Defined Functions). I don't have any experience with Spark R yet. I would expect it might follow a similar route. UDF functions containi

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Tathagata Das
Calling directKafkaStream.checkpoint() will make the system write the raw Kafka data into HDFS files (that is, RDD checkpointing). This is completely unnecessary with Direct Kafka because it already tracks the offset of data in each batch (when checkpointing is enabled using streamingContext.checkpoi
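
A minimal Java sketch of the setup described above (broker, topic, and checkpoint paths are placeholders, not from the thread): enable metadata checkpointing on the streaming context and do not call checkpoint() on the direct stream itself.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class DirectKafkaCheckpointSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("direct-kafka-checkpoint-sketch");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Metadata checkpointing: lets the direct stream recover its Kafka offsets on restart.
    jssc.checkpoint("hdfs:///checkpoints/direct-kafka-sketch");

    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");
    Set<String> topics = new HashSet<String>(Arrays.asList("events"));

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);

    // Note: no stream.checkpoint(...) here; that would turn on RDD checkpointing of the
    // raw Kafka data, which the direct approach does not need for offset recovery.
    stream.print();

    jssc.start();
    jssc.awaitTermination();
  }
}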

Event logging not working when worker machine terminated

2015-09-08 Thread David Rosenstrauch
Our Spark cluster is configured to write application history event logging to a directory on HDFS. This all works fine. (I've tested it with Spark shell.) However, on a large, long-running job that we ran tonight, one of our machines at the cloud provider had issues and had to be terminated

Re: Event logging not working when worker machine terminated

2015-09-08 Thread Jeff Zhang
What cluster mode do you use? Standalone/YARN/Mesos? On Wed, Sep 9, 2015 at 11:15 AM, David Rosenstrauch wrote: > Our Spark cluster is configured to write application history event logging > to a directory on HDFS. This all works fine. (I've tested it with Spark > shell.) > > However, on a

Re: Exception when restoring spark streaming with batch RDD from checkpoint.

2015-09-08 Thread Tathagata Das
Probably, the problem here is that the recovered StreamingContext is trying to refer to the pre-failure static RDD, which does not exist after the failure. The solution: when the driver process restarts from checkpoint, you need to recreate the static RDD again explicitly, and make that the recreated R
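
One common shape of the fix (an assumption on my part, not code from the thread): rebuild the static RDD lazily from durable storage inside the streaming operations, so it is recreated on the restarted driver instead of being a dangling reference captured before the failure. The path below is a placeholder.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

class StaticRddHolder {
  private static JavaRDD<String> instance;

  // Rebuild the reference RDD on first use after (re)start.
  static synchronized JavaRDD<String> getOrCreate(JavaSparkContext sc) {
    if (instance == null) {
      instance = sc.textFile("hdfs:///reference/data").cache();
    }
    return instance;
  }
}

Inside transform or foreachRDD, fetch it via StaticRddHolder.getOrCreate(JavaSparkContext.fromSparkContext(rdd.context())), so any join or lookup sees an RDD that was created on the recovered context.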

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
That is good to know. However, that doesn't change the problem I'm seeing. Which is that, even with that piece of code commented out (stream.checkpoint()), the batch duration millis aren't getting changed unless I take checkpointing completely out. In other words, this commented out: //if (pa

Re: Getting Started with Spark

2015-09-08 Thread Tathagata Das
This is a known issue introduced in Spark 1.4.1 and 1.5.0 and will be fixed in Spark 1.5.1. In the meantime, you could prototype in Spark 1.4.0 and wait for Spark 1.5.1/1.4.2 to come out. You could also download the source code and compile the Spark master branch. https://issues.apache.org/jira/br

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Tathagata Das
Well, you are returning JavaStreamingContext.getOrCreate(params.getCheckpointDir(), factory); That is loading the checkpointed context, independent of whether params.isCheckpointed() is true. On Tue, Sep 8, 2015 at 8:28 PM, Dmitry Goldenberg wrote: > That is good to know. However, that does
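
A minimal Java sketch of the branching being pointed at here (method and parameter names are mine, not from the thread): only consult getOrCreate when checkpointing is actually enabled, otherwise build a fresh context so a changed batch duration takes effect.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

public class ContextCreationSketch {

  static JavaStreamingContext createFreshContext(long batchMillis, String checkpointDir,
                                                 boolean checkpointed) {
    SparkConf conf = new SparkConf().setAppName("context-creation-sketch");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(batchMillis));
    if (checkpointed) {
      jssc.checkpoint(checkpointDir);
    }
    return jssc;
  }

  // Only consult the checkpoint directory when checkpointing is enabled; otherwise
  // getOrCreate silently resurrects the old context (and its old batch duration) from HDFS.
  static JavaStreamingContext createContext(long batchMillis, String checkpointDir,
                                            boolean checkpointed) {
    JavaStreamingContextFactory factory =
        () -> createFreshContext(batchMillis, checkpointDir, checkpointed);
    return checkpointed
        ? JavaStreamingContext.getOrCreate(checkpointDir, factory)
        : factory.create();
  }
}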

Re: Applying transformations on a JavaRDD using reflection

2015-09-08 Thread Nirmal Fernando
Any thoughts? On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando wrote: > Hi All, > > I'd like to apply a chain of Spark transformations (map/filter) on a given > JavaRDD. I'll have the set of Spark transformations as Function, and > even though I can determine the classes of T and A at the runtime
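
A minimal sketch of one way to do this (assumed, not from the thread): hold the steps as untyped org.apache.spark.api.java.function.Function instances and fold them over the RDD with map; since T and A are only known at runtime, the generics are dropped to raw types.

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

public class TransformationChain {

  @SuppressWarnings({"unchecked", "rawtypes"})
  public static JavaRDD<?> applyAll(JavaRDD<?> input, List<Function> mapFunctions) {
    JavaRDD current = input;
    for (Function f : mapFunctions) {
      current = current.map(f); // each step feeds the next; types are checked only at runtime
    }
    return current;
  }
}

Filters would need a separate list (or a small wrapper tagging each step), since they go through rdd.filter(Function<T, Boolean>) rather than map.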

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
What's wrong with creating a checkpointed context?? We WANT checkpointing, first of all. We therefore WANT the checkpointed context. Second of all, it's not true that we're loading the checkpointed context independent of whether params.isCheckpointed() is true. I'm quoting the code again: // T

Contribution in Apache Spark

2015-09-08 Thread Chintan Bhatt
I want to contribute to Apache Spark, especially to MLlib. Please suggest any open issue/use case/problem. -- CHINTAN BHATT Assistant Professor, U & P U Patel Department of Computer Engineering, Chandubhai S. Patel Institute of Techn

Re: Best way to import data from Oracle to Spark?

2015-09-08 Thread Jörn Franke
What do you mean by import? All ways have advantages and disadvantages. You may first think about when you can make large extractions of data from the database into Spark. You may also think about whether the database should be the persistent storage of the data or if you need something aside of the dat

Re: Best way to import data from Oracle to Spark?

2015-09-08 Thread Ruslan Dautkhanov
You can also sqoop oracle data in $ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY http://www.rittmanmead.com/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/ -- Ruslan Dautkhanov On Tue, Sep

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Alexander Pivovarov
The problem we have now is skewed data (2360 tasks done in 5 min, 3 tasks in 40 min, and 1 task in 2 hours). Some people on the team worry that the executor which runs the longest task can be killed by YARN (because the executor might be unresponsive due to GC, or it might occupy more memory th

Task serialization error for mllib.MovieLensALS

2015-09-08 Thread Jeff Zhang
I ran the MovieLensALS example but hit the following error. The weird thing is that this issue only appears under OpenJDK. This is based on 1.5. I found several related tickets; not sure if anyone else has met the same issue and knows the solution? Thanks https://issues.apache.org/jira/browse/SPARK

Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-08 Thread Terry Hole
Sean, Thank you! Finally, I got this to work, although it is a bit ugly: manually setting the metadata of the DataFrame. import org.apache.spark.ml.attribute._ import org.apache.spark.sql.types._ val df = training.toDF() val schema = df.schema val rowRDD = df.rdd def enrich(m : Metadata) : Metadata

Re: Partitions with zero records & variable task times

2015-09-08 Thread Akhil Das
This post here has a bit of information: http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/ Thanks Best Regards On Wed, Sep 9, 2015 at 6:44 AM, mark wrote: > As I understand things (maybe naively), my input data are stored in equa

Re: No auto decompress in Spark Java textFile function?

2015-09-08 Thread Akhil Das
textFile used to work with .gz files; I haven't tested it on .bz2 files. If it isn't decompressing by default, then what you have to do is use sc.wholeTextFiles and then decompress each record (that being a file) with the corresponding codec. Thanks Best Regards On Tue, Sep 8, 2015 at 6:49 PM,
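
A minimal Java sketch of the manual-decompression idea. It uses sc.binaryFiles, a close cousin of the wholeTextFiles suggestion, so the compressed bytes reach the Hadoop codec untouched; the path is a placeholder and the code assumes the Spark 1.x flatMap signature (Iterable return).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Bz2ManualDecompressSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("bz2-sketch"));

    JavaRDD<String> lines = sc.binaryFiles("hdfs:///data/*.bz2").flatMap(pair -> {
      // pair._2() is a PortableDataStream over the raw, still-compressed file bytes.
      BZip2Codec codec = new BZip2Codec();
      codec.setConf(new Configuration());
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(codec.createInputStream(pair._2().open())));
      List<String> out = new ArrayList<String>();
      String line;
      while ((line = reader.readLine()) != null) {
        out.add(line);
      }
      reader.close();
      return out; // Spark 1.x FlatMapFunction returns an Iterable
    });

    System.out.println("line count: " + lines.count());
    sc.stop();
  }
}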
