Setup/Cleanup for RDD closures?

2014-10-02 Thread Stephen Boesch
Consider that there is some connection / external resource that needs to be accessed/mutated for each of the rows from within a single worker thread. That connection should only be opened before the first row is accessed and closed after the last row is completed. It is my understanding that th…
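A minimal sketch of the usual per-partition setup/teardown pattern with mapPartitions (the Connection type and its open/write/close methods are hypothetical placeholders standing in for a DB or socket client):

    import org.apache.spark.rdd.RDD

    // hypothetical stand-in for a DB/socket connection
    class Connection { def write(s: String): Unit = println(s); def close(): Unit = () }
    object Connection { def open(): Connection = new Connection }

    def writeWithConnection(rdd: RDD[String]): RDD[String] =
      rdd.mapPartitions { rows =>
        val conn = Connection.open()   // opened once per partition, before the first row
        val out = rows.map { row => conn.write(row); row }.toList  // force evaluation before closing
        conn.close()                   // closed after the last row of the partition
        out.iterator
      }

Forcing the iterator (toList) before closing matters: returning a lazy iterator would let the connection close before the rows are actually consumed.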

How to make ./bin/spark-sql work with hive?

2014-10-02 Thread Li HM
I have rebuilt the package with -Phive and copied hive-site.xml to conf (I am using hive-0.12). When I run ./bin/spark-sql, I get java.lang.NoSuchMethodError for every command. What am I missing here? Could somebody share the right procedure to make it work? java.lang.NoSuchMethodError: or…

Re: Getting table info from HiveContext

2014-10-02 Thread Banias
Thanks Michael. On Thursday, October 2, 2014 8:41 PM, Michael Armbrust wrote: We actually leave all the DDL commands up to Hive, so there is no programmatic way to access the things you are looking for. On Thu, Oct 2, 2014 at 5:17 PM, Banias wrote: Hi, > > >Would anybody know how to get

Re: HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Cheng Lian
Cache table works with partitioned tables. I guess you’re experimenting with a default local metastore and the metastore_db directory doesn’t exist in the first place. In this case, none of the metastore tables/views exist at first, and they will throw the error message you saw when the PARTITIONS me…

Re: new error for me

2014-10-02 Thread jamborta
Have you found a solution to this problem? (Or at least a cause?) Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/new-error-for-me-tp10378p15655.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Any issues with repartition?

2014-10-02 Thread jamborta
Hi Arun, Have you found a solution? Seems that I have the same problem. thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-issues-with-repartition-tp13462p15654.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Michael Armbrust
This is hard to do in general, but you can get what you are asking for by putting the following class in scope. implicit class BetterRDD[A: scala.reflect.ClassTag](rdd: org.apache.spark.rdd.RDD[A]) { def dropOne = rdd.mapPartitionsWithIndex((i, iter) => if(i == 0 && iter.hasNext) { iter.next; it
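The snippet above is cut off in the digest; a sketch of what the complete class presumably looks like:

    implicit class BetterRDD[A: scala.reflect.ClassTag](rdd: org.apache.spark.rdd.RDD[A]) {
      // drop the first element of the first partition (e.g. a header line)
      def dropOne = rdd.mapPartitionsWithIndex((i, iter) =>
        if (i == 0 && iter.hasNext) { iter.next(); iter } else iter)
    }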

Re: Getting table info from HiveContext

2014-10-02 Thread Michael Armbrust
We actually leave all the DDL commands up to Hive, so there is no programmatic way to access the things you are looking for. On Thu, Oct 2, 2014 at 5:17 PM, Banias wrote: > Hi, > > Would anybody know how to get the following information from HiveContext > given a Hive table name? > > - partition

Re: Load multiple parquet file as single RDD

2014-10-02 Thread Michael Armbrust
parquetFile accepts a comma separated list of files. Also, unionAll does not write to disk. However, unless you are running a recent version (compiled from master since this was added) it's missing an optimization an…
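A short sketch of both approaches (the sqlContext name and the HDFS paths are just examples):

    // comma-separated list of paths in a single call
    val logs = sqlContext.parquetFile("hdfs:///logs/window1,hdfs:///logs/window2")

    // or load separately and combine; unionAll is lazy and does not write to disk
    val combined = sqlContext.parquetFile("hdfs:///logs/window1")
      .unionAll(sqlContext.parquetFile("hdfs:///logs/window2"))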

Load multiple parquet file as single RDD

2014-10-02 Thread Mohnish Kodnani
Hi, I am trying to play around with Spark and Spark SQL. I have logs being stored in HDFS on a 10 minute window. Each 10 minute window could have as many as 10 files with random names of 2GB each. Now, I want to run some analysis on these files. These files are parquet files. I am trying to run Spa

Getting table info from HiveContext

2014-10-02 Thread Banias
Hi, Would anybody know how to get the following information from HiveContext given a Hive table name? - partition key(s) - table directory - input/output format I am new to Spark. And I have a couple tables created using Parquet data like: CREATE EXTERNAL TABLE parquet_table ( COL1 string, COL

Re: Help Troubleshooting Naive Bayes

2014-10-02 Thread Sandy Ryza
Those logs you included are from the Spark executor processes, as opposed to the YARN NodeManager processes. If you don't think you have access to the NodeManager logs, I would try setting spark.yarn.executor.memoryOverhead to something like 1024 or 2048 and seeing if that helps. If it does, it's
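One way to set that property programmatically (the 1024 MB value is just the suggestion from this thread; the property takes a value in MB):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024") // extra off-heap room per executor

The same property can also be passed on the spark-submit command line instead of in code.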

Re: Sorting a Sequence File

2014-10-02 Thread bhusted
Here is the code in question: //read in the hadoop sequence file to sort val file = sc.sequenceFile(input, classOf[Text], classOf[Text]) //this is the code we would like to avoid that maps the Hadoop Text input to Strings so the sortByKey will run file.map{ case (k,v) => (k.toString(), v.to…

how to debug ExecutorLostFailure

2014-10-02 Thread jamborta
hi all, I have a job that runs about for 15 mins, at some point I get an error on both nodes (all executors) saying: 14/10/02 23:14:38 WARN TaskSetManager: Lost task 80.0 in stage 3.0 (TID 253, backend-tes): ExecutorLostFailure (executor lost) In the end, it seems that the job recovers and compl

Re: Strategies for reading large numbers of files

2014-10-02 Thread Nicholas Chammas
I believe this is known as the "Hadoop Small Files Problem", and it affects Spark as well. The best approach I've seen to merging small files like this is by using s3distcp, as suggested here, as a pre-processi…

Strategies for reading large numbers of files

2014-10-02 Thread Landon Kuhn
Hello, I'm trying to use Spark to process a large number of files in S3. I'm running into an issue that I believe is related to the high number of files, and the resources required to build the listing within the driver program. If anyone in the Spark community can provide insight or guidance, it w

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Sunny Khatri
You can do a filter with startsWith? On Thu, Oct 2, 2014 at 4:04 PM, SK wrote: > Thanks for the help. Yes, I did not realize that the first header line has > a > different separator. > > By the way, is there a way to drop the first line that contains the header? > Something along the following li
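A sketch of the filter approach (the "user_id" prefix is just an assumed header marker; inp_file is the path from the original question):

    // keep every line that does not look like the header
    val data = sc.textFile(inp_file).filter(line => !line.startsWith("user_id"))

Michael's mapPartitionsWithIndex snippet earlier in this digest drops the first line without needing to know what the header contains.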

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread SK
Thanks for the help. Yes, I did not realize that the first header line has a different separator. By the way, is there a way to drop the first line that contains the header? Something along the following lines: sc.textFile(inp_file) .drop(1) // or tail() to drop the header line

Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Liquan Pei
-- Forwarded message -- From: Liquan Pei Date: Thu, Oct 2, 2014 at 3:42 PM Subject: Re: Spark SQL: ArrayIndexOutofBoundsException To: SK There is only one place you use index 1. One possible issue is that you may have only one element after your split by "\t". Can you try to r…

Re: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Michael Armbrust
The bug is likely in your data. Do you have lines in your input file that do not contain the "\t" character? If so, .split will only return a single element and p(1) from the .map() is going to throw java.lang.ArrayIndexOutOfBoundsException: 1 On Thu, Oct 2, 2014 at 3:35 PM, SK wrote: > Hi, >
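A defensive sketch that skips malformed lines instead of indexing blindly (inp_file is the path from the original post):

    val pairs = sc.textFile(inp_file)
      .map(_.split("\t"))
      .filter(_.length >= 2)            // drop lines without a tab, e.g. the header
      .map(p => (p(0), p(1)))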

Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread SK
Hi, I am trying to extract the number of distinct users from a file using Spark SQL, but I am getting the following error: ERROR Executor: Exception in task 1.0 in stage 8.0 (TID 15) java.lang.ArrayIndexOutOfBoundsException: 1 I am following the code in examples/sql/RDDRelation.scala. My co

Re: SparkSQL DataType mappings

2014-10-02 Thread Costin Leau
Hi Yin, Thanks for the reply. I've found the section as well, a couple of days ago and managed to integrate es-hadoop with Spark SQL [1] Cheers, [1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html On 10/2/14 6:32 PM, Yin Huai wrote: Hi Costin, I am answering yo

Re: Confusion over how to deploy/run JAR files to a Spark Cluster

2014-10-02 Thread Mark Mandel
On 2 October 2014 21:38, Marius Soutier wrote: > > On 02.10.2014, at 13:32, Mark Mandel wrote: > > How do I store a JAR on a cluster? Is that through spark-submit with a > deploy mode of "cluster"? > > > Well, just upload it? scp, ftp, and so on. Ideally your build server would > put it there.

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread danilopds
Ok Krishna Sankar, In relation to this information on Spark monitoring webpage, "For sbt users, set the SPARK_GANGLIA_LGPL environment variable before building. For Maven users, enable the -Pspark-ganglia-lgpl profile" Do you know what I need to do to install with sbt? Thanks. -- View this mes

HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Du Li
Hi, In Spark 1.1 HiveContext, I ran a create partitioned table command followed by a cache table command and got a java.sql.SQLSyntaxErrorException: Table/View 'PARTITIONS' does not exist. But cache table worked fine if the table is not a partitioned table. Can anybody confirm that cache of pa

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread Krishna Sankar
Hi, I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia. The other flags only add support for Hadoop, Yarn, Hive et al. in the Spark executable; no need to include them if one is not using them. Cheers On Thu, Oct 2, 2014 at 12:29 PM, danilopds wrote: > Hi tsingfu, > > I want to see me

Sorting a Sequence File

2014-10-02 Thread jritz
All, I am having trouble getting a sequence file sorted. My sequence file is (Text, Text) and when trying to sort it, Spark complains that it can not because Text is not serializable. To get around this issue, I performed a map on the sequence file to turn it into (String, String). I then perfo

Block removal causes Akka timeouts

2014-10-02 Thread maddenpj
I'm seeing a lot of Akka timeouts which eventually lead to job failure in spark streaming when removing blocks (Example stack trace below). It appears to be related to these issues: SPARK-3015 and SPARK-3139

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread danilopds
Hi tsingfu, I want to see metrics in Ganglia too. But I don't understand this step: ./make-distribution.sh --tgz --skip-java-test -Phadoop-2.3 -Pyarn -Phive -Pspark-ganglia-lgpl Are you installing Hadoop, Yarn, Hive AND Ganglia? What if I want to install just Ganglia? Can you suggest me somethi…

Re: GraphX Java API Timeline

2014-10-02 Thread Adams, Jeremiah
Thank you Ankur. Is there a branch you are working out of in github? Jeremiah Adams On Thu, Oct 2, 2014 at 1:02 PM, Ankur Dave wrote: > Yes, I'm working on a Java API for Spark 1.2. Here's the issue to track > progress: https://issues.apache.org/jira/browse/SPARK-3665 > > Ankur

Re: GraphX Java API Timeline

2014-10-02 Thread Ankur Dave
Yes, I'm working on a Java API for Spark 1.2. Here's the issue to track progress: https://issues.apache.org/jira/browse/SPARK-3665 Ankur On Thu, Oct 2, 2014 at 11:10 AM, Adams, Jeremiah wrote: > Are there any plans to create a java api for GraphX? If so, what is the

Re: Application details for failed and teminated jobs

2014-10-02 Thread Marcelo Vanzin
You may want to take a look at this PR: https://github.com/apache/spark/pull/1558 Long story short: while not a terrible idea to show running applications, your particular case should be solved differently. Applications are responsible for calling "SparkContext.stop()" at the end of their run, cur

Application details for failed and teminated jobs

2014-10-02 Thread SK
Hi, Currently the history server provides application details only for successfully completed jobs (where the APPLICATION_COMPLETE file is generated). However, (long-running) jobs that we terminate manually, or failed jobs where the APPLICATION_COMPLETE file may not be generated, don't show up on th…

Re: Issue with Partitioning

2014-10-02 Thread Liquan Pei
Hi Ankur, When sortByKey() is executed, it performs a range partitioning of the keys. After execution, each partition contains a sorted range of keys, which means that different keys may end up in different partitions. Liquan On Thu, Oct 2, 2014 at 11:09 AM, Ankur Srivastava < ankur.srivas

GraphX Java API Timeline

2014-10-02 Thread Adams, Jeremiah
Are there any plans to create a java api for GraphX? If so, what is the status? Is someone leading this effort that I can contact? Thanks. Jeremiah Adams

Re: Issue with Partitioning

2014-10-02 Thread Ankur Srivastava
Hi All, I got past the first problem; now I am able to create a partitioning where keys with the same sub-string end up in the same partition. I got that by adjusting the number of worker threads to greater than 1, as I am running the application from Eclipse on localhost. But the issue with
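A sketch of a custom Partitioner that routes keys by a leading sub-string, which would group such keys deterministically regardless of the number of worker threads (the prefix length of 3 is an arbitrary assumption):

    import org.apache.spark.Partitioner

    class SubstringPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val prefix = key.toString.take(3)   // group keys by their first 3 characters
        (prefix.hashCode % numPartitions + numPartitions) % numPartitions
      }
    }

    // usage on a keyed RDD: keyedRdd.partitionBy(new SubstringPartitioner(8))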

Re: timestamp not implemented yet

2014-10-02 Thread Michael Armbrust
That is a pretty reasonable workaround. Also, please feel free to file a JIRA when you find gaps in functionality like this that are impacting your workloads: https://issues.apache.org/jira/browse/SPARK/ On Wed, Oct 1, 2014 at 5:09 PM, barge.nilesh wrote: > Parquet format seems to be comparati

Re: Is there a way to provide individual property to each Spark executor?

2014-10-02 Thread Andrew Or
Hi Vladimir, This is not currently supported, but users have asked for it in the past. I have filed an issue for it here: https://issues.apache.org/jira/browse/SPARK-3767 so we can track its progress. Andrew 2014-10-02 5:25 GMT-07:00 Vladimir Tretyakov < vladimir.tretya...@sematext.com>: > Hi,

Re: Implicit conversion RDD -> SchemaRDD

2014-10-02 Thread Michael Armbrust
You need to define the case class outside of your method. Otherwise the scala compiler implicitly adds a pointer to the containing scope to your class which confuses things. On Thu, Oct 2, 2014 at 7:20 AM, Stephen Boesch wrote: > > Here is the specific code > > > val sc = new SparkContext(s
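A minimal sketch of the fix, with the case class moved to the top level (names follow Stephen's example elsewhere in this thread):

    case class MyTable(col1: String, col2: Byte)   // defined outside any method

    object Example {
      def run(sc: org.apache.spark.SparkContext): Unit = {
        val ctx = new org.apache.spark.sql.SQLContext(sc)
        import ctx._
        val myRows = sc.parallelize(Range(1, 21).map(ix => MyTable(s"col1$ix", ix.toByte)))
        myRows.registerTempTable("mytable")   // the implicit RDD -> SchemaRDD conversion applies here
        val result = ctx.sql("SELECT col1 FROM mytable WHERE col2 > 5")
      }
    }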

Re: [SparkSQL] Function parity with Shark?

2014-10-02 Thread Michael Armbrust
What are the errors you are seeing? All of those functions should work. On Thu, Oct 2, 2014 at 6:56 AM, Yana Kadiyska wrote: > Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC > server that comes with Spark 1.1.0. > > However I observed that conditional functions do not

Re: weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Andrew Or
Ah I see, you were running in yarn cluster mode so the logs are the same. Glad you figured it out. 2014-10-02 10:31 GMT-07:00 Greg Hill : > So, I actually figured it out, and it's all my fault. I had an older > version of spark on the datanodes and was passing > in spark.executor.extraClassPath

Re: weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Greg Hill
So, I actually figured it out, and it's all my fault. I had an older version of spark on the datanodes and was passing in spark.executor.extraClassPath to pick it up. It was a holdover from some initial work before I got everything working right. Once I removed that, it picked up the spark JA

Larger heap leads to perf degradation due to GC

2014-10-02 Thread Mingyu Kim
This issue definitely needs more investigation, but I just wanted to quickly check if anyone has run into this problem or has general guidance around it. We've seen a performance degradation with a large heap on a simple map task (i.e. no shuffle). We've seen the slowness starting around 50GB

Fwd: Second Bay Area Tachyon meetup: October 21st, hosted by Pivotal (Limited Space)

2014-10-02 Thread Haoyuan Li
-- Forwarded message -- From: Haoyuan Li Date: Thu, Oct 2, 2014 at 10:12 AM Subject: Second Bay Area Tachyon meetup: October 21st, hosted by Pivotal (Limited Space) To: tachyon-us...@googlegroups.com Hi folks, We've posted the second Tachyon meetup featuring exciting updates and

Re: weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Andrew Or
Hi Greg, Have you looked at the AM container logs? (You may already know this, but) you can get these through the RM web UI or through: yarn logs -applicationId If an AM throws an exception then the executors may not be started properly. -Andrew 2014-10-02 9:47 GMT-07:00 Greg Hill : > I h

Re: Type problem in Java when using flatMapValues

2014-10-02 Thread Sean Owen
Eh, is it not that you are mapping the values of an RDD whose keys are (String, String) pairs, but expecting the keys to be Strings? That's about what the compiler is saying too. On Thu, Oct 2, 2014 at 4:15 PM, Robin Keunen wrote: > Hi all, > > I successfully implemented my algorithm in Scala but my te

weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Greg Hill
I haven't run into this until today. I spun up a fresh cluster to do some more testing, and it seems that every single executor fails because it can't connect to the driver. This is in the YARN logs: 14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp:

Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-02 Thread maddenpj
I am seeing this same issue. Bumping for visibility. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-job-has-an-issue-when-the-worker-reading-from-Kafka-is-killed-tp12595p15611.html Sent from the Apache Spark User List mailing list arch

Re: SparkSQL DataType mappings

2014-10-02 Thread Yin Huai
Hi Costin, I am answering your questions below. 1. You can find the Spark SQL data type reference here. It explains the underlying data type for a Spark SQL data type for the Scala, Java, and Python APIs. For

Type problem in Java when using flatMapValues

2014-10-02 Thread Robin Keunen
Hi all, I successfully implemented my algorithm in Scala but my team wants it in Java. I have a problem with Generics, can anyone help me? I have a first JavaPairRDD with a structure like ((ean, key), [from, to, value]) * ean and key are string * from and to are DateTime * value is a Dou

Re: partition size for initial read

2014-10-02 Thread Yin Huai
Hi Tamas, Can you try to set mapred.map.tasks and see if it works? Thanks, Yin On Thu, Oct 2, 2014 at 10:33 AM, Tamas Jambor wrote: > That would work - I normally use hive queries through spark sql, I > have not seen something like that there. > > On Thu, Oct 2, 2014 at 3:13 PM, Ashish Jain
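Two ways to try this setting (assuming a HiveContext named hiveContext; the value 100 is arbitrary):

    // programmatically
    hiveContext.setConf("mapred.map.tasks", "100")
    // or via SQL
    hiveContext.sql("SET mapred.map.tasks=100")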

Re: partition size for initial read

2014-10-02 Thread Tamas Jambor
That would work - I normally use hive queries through spark sql, I have not seen something like that there. On Thu, Oct 2, 2014 at 3:13 PM, Ashish Jain wrote: > If you are using textFiles() to read data in, it also takes in a parameter > the number of minimum partitions to create. Would that not

Re: Implicit conversion RDD -> SchemaRDD

2014-10-02 Thread Stephen Boesch
Here is the specific code val sc = new SparkContext(s"local[$NWorkers]", "HBaseTestsSparkContext") val ctx = new SQLContext(sc) import ctx._ case class MyTable(col1: String, col2: Byte) val myRows = ctx.sparkContext.parallelize((Range(1,21).map{ix => MyTable(s"col1$ix"

Re: partition size for initial read

2014-10-02 Thread Ashish Jain
If you are using textFile() to read the data in, it also takes a parameter for the minimum number of partitions to create. Would that not work for you? On Oct 2, 2014 7:00 AM, "jamborta" wrote: > Hi all, > > I have been testing repartitioning to ensure that my algorithms get similar > amount of data.
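A sketch (the path and the minimum of 64 partitions are just examples):

    // ask Spark for at least 64 partitions when the file is first read
    val logs = sc.textFile("hdfs:///path/to/logs", 64)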

Re: Confusion over how to deploy/run JAR files to a Spark Cluster

2014-10-02 Thread Ashish Jain
Hello Mark, I am no expert but I can answer some of your questions. On Oct 2, 2014 2:15 AM, "Mark Mandel" wrote: > > Hi, > > So I'm super confused about how to take my Spark code and actually deploy and run it on a cluster. > > Let's assume I'm writing in Java, and we'll take a simple example su

partition size for initial read

2014-10-02 Thread jamborta
Hi all, I have been testing repartitioning to ensure that my algorithms get a similar amount of data. I noticed that repartitioning is very expensive. Is there a way to force Spark to create a certain number of partitions when the data is read in? How does it decide on the partition size initially?

[SparkSQL] Function parity with Shark?

2014-10-02 Thread Yana Kadiyska
Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC server that comes with Spark 1.1.0. However, I observed that conditional functions do not work (I tried 'case' and 'coalesce'); some string functions like 'concat' also did not work. Is there a list of what's missing or a ro

Re: Spark inside Eclipse

2014-10-02 Thread Daniel Siegmann
You don't need to do anything special to run in local mode from within Eclipse. Just create a simple SparkConf and create a SparkContext from that. I have unit tests which execute on a local SparkContext, and they work from inside Eclipse as well as SBT. val conf = new SparkConf().setMaster("local
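A minimal sketch of such a local test setup (the app name and local[2] are arbitrary choices):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[2]")          // run locally with two threads
      .setAppName("eclipse-test")
    val sc = new SparkContext(conf)
    // ... exercise the code under test ...
    sc.stop()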

Re: Help Troubleshooting Naive Bayes

2014-10-02 Thread Mike Bernico
Hello Xiangrui and Sandy, Thanks for jumping in to help. So, first thing... After my email last night I reran my code using 10 executors, 2G each, and everything ran okay. So, that's good, but I'm still curious as to what I was doing wrong. For Xiangrui's questions: My training set is 49174

Re: Spark Streaming for time consuming job

2014-10-02 Thread Eko Susilo
Hi Mayur, Thanks for your suggestion. In fact, that's what I was thinking about: to pass that data and return only the percentage of outliers in a particular window. I also have some doubts about implementing the outlier detection on the RDD as you have suggested. From what I understand, those

Re: persistent state for spark streaming

2014-10-02 Thread Yana Kadiyska
Yes -- persist is more akin to caching -- it's telling Spark to materialize that RDD for fast reuse but it's not meant for the end user to query/use across processes, etc. (at least that's my understanding). On Thu, Oct 2, 2014 at 4:04 AM, Chia-Chun Shih wrote: > Hi Yana, > > So, user quotas need

Is there a way to provide individual property to each Spark executor?

2014-10-02 Thread Vladimir Tretyakov
Hi, here at Sematext we are almost done with Spark monitoring http://www.sematext.com/spm/index.html But we need one thing from Spark, something like https://groups.google.com/forum/#!topic/storm-user/2fNCF341yqU in Storm: something like a 'placeholder' in the java opts which Spark will fill in for each executor, w

Re: Confusion over how to deploy/run JAR files to a Spark Cluster

2014-10-02 Thread Marius Soutier
On 02.10.2014, at 13:32, Mark Mandel wrote: > How do I store a JAR on a cluster? Is that through spark-submit with a deploy > mode of "cluster"? Well, just upload it? scp, ftp, and so on. Ideally your build server would put it there. > How do I run an already uploaded JAR with spark-submi

Re: registering Array of CompactBuffer to Kryo

2014-10-02 Thread Andras Barjak
I used this solution to get the class name correctly at runtime: kryo.register(ClassTag(Class.forName("org.apache.spark.util.collection.CompactBuffer")).wrap.runtimeClass) 2014-10-02 12:50 GMT+02:00 Daniel Darabos : > How about this? > > Class.forName("[Lorg.apache.spark.util.collection.Co
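For context, a sketch of where such a registration typically lives, wired in through a KryoRegistrator (the class name MyRegistrator and its placement are assumptions):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // register the array-of-CompactBuffer class discussed in this thread
        kryo.register(Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;"))
      }
    }

    // and in the SparkConf:
    //   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   .set("spark.kryo.registrator", "MyRegistrator")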

Re: registering Array of CompactBuffer to Kryo

2014-10-02 Thread Daniel Darabos
How about this? Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;") On Tue, Sep 30, 2014 at 5:33 PM, Andras Barjak wrote: > Hi, > > what is the correct scala code to register an Array of this private spark > class to Kryo? > > "java.lang.IllegalArgumentException: Class is not

how to send message to specific vertex by Pregel api

2014-10-02 Thread Yifan LI
Hi, Does anyone have a clue about sending messages to a specific vertex (not an immediate neighbour) whose vId is stored in a property of the source vertex, in the Pregel API? More precisely, how to do this in sendMessage()? How to pass more general Triplets into the above function? (Obviously we can do it usin

Confusion over how to deploy/run JAR files to a Spark Cluster

2014-10-02 Thread Mark Mandel
Hi, So I'm super confused about how to take my Spark code and actually deploy and run it on a cluster. Let's assume I'm writing in Java, and we'll take a simple example such as: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaLogQuery.java, and thi

Implicit conversion RDD -> SchemaRDD

2014-10-02 Thread Stephen Boesch
I am noticing disparities in behavior between the REPL and my standalone program in terms of the implicit conversion of an RDD to a SchemaRDD. In the REPL the following sequence works: import sqlContext._ val mySchemaRDD = myNormalRDD.where("1=1") However, when attempting something similar in a standalone

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-02 Thread Sean Owen
This looks like you are just running your own program. To run Spark programs, you use spark-submit. It has options that control the executor and driver memory. The settings below are not affecting Spark. On Wed, Oct 1, 2014 at 10:21 PM, 陈韵竹 wrote: > Thanks Sean. This is how I set this memory. I s

Re: persistent state for spark streaming

2014-10-02 Thread Chia-Chun Shih
Hi Yana, So, user quotas need another data store, which can guarantee persistence and afford frequent data updates/access. Is it correct? Thanks, Chia-Chun 2014-10-01 21:48 GMT+08:00 Yana Kadiyska : > I don't think persist is meant for end-user usage. You might want to call > saveAsTextFiles, f

Re: any code examples demonstrating spark streaming applications which depend on states?

2014-10-02 Thread Chia-Chun Shih
Hi Yana, Thanks for your kind response. My question was indeed unclear. What I want to do is to join a state stream, which is the updateStateByKey output of the last run. updateStateByKey is useful if the application logic doesn't (heavily) rely on states, so that you can run the application without know
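For reference, a minimal updateStateByKey sketch that also writes the state out each batch so a later run could reload it (pairStream, the key/value types, and the HDFS prefix are all assumptions):

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    def withRunningCounts(pairStream: DStream[(String, Long)]): DStream[(String, Long)] = {
      // keep a running count per key across batches
      val update: (Seq[Long], Option[Long]) => Option[Long] =
        (values, state) => Some(values.sum + state.getOrElse(0L))
      val counts = pairStream.updateStateByKey(update)
      counts.saveAsTextFiles("hdfs:///path/to/state")   // persist each batch's state
      counts
    }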

Re: What can be done if a FlatMapFunctions generated more data that can be held in memory

2014-10-02 Thread Sean Owen
Yes, the problem is that the Java API inadvertently requires an Iterable return value, not an Iterator: https://issues.apache.org/jira/browse/SPARK-3369 I think this can't be fixed until Spark 2.x. It seems possible to cheat and return a wrapper like the "IteratorIterable" I posted in the JIRA. Yo

Re: Help Troubleshooting Naive Bayes

2014-10-02 Thread Sandy Ryza
Hi Mike, Do you have access to your YARN NodeManager logs? When executors die randomly on YARN, it's often because they use more memory than allowed for their YARN container. You would see messages to the effect of "container killed because physical memory limits exceeded". -Sandy On Wed, Oct