Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Yea, I think this is where the heuristic is failing -- it uses 8 cores to approximate the number of active tasks, but the tests are somehow using 32 (maybe because they explicitly set it to that, or you set it yourself? I'm not sure which). On Mon, Sep 14, 2015 at 11:06 PM, Pete Robbins wrote:

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Pete Robbins
Reynold, thanks for replying.

getPageSize parameters: maxMemory=515396075, numCores=0
Calculated values: cores=8, default=4194304

So am I getting a large page size as I only have 8 cores? On 15 September 2015 at 00:40, Reynold Xin wrote: > Pete - can you do me a favor? > > > https://github.com
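
For reference, a minimal sketch of the page-size heuristic under discussion, reconstructed from the values in this thread; the constants, rounding, and names are my assumptions, not the exact Spark source:

    object PageSizeSketch {
      // Assumed floor/ceiling on the computed page size
      val minPageSize: Long = 1L * 1024 * 1024
      val maxPageSize: Long = 64L * 1024 * 1024

      def nextPowerOf2(n: Long): Long =
        if (n <= 1) 1L else java.lang.Long.highestOneBit(n - 1) << 1

      def getPageSize(maxMemory: Long, numCores: Int): Long = {
        // numCores = 0 means "ask the JVM" -- this is where cores=8 comes from
        val cores = if (numCores > 0) numCores
                    else Runtime.getRuntime.availableProcessors()
        val safetyFactor = 16 // assumed head-room divisor
        val size = nextPowerOf2(maxMemory / cores / safetyFactor)
        math.min(maxPageSize, math.max(minPageSize, size))
      }
    }

With Pete's numbers, getPageSize(515396075L, 0) on an 8-core box computes 515396075 / 8 / 16 ~= 4026531, rounded up to 4194304 (4 MB) -- matching the calculated default reported above.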

Re: And.eval short circuiting

2015-09-14 Thread Reynold Xin
rxin=# select null and true;
 ?column?
----------

(1 row)

rxin=# select null and false;
 ?column?
----------
 f
(1 row)

null and false should return false. On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson wrote: > It seems like And.eval can avoid calculating right.eval if left.eval > returns null

And.eval short circuiting

2015-09-14 Thread Zack Sampson
It seems like And.eval can avoid calculating right.eval if left.eval returns null. Is there a reason it's written like it is?

    override def eval(input: Row): Any = {
      val l = left.eval(input)
      if (l == false) {
        false
      } else {
        val r = right.eval(input)
        if (r == false) {
          false
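
For reference, a minimal sketch of SQL three-valued AND semantics, illustrating the point in Reynold's reply above: when the left side is null, the right side must still be evaluated, because null AND false is false. The names here are mine, not Spark's:

    def sqlAnd(l: Any, r: => Any): Any = {
      if (l == false) {
        false                    // safe to short-circuit: false AND x = false
      } else {
        val rv = r               // left is true or null, so the right matters
        if (rv == false) false                  // null AND false = false
        else if (l == null || rv == null) null  // null AND true  = null
        else true
      }
    }

    // sqlAnd(null, false) == false, sqlAnd(null, true) == null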

Re: Enum parameter in ML

2015-09-14 Thread Feynman Liang
We usually write a Java test suite which exercises the public API (e.g. DCT). It may be possible to create a sealed trait with singleton concrete instances inside of a serializable
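
A sketch of that pattern with illustrative names -- a serializable sealed trait whose values are case objects, plus plain accessors for Java callers:

    sealed trait SolverMode extends Serializable
    object SolverMode {
      case object Auto  extends SolverMode
      case object LBFGS extends SolverMode

      // Plain methods are easier to call from Java than the case-object
      // syntax, though Java still goes through SolverMode$.MODULE$.
      def auto: SolverMode  = Auto
      def lbfgs: SolverMode = LBFGS
    }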

RE: Enum parameter in ML

2015-09-14 Thread Ulanov, Alexander
Hi Feynman, Thank you for the suggestion. How can I ensure that there will be no problems for Java users? (I only use the Scala API.) Best regards, Alexander From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 5:27 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject

Re: Enum parameter in ML

2015-09-14 Thread Feynman Liang
Since PipelineStages are serializable, the params must also be serializable. We also have to keep the Java API in mind. Introducing a new enum Param type may work, but we will have to ensure that Java users can use it without dealing with ClassTags (I believe Scala will create new types for each po

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Pete - can you do me a favor? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174 Print the parameters that are passed into the getPageSize function, and check their values. On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin wrote:

RDD API patterns

2015-09-14 Thread sim
I'd like to get some feedback on an API design issue pertaining to RDDs. The design goal of avoiding RDD nesting, which I agree with, leads the methods operating on subsets of an RDD (not necessarily partitions) to use Iterable as an abstraction. The mapPartitions and groupBy* family of methods are
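
To make the pattern concrete, a small sketch of the two method families mentioned (assuming a SparkContext named sc):

    val rdd = sc.parallelize(1 to 10, 2)

    // mapPartitions hands each partition to the function as an Iterator
    val sums = rdd.mapPartitions(iter => Iterator(iter.sum))

    // groupBy hands each group back as an Iterable, not a nested RDD
    val byParity = rdd.groupBy(_ % 2) // RDD[(Int, Iterable[Int])]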

Enum parameter in ML

2015-09-14 Thread Ulanov, Alexander
Dear Spark developers, I am currently implementing an Estimator in ML that has a parameter that can take several different, mutually exclusive values. The most appropriate type seems to be Scala Enum (http://www.scala-lang.org/api/current/index.html#scala.Enumeration). However, the cu
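
For concreteness, the Enumeration approach being weighed, with illustrative names:

    object Metric extends Enumeration {
      type Metric = Value
      val Euclidean, Cosine, Manhattan = Value
    }

    // Fine from Scala...
    val m: Metric.Metric = Metric.Cosine

    // ...but from Java every value has the erased type Enumeration.Value,
    // which is part of the concern raised in the replies above.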

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Is this on latest master / branch-1.5? Out of the box we reserve only 16% (0.2 * 0.8) of the memory for execution (e.g. aggregate, join) / shuffle sorting. With a 3GB heap, that's 480MB. So each task gets 480MB / 32 = 15MB, and each operator reserves at least one page for execution. If your page s
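
The arithmetic in this message, spelled out with the thread's numbers (note that 3 GB * 16% is exactly the maxMemory=515396075 Pete reports above):

    val heapBytes      = 3L * 1024 * 1024 * 1024            // 3 GB heap
    val executionFrac  = 0.2 * 0.8                          // 16% for execution
    val executionBytes = (heapBytes * executionFrac).toLong // 515396075 ~ 480 MB
    val activeTasks    = 32                                 // from the failing test
    val perTaskBytes   = executionBytes / activeTasks       // ~15 MB per task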

Re: Null Value in DecimalType column of DataFrame

2015-09-14 Thread Yin Huai
A scale of 10 means that there are 10 digits at the right of the decimal point. If you also have precision 10, the range of your data will be [0, 1) and casting "10.5" to DecimalType(10, 10) will return null, which is expected. On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho < dirceu.semig
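
A minimal sketch reproducing the behaviour described (1.5-era API, assuming the usual sqlContext of a Spark shell):

    import org.apache.spark.sql.types.DecimalType

    // Precision 10 with scale 10 leaves no digits before the decimal point,
    // so any value >= 1 is unrepresentable and the cast yields null.
    val df = sqlContext.createDataFrame(Seq(Tuple1("10.5"))).toDF("s")
    df.select(df("s").cast(DecimalType(10, 10))).show() // shows null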

Re: JDBC Dialect tests

2015-09-14 Thread Reynold Xin
SPARK-9818, which you link to, actually links to a pull request trying to bring them back. On Mon, Sep 14, 2015 at 1:34 PM, Luciano Resende wrote: > I was looking for the code mentioned in SPARK-9818 and SPARK-6136 that > supposedly is testing MySQL and PostgreSQL using Docker and it seems that > this

Null Value in DecimalType column of DataFrame

2015-09-14 Thread Dirceu Semighini Filho
Hi all, I'm moving from Spark 1.4 to 1.5, and one of my tests is failing. It seems that there were some changes in org.apache.spark.sql.types.DecimalType. This ugly code is a little sample to reproduce the error; don't use it in your project. test("spark test") { val file = context.sparkConte

Re: JavaRDD using Reflection

2015-09-14 Thread Ankur Srivastava
It is not reflection that is the issue here but the use of an RDD transformation "featureKeyClassPair.map" inside "lines.mapToPair". From the code snippet you have sent it is not very clear if getFeatureScore(id,data) invokes executeFeedFeatures, but if that is the case it is not very obvious that “d
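
A sketch of the issue in Scala (the thread's code is Java; names are illustrative): an RDD handle has no meaning inside another RDD's transformation, but a small lookup RDD can be collected and broadcast instead:

    val lines    = sc.parallelize(Seq("id1", "id2"))
    val features = sc.parallelize(Seq("id1" -> 1.0, "id2" -> 2.0))

    // Invalid: lines.map(id => features.map(...)) would reference the
    // driver-side `features` handle from executor code.

    // One standard fix: collect the small RDD and broadcast the result.
    val featureMap = sc.broadcast(features.collectAsMap())
    val scored = lines.map(id => featureMap.value.getOrElse(id, 0.0))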

JDBC Dialect tests

2015-09-14 Thread Luciano Resende
I was looking for the code mentioned in SPARK-9818 and SPARK-6136 that supposedly is testing MySQL and PostgreSQL using Docker, and it seems that this code has been removed. Could anyone provide me a pointer on where these tests are actually located at the moment, and how they are integrated with th

Spark 1.5.1 release

2015-09-14 Thread Reynold Xin
Hi devs, FYI - we have already accumulated an "interesting" list of issues found with the 1.5.0 release. I will work on an RC in the next week or two, depending on how many blocker/critical issues are fixed. https://issues.apache.org/jira/issues/?filter=1221

Re: JavaRDD using Reflection

2015-09-14 Thread Ajay Singal
Hello Rachana, The easiest way would be to start with creating a 'parent' JavaRDD and run different filters (based on different input arguments) to create the respective 'child' JavaRDDs dynamically. Notice that the creation of these child RDDs is handled by the application driver. Hope this helps
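
A sketch of the suggested structure, with hypothetical names and input path -- one parent RDD built by the driver, children derived by filters chosen from the input argument:

    val parent = sc.textFile("input.txt") // hypothetical input

    def childFor(kind: String) = kind match {
      case "errors"   => parent.filter(_.contains("ERROR"))
      case "warnings" => parent.filter(_.contains("WARN"))
      case _          => parent
    }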

RE: ML: embed a transformer

2015-09-14 Thread Saif.A.Ellafi
Thank you, I will do as you suggested. PS: I read that in this user-list archive message I found: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3c55709f7b.2090...@gmail.com%3E Saif From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 4:08 PM To: Ell

Re: ML: embed a transformer

2015-09-14 Thread Feynman Liang
Where did you read that it should be public? The traits in ml.param.shared are meant to be used across internal spark.ml transformer implementations. If your transformer could be included in spark.ml, then I would recommend implementing it there so these package private traits can be reused. Other

Fwd: JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-14 Thread Juan Rodríguez Hortalá
Hi, I sent this message to the user list a few weeks ago with no luck, so I'm forwarding it to the dev list in case someone could give a hand with this. Thanks a lot in advance. I've developed a ScalaCheck property for testing Spark Streaming transformations. To do that I had to develop a custom InputDStream
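
For context, a minimal sketch of a custom InputDStream of this kind (1.x-era API; the class and its batching scheme are illustrative, not the property-based harness from the original message):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream
    import scala.reflect.ClassTag

    // Emits one precomputed batch per streaming interval, then empty RDDs.
    class SeqInputDStream[T: ClassTag](context: StreamingContext,
        batches: Seq[Seq[T]]) extends InputDStream[T](context) {

      private var i = 0

      override def start(): Unit = {}
      override def stop(): Unit = {}

      override def compute(validTime: Time): Option[RDD[T]] = {
        val batch = if (i < batches.length) batches(i) else Seq.empty[T]
        i += 1
        Some(context.sparkContext.parallelize(batch))
      }
    }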

ML: embed a transformer

2015-09-14 Thread Saif.A.Ellafi
Hi all, I'm very new to Spark and looking forward to getting deep into the topic. Right now I am trying to write my own transformer by inheritance; from what I have read so far, this practice is not very open to us as "users". I am defining my transformer based on the Binarizer, but simply fail
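
One route that stays on public API (a 1.5-era sketch): extend UnaryTransformer, which already wires up the input/output column params, so the package-private ml.param.shared traits are not needed. The fixed threshold is illustrative, not the real Binarizer:

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, DoubleType}

    class MyBinarizer(override val uid: String)
      extends UnaryTransformer[Double, Double, MyBinarizer] {

      def this() = this(Identifiable.randomUID("myBinarizer"))

      // Illustrative fixed threshold; a real version would expose a Param
      override protected def createTransformFunc: Double => Double =
        x => if (x > 0.5) 1.0 else 0.0

      override protected def outputDataType: DataType = DoubleType
    }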

RE: Data frame with one column

2015-09-14 Thread Ulanov, Alexander
Thank you for the quick response! I’ll use Tuple1 From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 11:05 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Data frame with one column For an example, see the ml-feature word2vec user guide

Re: Data frame with one column

2015-09-14 Thread Feynman Liang
You could use `Tuple1(x)` instead of `Hack` On Mon, Sep 14, 2015 at 10:50 AM, Ulanov, Alexander < alexander.ula...@hpe.com> wrote: > Dear Spark developers, > > > > I would like to create a dataframe with one column. However, the > createDataFrame method accepts at least a Product: > > > > val dat

Re: Data frame with one column

2015-09-14 Thread Feynman Liang
For an example, see the ml-feature word2vec user guide On Mon, Sep 14, 2015 at 11:03 AM, Feynman Liang wrote: > You could use `Tuple1(x)` instead of `Hack` > > On Mon, Sep 14, 2015 at 10:50 AM, Ulanov, Alexander < > alexander.ula..

Data frame with one column

2015-09-14 Thread Ulanov, Alexander
Dear Spark developers, I would like to create a dataframe with one column. However, the createDataFrame method accepts at least a Product:

    val data = Seq(1.0, 2.0)
    val rdd = sc.parallelize(data, 2)
    val df = sqlContext.createDataFrame(rdd)
    [fail]:25: error: overloaded method value createDataFrame
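
The working variant with Tuple1, as suggested in the replies above (a sketch assuming the usual sc/sqlContext of a 1.5 shell):

    val data = Seq(1.0, 2.0).map(Tuple1.apply) // Tuple1 is a Product
    val rdd  = sc.parallelize(data, 2)
    val df   = sqlContext.createDataFrame(rdd).toDF("value")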

Re: [MLlib] Extensibility of MLlib classes (Word2VecModel etc.)

2015-09-14 Thread Joseph Bradley
We tend to resist opening up APIs unless there's a strong reason to and we feel reasonably confident that the API will remain stable. That allows us to make fixes if we realize there are issues with those APIs. But if you have an important use case, I'd recommend opening up a JIRA to discuss it.

JavaRDD using Reflection

2015-09-14 Thread Rachana Srivastava
Hello all, I am working on a problem that requires us to create different sets of JavaRDDs based on different input arguments. We are getting the following error when we try to use a factory to create the JavaRDDs. The error message is clear, but I am wondering if there is any workaround. Question: How to create

Re: Spark Streaming..Exception

2015-09-14 Thread Akhil Das
You should consider upgrading your Spark from 1.3.0 to a higher version. Thanks Best Regards On Mon, Sep 14, 2015 at 2:28 PM, Priya Ch wrote: > Hi All, > > I came across the related old conversation on the above issue ( > https://issues.apache.org/jira/browse/SPARK-5594. ) Is the issue fixed?

Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Pete Robbins
I keep hitting errors running the tests on 1.5, such as:

    - join31 *** FAILED ***
      Failed to execute query using catalyst:
      Error: Job aborted due to stage failure: Task 9 in stage 3653.0
      failed 1 times, most recent failure: Lost task 9.0 in stage 3653.0
      (TID 123363, localhost): java.io.IOExceptio

Re: Spark Streaming..Exception

2015-09-14 Thread Priya Ch
Hi All, I came across the related old conversation on the above issue (https://issues.apache.org/jira/browse/SPARK-5594). Is the issue fixed? I tried different values for spark.cleaner.ttl -> 0sec, -1sec, 2000sec... none of them worked. I also tried setting spark.streaming.unpersist -> true. Wh