[jira] [Updated] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue

2015-03-16 Thread Mark Khaitman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Khaitman updated SPARK-5782:
-
Priority: Blocker  (was: Critical)

 Python Worker / Pyspark Daemon Memory Issue
 ---

 Key: SPARK-5782
 URL: https://issues.apache.org/jira/browse/SPARK-5782
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Shuffle
Affects Versions: 1.3.0, 1.2.1, 1.2.2
 Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman
Priority: Blocker

 I'm including the Shuffle component on this, as a brief scan through the code 
 (which I'm not 100% familiar with just yet) shows a large amount of memory 
 handling in it:
 It appears that any type of join between two RDDs spawns twice as many 
 pyspark.daemon workers compared to the default 1 task - 1 core configuration 
 in our environment. This can become problematic in cases where you build up a 
 tree of RDD joins, since the pyspark.daemons do not cease to exist until the 
 top-level join is completed (or so it seems)... This can lead to memory 
 exhaustion by a single framework, even though it is set to have a 512MB python 
 worker memory limit and a few gigs of executor memory.
 A related issue is that the individual python workers are not supposed to 
 exceed 512MB by much; beyond that they're supposed to spill to disk.
 Some of our python workers are somehow reaching 2GB each, which, when 
 multiplied by the number of cores per executor and the number of joins 
 occurring in some cases, causes the Out-of-Memory killer to step up to its 
 unfortunate job! :(
 I think that with the _next_limit method in shuffle.py, if the current memory 
 usage is close to the memory limit, the 1.05 multiplier can endlessly cause 
 more memory to be consumed by the single python worker, since the max of 
 (512 vs 511 * 1.05) keeps ratcheting towards the latter of the two... 
 Shouldn't the memory limit be the absolute cap in this case?
 I've only just started looking into the code, and would definitely love to 
 contribute towards Spark, though I figured it might be quicker to resolve if 
 someone already owns the code!
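 For context, a minimal Scala sketch of the growth behaviour described above (the 
 real logic is the Python _next_limit method in pyspark/shuffle.py; the numbers 
 here are illustrative):
 {code}
// Simplified model of the limit update: the next limit is the max of the configured
// limit and 1.05x the memory currently in use, so once usage approaches the limit
// the effective cap ratchets upward instead of forcing a spill at the configured value.
val memoryLimitMB = 512.0

def nextLimit(usedMB: Double): Double = math.max(memoryLimitMB, usedMB * 1.05)

var used = 511.0                // just under the configured limit
for (_ <- 1 to 10) {
  val limit = nextLimit(used)   // 536.55, 563.4, 591.5, ...
  used = limit                  // assume the worker grows to the new limit
}
println(f"effective limit after 10 rounds: $used%.1f MB") // well above 512
 {code}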






[jira] [Created] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.

2015-03-16 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6366:
---

 Summary: In Python API, the default save mode for save and 
saveAsTable should be error instead of append.
 Key: SPARK-6366
 URL: https://issues.apache.org/jira/browse/SPARK-6366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


If a user wants to append data, he/she should explicitly specify the save mode. 
Also, in Scala and Java, the default save mode is ErrorIfExists.
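For reference, a hedged sketch of being explicit about the save mode with the 
1.3-era Scala DataFrame API (the save(path, source, mode) overload is assumed; 
df and the output path are placeholders):
{code}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Be explicit: append only when appending is really intended; the safer default
// (and the Scala/Java default) is SaveMode.ErrorIfExists.
def appendUsers(df: DataFrame, path: String): Unit =
  df.save(path, "parquet", SaveMode.Append)
{code}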






[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects

2015-03-16 Thread Mingyu Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364016#comment-14364016
 ] 

Mingyu Kim commented on SPARK-4808:
---

[~andrewor14], should this now be closed with the fix version 1.4? What's the 
next step?

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then will only evaluate spilling every 32nd 
 element thereafter.  When there is a small number of very large items being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce 
 the impact of the estimateSize() call.  That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are only avoiding using the resulting estimated size.
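 For reference, a simplified sketch of the gating described above (not the actual 
 Spillable code; the names and bookkeeping are approximations):
 {code}
// Spilling is only considered after ~1000 elements have been read, and then only on
// every 32nd element, so a handful of very large objects can exhaust memory before
// the first spill check ever fires.
class SpillGate(trackThreshold: Long = 1000, memoryThresholdBytes: Long = 64L * 1024 * 1024) {
  private var elementsRead = 0L

  def maybeSpill(currentMemoryBytes: Long): Boolean = {
    elementsRead += 1
    elementsRead > trackThreshold &&
      elementsRead % 32 == 0 &&
      currentMemoryBytes >= memoryThresholdBytes
  }
}
 {code}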






[jira] [Updated] (SPARK-6228) Move SASL support into network/common module

2015-03-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6228:
---
Summary: Move SASL support into network/common module  (was: Provide SASL 
support in network/common module)

 Move SASL support into network/common module
 

 Key: SPARK-6228
 URL: https://issues.apache.org/jira/browse/SPARK-6228
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.4.0


 Currently, there's support for SASL in network/shuffle, but not in 
 network/common. Moving the SASL code to network/common would enable other 
 applications using that code to also support secure authentication and, 
 later, encryption.






[jira] [Updated] (SPARK-6319) DISTINCT doesn't work for binary type

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6319:

Target Version/s: 1.4.0  (was: 1.3.1)

 DISTINCT doesn't work for binary type
 -

 Key: SPARK-6319
 URL: https://issues.apache.org/jira/browse/SPARK-6319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1
Reporter: Cheng Lian

 Spark shell session for reproduction:
 {noformat}
 scala> import sqlContext.implicits._
 scala> import org.apache.spark.sql.types._
 scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
 ...
 CAST(c, BinaryType)
 [B@43f13160
 [B@5018b648
 [B@3be22500
 [B@476fc8a1
 {noformat}
 Spark SQL uses plain byte arrays to represent binary values. However, arrays 
 are compared by reference rather than by value. On the other hand, the 
 DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
 for duplicated values. These two facts together cause the problem.
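 A small illustration of the reference-equality point (plain Scala, independent of 
 Spark):
 {code}
val a: Array[Byte] = "1".getBytes("UTF-8")
val b: Array[Byte] = "1".getBytes("UTF-8")

a == b                         // false: arrays compare by reference
a.sameElements(b)              // true: element-wise comparison
java.util.Arrays.equals(a, b)  // true: value-based comparison

// A hash-based set of Array[Byte] therefore keeps both a and b as "distinct" entries,
// which is why DISTINCT over a binary column returns duplicates.
 {code}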






[jira] [Commented] (SPARK-5310) Update SQL programming guide for 1.3

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364206#comment-14364206
 ] 

Michael Armbrust commented on SPARK-5310:
-

I think I'd rather just publish more examples of writing data sources.  Most 
users will probably not need to know how to do this.

 Update SQL programming guide for 1.3
 

 Key: SPARK-5310
 URL: https://issues.apache.org/jira/browse/SPARK-5310
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical
 Fix For: 1.3.0


 We made quite a few changes. We should update the SQL programming guide to 
 reflect these changes.






[jira] [Created] (SPARK-6367) Use the proper data type for those expressions that are hijacking existing data types.

2015-03-16 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6367:
---

 Summary: Use the proper data type for those expressions that are 
hijacking existing data types.
 Key: SPARK-6367
 URL: https://issues.apache.org/jira/browse/SPARK-6367
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Yin Huai


For the following expressions, the actual value type does not match the type of 
our internal representation. 
ApproxCountDistinctPartition
NewSet
AddItemToSet
CombineSets
CollectHashSet

We should create UDTs for data types of these expressions.






[jira] [Commented] (SPARK-6340) mllib.IDF for LabelPoints

2015-03-16 Thread Kian Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364246#comment-14364246
 ] 

Kian Ho commented on SPARK-6340:


Hi Joseph,

I initially considered that as a solution; however, it was my understanding that 
you couldn't guarantee the same ordering between the instances pre- and post- 
transformation (since the transformations will be distributed across worker 
nodes). Is that correct? This question was also raised by a couple of users in 
that thread.

Thanks

 mllib.IDF for LabelPoints
 -

 Key: SPARK-6340
 URL: https://issues.apache.org/jira/browse/SPARK-6340
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
 Environment: python 2.7.8
 pyspark
 OS: Linux Mint 17 Qiana (Cinnamon 64-bit)
Reporter: Kian Ho
Priority: Minor
  Labels: feature

 as per: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528
 Having the IDF.fit accept LabelPoints would be useful since, correct me if 
 i'm wrong, there currently isn't a way of keeping track of which labels 
 belong to which documents if one needs to apply a conventional tf-idf 
 transformation on labelled text data.






[jira] [Commented] (SPARK-6304) Checkpointing doesn't retain driver port

2015-03-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364254#comment-14364254
 ] 

Saisai Shao commented on SPARK-6304:


Hi [~msoutier], the reason to remove these two configurations, especially 
spark.driver.port, is that SparkContext itself will randomly choose a port and 
set it in the configuration even if the user didn't set it. The next time the 
application is recovered, the previous spark.driver.port needs to be removed so 
that SparkContext can again choose a port at random and set it in the SparkConf. 
That's why the checkpoint needs to remove these two configurations.
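In other words, on recovery the checkpoint code drops those properties so fresh 
values can be chosen; conceptually something like this (a sketch, not the actual 
Checkpoint code):
{code}
// Drop the host/port that were only valid for the old driver, so the recovered
// application's SparkContext picks new values instead of reusing stale ones.
def confForRecovery(recovered: Map[String, String]): Map[String, String] =
  recovered -- Seq("spark.driver.host", "spark.driver.port")
{code}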

 Checkpointing doesn't retain driver port
 

 Key: SPARK-6304
 URL: https://issues.apache.org/jira/browse/SPARK-6304
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier

 In a check-pointed Streaming application running on a fixed driver port, the 
 setting spark.driver.port is not loaded when recovering from a checkpoint.
 (The driver is then started on a random port.)






[jira] [Updated] (SPARK-6146) Support more datatype in SqlParser

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6146:

Target Version/s: 1.4.0  (was: 1.3.0)

 Support more datatype in SqlParser
 --

 Key: SPARK-6146
 URL: https://issues.apache.org/jira/browse/SPARK-6146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, I cannot do 
 {code}
 df.selectExpr("cast(a as bigint)")
 {code}
 because only the following data types are supported in SqlParser
 {code}
 protected lazy val dataType: Parser[DataType] =
 ( STRING ^^^ StringType
 | TIMESTAMP ^^^ TimestampType
 | DOUBLE ^^^ DoubleType
 | fixedDecimalType
 | DECIMAL ^^^ DecimalType.Unlimited
 | DATE ^^^ DateType
 | INT ^^^ IntegerType
 )
 {code}






[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5463:

Priority: Blocker  (was: Critical)

 Fix Parquet filter push-down
 

 Key: SPARK-5463
 URL: https://issues.apache.org/jira/browse/SPARK-5463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker








[jira] [Updated] (SPARK-5821) JSONRelation should check if delete is successful for the overwrite operation.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5821:

Target Version/s: 1.3.1  (was: 1.3.0)

 JSONRelation should check if delete is successful for the overwrite operation.
 --

 Key: SPARK-5821
 URL: https://issues.apache.org/jira/browse/SPARK-5821
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yanbo Liang

 When you run a CTAS command such as
 CREATE TEMPORARY TABLE jsonTable
 USING org.apache.spark.sql.json.DefaultSource
 OPTIONS (
 path '/a/b/c/d'
 ) AS
 SELECT a, b FROM jt,
 you will run into a failure if you don't have write permission for the 
 directory /a/b/c, whether d is a directory or a file.






[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5463:

Target Version/s: 1.4.0  (was: 1.3.0, 1.2.2)

 Fix Parquet filter push-down
 

 Key: SPARK-5463
 URL: https://issues.apache.org/jira/browse/SPARK-5463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical








[jira] [Resolved] (SPARK-5183) Document data source API

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5183.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

 Document data source API
 

 Key: SPARK-5183
 URL: https://issues.apache.org/jira/browse/SPARK-5183
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.3.0


 We need to document the data types the caller needs to support.






[jira] [Reopened] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-6250:
-
  Assignee: Yin Huai  (was: Michael Armbrust)

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker








[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364427#comment-14364427
 ] 

Michael Armbrust commented on SPARK-6250:
-

Okay, thanks for explaining the problem!

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker








[jira] [Created] (SPARK-6372) spark-submit --conf is not being propagated to child processes

2015-03-16 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-6372:
-

 Summary: spark-submit --conf is not being propagated to child 
processes
 Key: SPARK-6372
 URL: https://issues.apache.org/jira/browse/SPARK-6372
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
Priority: Blocker


Thanks to [~irashid] for bringing this up. It seems that the new launcher 
library is incorrectly handling --conf and not passing it down to the child 
processes. Fix is simple, PR coming up.






[jira] [Created] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService

2015-03-16 Thread Jeffrey Turpin (JIRA)
Jeffrey Turpin created SPARK-6373:
-

 Summary: Add SSL/TLS for the Netty based BlockTransferService 
 Key: SPARK-6373
 URL: https://issues.apache.org/jira/browse/SPARK-6373
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Shuffle
Affects Versions: 1.2.1
Reporter: Jeffrey Turpin
Priority: Minor


Add the ability to allow for secure communications (SSL/TLS) for the Netty 
based BlockTransferService and the ExternalShuffleClient. This ticket will 
hopefully start the conversation around potential designs... Below is a 
reference to a WIP prototype which implements this functionality (prototype)... 
I have attempted to disrupt as little code as possible and tried to follow the 
current code structure (for the most part) in the areas I modified. I also 
studied how Hadoop achieves encrypted shuffle 
(http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)


https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c







[jira] [Commented] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue

2015-03-16 Thread Mark Khaitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363894#comment-14363894
 ] 

Mark Khaitman commented on SPARK-5782:
--

I've upped this JIRA ticket to blocker since there's a serious memory leak / GC 
problem causing these python workers to sometimes reach almost 3GB each (with a 
512MB default limit).

I'm going to try to reproduce this using non-production data in the meantime.

 Python Worker / Pyspark Daemon Memory Issue
 ---

 Key: SPARK-5782
 URL: https://issues.apache.org/jira/browse/SPARK-5782
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Shuffle
Affects Versions: 1.3.0, 1.2.1, 1.2.2
 Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman
Priority: Blocker

 I'm including the Shuffle component on this, as a brief scan through the code 
 (which I'm not 100% familiar with just yet) shows a large amount of memory 
 handling in it:
 It appears that any type of join between two RDDs spawns twice as many 
 pyspark.daemon workers compared to the default 1 task - 1 core configuration 
 in our environment. This can become problematic in cases where you build up a 
 tree of RDD joins, since the pyspark.daemons do not cease to exist until the 
 top-level join is completed (or so it seems)... This can lead to memory 
 exhaustion by a single framework, even though it is set to have a 512MB python 
 worker memory limit and a few gigs of executor memory.
 A related issue is that the individual python workers are not supposed to 
 exceed 512MB by much; beyond that they're supposed to spill to disk.
 Some of our python workers are somehow reaching 2GB each, which, when 
 multiplied by the number of cores per executor and the number of joins 
 occurring in some cases, causes the Out-of-Memory killer to step up to its 
 unfortunate job! :(
 I think that with the _next_limit method in shuffle.py, if the current memory 
 usage is close to the memory limit, the 1.05 multiplier can endlessly cause 
 more memory to be consumed by the single python worker, since the max of 
 (512 vs 511 * 1.05) keeps ratcheting towards the latter of the two... 
 Shouldn't the memory limit be the absolute cap in this case?
 I've only just started looking into the code, and would definitely love to 
 contribute towards Spark, though I figured it might be quicker to resolve if 
 someone already owns the code!






[jira] [Resolved] (SPARK-6327) Run PySpark with python directly is broken

2015-03-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6327.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5019
[https://github.com/apache/spark/pull/5019]

 Run PySpark with python directly is broken
 --

 Key: SPARK-6327
 URL: https://issues.apache.org/jira/browse/SPARK-6327
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical
 Fix For: 1.4.0


 It worked before, but is broken now:
 {code}
 davies@localhost:~/work/spark$ python r.py
 NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
 Usage: spark-submit [options] <app jar | python file> [app arguments]
 Usage: spark-submit --kill [submission ID] --master [spark://...]
 Usage: spark-submit --status [submission ID] --master [spark://...]
 Options:
   --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
   --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally (client) or
                               on one of the worker machines inside the cluster (cluster)
                               (Default: client).
   --class CLASS_NAME          Your application's main class (for Java / Scala apps).
   --name NAME                 A name of your application.
   --jars JARS                 Comma-separated list of local jars to include on the driver
                               and executor classpaths.
   --packages                  Comma-separated list of maven coordinates of jars to include
                               on the driver and executor classpaths. Will search the local
                               maven repo, then maven central and any additional remote
                               repositories given by --repositories. The format for the
                               coordinates should be groupId:artifactId:version.
   --repositories              Comma-separated list of additional remote repositories to
                               search for the maven coordinates given with --packages.
   --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                               on the PYTHONPATH for Python apps.
   --files FILES               Comma-separated list of files to be placed in the working
                               directory of each executor.
   --conf PROP=VALUE           Arbitrary Spark configuration property.
   --properties-file FILE      Path to a file from which to load extra properties. If not
                               specified, this will look for conf/spark-defaults.conf.
   --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
   --driver-java-options       Extra Java options to pass to the driver.
   --driver-library-path       Extra library path entries to pass to the driver.
   --driver-class-path         Extra class path entries to pass to the driver. Note that
                               jars added with --jars are automatically included in the
                               classpath.
   --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
   --proxy-user NAME           User to impersonate when submitting the application.
   --help, -h                  Show this help message and exit
   --verbose, -v               Print additional debug output
   --version,                  Print the version of current Spark
  Spark standalone with cluster deploy mode only:
   --driver-cores NUM          Cores for driver (Default: 1).
   --supervise                 If given, restarts the driver on failure.
   --kill SUBMISSION_ID        If given, kills the driver specified.
   --status SUBMISSION_ID      If given, requests the status of the driver specified.
  Spark standalone and Mesos only:
   --total-executor-cores NUM  Total cores for all executors.
  YARN-only:
   --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                               (Default: 1).
   --executor-cores NUM        Number of cores per executor (Default: 1).
   --queue QUEUE_NAME          The YARN queue to submit to (Default: default).
   --num-executors NUM         Number of executors to launch (Default: 2).
   --archives ARCHIVES         Comma separated list of archives to be extracted into the
                               working directory of each executor.
 {code}




[jira] [Resolved] (SPARK-5310) Update SQL programming guide for 1.3

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5310.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

 Update SQL programming guide for 1.3
 

 Key: SPARK-5310
 URL: https://issues.apache.org/jira/browse/SPARK-5310
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical
 Fix For: 1.3.0


 We made quite a few changes. We should update the SQL programming guide to 
 reflect these changes.






[jira] [Closed] (SPARK-6340) mllib.IDF for LabelPoints

2015-03-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-6340.

Resolution: Not a Problem

 mllib.IDF for LabelPoints
 -

 Key: SPARK-6340
 URL: https://issues.apache.org/jira/browse/SPARK-6340
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
 Environment: python 2.7.8
 pyspark
 OS: Linux Mint 17 Qiana (Cinnamon 64-bit)
Reporter: Kian Ho
Priority: Minor
  Labels: feature

 as per: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528
 Having the IDF.fit accept LabelPoints would be useful since, correct me if 
 i'm wrong, there currently isn't a way of keeping track of which labels 
 belong to which documents if one needs to apply a conventional tf-idf 
 transformation on labelled text data.






[jira] [Updated] (SPARK-6247) Certain self joins cannot be analyzed

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6247:

Priority: Critical  (was: Major)

 Certain self joins cannot be analyzed
 -

 Key: SPARK-6247
 URL: https://issues.apache.org/jira/browse/SPARK-6247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 When you try the following code
 {code}
 val df =
   (1 to 10)
     .map(i => (i, i.toDouble, i.toLong, i.toString, i.toString))
     .toDF("intCol", "doubleCol", "longCol", "stringCol1", "stringCol2")
 df.registerTempTable("test")
 sql(
   """
     |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol)
     |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1)
     |GROUP BY x.stringCol2
   """.stripMargin).explain()
 {code}
 The following exception will be thrown.
 {code}
 [info]   java.util.NoSuchElementException: next on empty iterator
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
 [info]   at 
 scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
 [info]   at scala.collection.IterableLike$class.head(IterableLike.scala:91)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
 [info]   at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
 [info]   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 [info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 [info]   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 [info]   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 [info]   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 [info]   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 [info]   at 
 scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
 [info]   at scala.collection.immutable.List.foldLeft(List.scala:84)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
 [info]   at 

[jira] [Updated] (SPARK-6231) Join on two tables (generated from same one) is broken

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6231:

Target Version/s: 1.3.1

 Join on two tables (generated from same one) is broken
 --

 Key: SPARK-6231
 URL: https://issues.apache.org/jira/browse/SPARK-6231
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Critical
  Labels: DataFrame

 If the two columns used in joinExpr come from the same table, they have the 
 same id, and the joinExpr is then explained in the wrong way.
 {code}
 val df = sqlContext.load(path, "parquet")
 val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
 val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
 val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
 scala> rmJoin.explain
 == Physical Plan ==
 CartesianProduct
  Filter (cust_id#0 = cust_id#0)
   Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L]
    Exchange (HashPartitioning [cust_id#0], 200)
     Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25]
      PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542
  Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
   Exchange (HashPartitioning [cust_id#17], 200)
    Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38]
     PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542
 {code}






[jira] [Commented] (SPARK-6146) Support more datatype in SqlParser

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364272#comment-14364272
 ] 

Michael Armbrust commented on SPARK-6146:
-

Now that we have our own DDL parser that doesn't live in hive, we should use 
one code path for this.

 Support more datatype in SqlParser
 --

 Key: SPARK-6146
 URL: https://issues.apache.org/jira/browse/SPARK-6146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, I cannot do 
 {code}
 df.selectExpr("cast(a as bigint)")
 {code}
 because only the following data types are supported in SqlParser
 {code}
 protected lazy val dataType: Parser[DataType] =
 ( STRING ^^^ StringType
 | TIMESTAMP ^^^ TimestampType
 | DOUBLE ^^^ DoubleType
 | fixedDecimalType
 | DECIMAL ^^^ DecimalType.Unlimited
 | DATE ^^^ DateType
 | INT ^^^ IntegerType
 )
 {code}






[jira] [Updated] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5881:

Target Version/s: 1.4.0  (was: 1.3.0)

 RDD remains cached after the table gets overridden by CACHE TABLE
 ---

 Key: SPARK-5881
 URL: https://issues.apache.org/jira/browse/SPARK-5881
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 {code}
 val rdd = sc.parallelize((1 to 10).map(i => s"""{"a": $i, "b": "str${i}"}"""))
 sqlContext.jsonRDD(rdd).registerTempTable("jt")
 sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
 sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt")
 {code}
 After the second CACHE TABLE command, the RDD for the first table still 
 remains in the cache.






[jira] [Updated] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5881:

Priority: Critical  (was: Major)

 RDD remains cached after the table gets overridden by CACHE TABLE
 ---

 Key: SPARK-5881
 URL: https://issues.apache.org/jira/browse/SPARK-5881
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 {code}
 val rdd = sc.parallelize((1 to 10).map(i => s"""{"a": $i, "b": "str${i}"}"""))
 sqlContext.jsonRDD(rdd).registerTempTable("jt")
 sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
 sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt")
 {code}
 After the second CACHE TABLE command, the RDD for the first table still 
 remains in the cache.






[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-16 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364279#comment-14364279
 ] 

Tathagata Das commented on SPARK-5523:
--

As long as the hostname object is short-lived, it's cool. That's the same
strategy used for StorageLevel. So it is fine.




 TaskMetrics and TaskInfo have innumerable copies of the hostname string
 ---

 Key: SPARK-5523
 URL: https://issues.apache.org/jira/browse/SPARK-5523
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Reporter: Tathagata Das

  TaskMetrics and TaskInfo objects have the hostname associated with the task. 
 As these are created (directly or through deserialization of RPC messages), 
 each of them has a separate String object for the hostname, even though most 
 of them contain the same string data. This results in thousands of string 
 objects, increasing the memory requirement of the driver. 
 This can easily be deduped when deserializing a TaskMetrics object, or when 
 creating a TaskInfo object.
 This affects streaming particularly badly due to the rate of job/stage/task 
 generation. 
 For a solution, see how this dedup is done for StorageLevel: 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
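 A minimal sketch of the dedup idea (mirroring the StorageLevel approach linked 
 above; not the actual fix, and HostnameCache is an illustrative name):
 {code}
import java.util.concurrent.ConcurrentHashMap

// Keep one canonical String per distinct hostname so deserialized TaskMetrics /
// TaskInfo objects can share it instead of each holding its own copy.
object HostnameCache {
  private val cache = new ConcurrentHashMap[String, String]()

  def canonicalize(host: String): String = {
    val prev = cache.putIfAbsent(host, host)
    if (prev == null) host else prev
  }
}
 {code}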
  






[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364310#comment-14364310
 ] 

tanyinyan commented on SPARK-6348:
--

Yes, I use a one-hot encoding before SVM, which is exactly what 'sparsed before 
SVM' means :)

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 The SVMWithSGD class is private; its train methods are provided through the 
 SVMWithSGD object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), 
 training on the first day's data (ignoring the fields id/device_id/device_ip; 
 all remaining fields are considered categorical variables and sparsed before 
 SVM) and predicting on the same data with the threshold cleared, and the 
 predicted results are all negative. When I set useFeatureScaling to true, the 
 predicted results are normal (including both negative and positive results).
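 As a workaround until the flag is exposed, the scaling can be applied manually 
 before training; a hedged sketch using MLlib's StandardScaler (trainScaled is an 
 illustrative helper, not an existing API):
 {code}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Standardize the one-hot encoded features ourselves, since SVMWithSGD.train
// does not expose useFeatureScaling. withMean = false preserves sparsity.
def trainScaled(data: RDD[LabeledPoint], numIterations: Int): SVMModel = {
  val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
  val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()
  SVMWithSGD.train(scaled, numIterations)
}
 {code}
 Note that the model is then trained in the scaled feature space, so the same 
 scaler has to be applied to any features passed to predict.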






[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364345#comment-14364345
 ] 

Michael Armbrust commented on SPARK-6320:
-

Hmm, interesting.  So far I had only considered this interface for planning 
leaves of the query plan.  Can you tell me more about what you are trying to 
optimize?

 Adding new query plan strategy to SQLContext
 

 Key: SPARK-6320
 URL: https://issues.apache.org/jira/browse/SPARK-6320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Youssef Hatem
Priority: Minor

 Hi,
 I would like to add a new strategy to {{SQLContext}}. To do this I created a 
 new class which extends {{Strategy}}. In my new class I need to call the 
 {{planLater}} function; however, this method is defined in {{SparkPlanner}} 
 (which itself inherits the method from {{QueryPlanner}}).
 To my knowledge, the only way to make {{planLater}} visible to my new strategy 
 is to define the strategy inside another class that extends {{SparkPlanner}} 
 and thereby inherits {{planLater}}. By doing so I also have to extend 
 {{SQLContext}} so that I can override the {{planner}} field with the new 
 planner class I created.
 It seems that this is a design problem, because adding a new strategy seems to 
 require extending {{SQLContext}} (unless I am doing it wrong and there is a 
 better way to do it).
 Thanks a lot,
 Youssef






[jira] [Commented] (SPARK-6349) Add probability estimates in SVMModel predict result

2015-03-16 Thread tanyinyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364361#comment-14364361
 ] 

tanyinyan commented on SPARK-6349:
--

Yes, this doesn't solve the problem of picking a threshold. But a raw margin 
usually has no fixed range (as I tested above, the output margins are all 
negative), whereas a probability does, so it's more convenient for picking a 
good threshold, right?

 Add probability estimates in SVMModel predict result
 

 Key: SPARK-6349
 URL: https://issues.apache.org/jira/browse/SPARK-6349
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
   Original Estimate: 168h
  Remaining Estimate: 168h

 In SVMModel, the predictPoint method outputs a raw margin (threshold not set) 
 or a 1/0 label (threshold set). 
 When SVM is used as a classifier, it's hard to find a good threshold, and the 
 raw margin is hard to interpret. 
 When I use SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), 
 training on the first day's data (ignoring the fields id/device_id/device_ip; 
 all remaining fields are considered categorical variables and sparsed before 
 SVM) and predicting on the same data with the threshold cleared, the predicted 
 results are all negative. I have to set the threshold to -1 to get a reasonable 
 confusion matrix.
 So I suggest providing probability estimates in SVMModel's predict results, as 
 in libSVM (Platt's binary SVM probabilistic output).
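 Until such support exists, a hedged sketch of mapping the raw margin into (0, 1) 
 with a logistic function (true Platt scaling fits the A and B parameters on 
 held-out data; the values below are placeholders):
 {code}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def marginToProbability(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): RDD[(Double, Double)] = {
  val model = SVMWithSGD.train(training, 100)
  model.clearThreshold() // predict() now returns the raw margin instead of a 0/1 label

  // Placeholder Platt parameters; in practice fit A and B by maximum likelihood
  // on (margin, label) pairs from a held-out set.
  val A = -1.0
  val B = 0.0
  test.map { p =>
    val margin = model.predict(p.features)
    (p.label, 1.0 / (1.0 + math.exp(A * margin + B)))
  }
}
 {code}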






[jira] [Commented] (SPARK-6371) Update version to 1.4.0-SNAPSHOT

2015-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364410#comment-14364410
 ] 

Apache Spark commented on SPARK-6371:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5056

 Update version to 1.4.0-SNAPSHOT
 

 Key: SPARK-6371
 URL: https://issues.apache.org/jira/browse/SPARK-6371
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: Marcelo Vanzin
Priority: Critical

 See summary.






[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Nitay Joffe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364429#comment-14364429
 ] 

Nitay Joffe commented on SPARK-6250:


Thanks [~marmbrus] and [~yhuai].

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker








[jira] [Commented] (SPARK-6207) YARN secure cluster mode doesn't obtain a hive-metastore token

2015-03-16 Thread Doug Balog (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364441#comment-14364441
 ] 

Doug Balog commented on SPARK-6207:
---

Need to catch java.lang.UnsupportedOperationException and ignore it, or check 
whether delegation-token mode is supported with the current configuration 
before trying to get a delegation token.
See  https://issues.apache.org/jira/browse/HIVE-4625
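A rough sketch of the guard being suggested (illustrative only; obtainToken 
stands in for whatever call the YARN Client ends up making against the 
metastore):
{code}
// Tolerate metastore configurations that do not support delegation tokens
// (see HIVE-4625) instead of failing the submission.
def addHiveTokenIfSupported(obtainToken: () => Unit): Unit =
  try {
    obtainToken()
  } catch {
    case _: UnsupportedOperationException =>
      // Delegation tokens are not supported with the current configuration; skip.
  }
{code}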


 YARN secure cluster mode doesn't obtain a hive-metastore token 
 ---

 Key: SPARK-6207
 URL: https://issues.apache.org/jira/browse/SPARK-6207
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit, SQL, YARN
Affects Versions: 1.2.0, 1.3.0, 1.2.1
 Environment: YARN
Reporter: Doug Balog

 When running a spark job, on YARN in secure mode, with --deploy-mode 
 cluster,  org.apache.spark.deploy.yarn.Client() does not obtain a delegation 
 token to the hive-metastore. Therefore any attempts to talk to the 
 hive-metastore fail with a GSSException: No valid credentials provided...






[jira] [Created] (SPARK-6376) Relation are thrown away too early in dataframes

2015-03-16 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6376:
---

 Summary: Relation are thrown away too early in dataframes
 Key: SPARK-6376
 URL: https://issues.apache.org/jira/browse/SPARK-6376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical


Because we throw away aliases as we construct the query plan, you can't 
reference them later.  For example, this query fails:

{code}
  test("self join with aliases") {
    val df = Seq(1, 2, 3).map(i => (i, i.toString)).toDF("int", "str")
    checkAnswer(
      df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count(),
      Row(1, 1) :: Row(2, 1) :: Row(3, 1) :: Nil)
  }
{code}

{code}
[info]   org.apache.spark.sql.AnalysisException: Cannot resolve column name 
x.str among (int, str, int, str);
[info]   at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
[info]   at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
{code}






[jira] [Updated] (SPARK-6247) Certain self joins cannot be analyzed

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6247:

Target Version/s: 1.3.1  (was: 1.3.0)

 Certain self joins cannot be analyzed
 -

 Key: SPARK-6247
 URL: https://issues.apache.org/jira/browse/SPARK-6247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai

 When you try the following code
 {code}
 val df =
   (1 to 10)
     .map(i => (i, i.toDouble, i.toLong, i.toString, i.toString))
     .toDF("intCol", "doubleCol", "longCol", "stringCol1", "stringCol2")
 df.registerTempTable("test")
 sql(
   """
     |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol)
     |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1)
     |GROUP BY x.stringCol2
   """.stripMargin).explain()
 {code}
 The following exception will be thrown.
 {code}
 [info]   java.util.NoSuchElementException: next on empty iterator
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
 [info]   at 
 scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
 [info]   at scala.collection.IterableLike$class.head(IterableLike.scala:91)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
 [info]   at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
 [info]   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 [info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 [info]   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 [info]   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 [info]   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 [info]   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 [info]   at 
 scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
 [info]   at scala.collection.immutable.List.foldLeft(List.scala:84)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
 [info]   at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133)
 [info]   at 

[jira] [Commented] (SPARK-6340) mllib.IDF for LabelPoints

2015-03-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364266#comment-14364266
 ] 

Joseph K. Bradley commented on SPARK-6340:
--

You should be able to reliably zip the RDDs back together.  I just sent an 
update to that post, which I'll copy here:

{quote}
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340  
so I'll answer one item which was asked about the reliability of zipping RDDs.  
Basically, it should be reliable, and if it is not, then it should be reported 
as a bug.  This general approach should work (with explicit types to make it 
clear):

{code}
val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
val features2: RDD[Vector] = new HashingTF(numFeatures=100).transform(features1)
val features3: RDD[Vector] = idfModel.transform(features2)
val finalData: RDD[LabeledPoint] = labels.zip(features3).map { case (label, features) =>
  LabeledPoint(label, features) }
{code}
{quote}

Do report it if you run into problems with this!  Thanks.

 mllib.IDF for LabelPoints
 -

 Key: SPARK-6340
 URL: https://issues.apache.org/jira/browse/SPARK-6340
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
 Environment: python 2.7.8
 pyspark
 OS: Linux Mint 17 Qiana (Cinnamon 64-bit)
Reporter: Kian Ho
Priority: Minor
  Labels: feature

 as per: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528
 Having IDF.fit accept LabeledPoints would be useful since, correct me if 
 I'm wrong, there currently isn't a way of keeping track of which labels 
 belong to which documents if one needs to apply a conventional tf-idf 
 transformation on labelled text data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6247) Certain self joins cannot be analyzed

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6247:

Priority: Blocker  (was: Critical)

 Certain self joins cannot be analyzed
 -

 Key: SPARK-6247
 URL: https://issues.apache.org/jira/browse/SPARK-6247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Blocker

 When you try the following code
 {code}
 val df =
   (1 to 10)
     .map(i => (i, i.toDouble, i.toLong, i.toString, i.toString))
     .toDF("intCol", "doubleCol", "longCol", "stringCol1", "stringCol2")
 df.registerTempTable("test")
 sql(
   """
     |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol)
     |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1)
     |GROUP BY x.stringCol2
   """.stripMargin).explain()
 {code}
 The following exception will be thrown.
 {code}
 [info]   java.util.NoSuchElementException: next on empty iterator
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
 [info]   at 
 scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
 [info]   at scala.collection.IterableLike$class.head(IterableLike.scala:91)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
 [info]   at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
 [info]   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 [info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 [info]   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 [info]   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 [info]   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 [info]   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 [info]   at 
 scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
 [info]   at scala.collection.immutable.List.foldLeft(List.scala:84)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
 [info]   at 

[jira] [Updated] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6368:

Priority: Critical  (was: Major)

 Build a specialized serializer for Exchange operator. 
 --

 Key: SPARK-6368
 URL: https://issues.apache.org/jira/browse/SPARK-6368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Kryo is still pretty slow because it works on individual objects and is relatively 
 expensive to allocate. For the Exchange operator, because the schemas for the key and 
 value are already defined, we can create a specialized serializer to handle 
 the specific schemas of key and value. 
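 To make the intuition concrete, here is a small self-contained sketch (plain Scala and 
 java.io, not Spark's actual Exchange code; the {{KV}} schema is made up) of why knowing 
 the key/value schema up front helps: rows can be written positionally as raw primitives, 
 with none of the per-object class names or field tags a generic serializer pays for.
 {code}
 import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

 // Hypothetical fixed schema: key is an Int, value is (Long, Double).
 case class KV(key: Int, count: Long, sum: Double)

 def write(rows: Iterator[KV]): Array[Byte] = {
   val buffer = new ByteArrayOutputStream()
   val out = new DataOutputStream(buffer)
   // Each row is exactly 4 + 8 + 8 bytes, written in a fixed order with no type tags.
   rows.foreach { r =>
     out.writeInt(r.key)
     out.writeLong(r.count)
     out.writeDouble(r.sum)
   }
   out.flush()
   buffer.toByteArray
 }

 def read(bytes: Array[Byte]): Iterator[KV] = {
   val in = new DataInputStream(new ByteArrayInputStream(bytes))
   Iterator.fill(bytes.length / (4 + 8 + 8)) {
     KV(in.readInt(), in.readLong(), in.readDouble())
   }
 }

 // Round trip: read(write(Iterator(KV(1, 10L, 1.5)))).foreach(println)
 {code}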



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6367) Use the proper data type for those expressions that are hijacking existing data types.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6367:

Assignee: Yin Huai

 Use the proper data type for those expressions that are hijacking existing 
 data types.
 --

 Key: SPARK-6367
 URL: https://issues.apache.org/jira/browse/SPARK-6367
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai

 For the following expressions, the actual value type does not match the type 
 of our internal representation. 
 ApproxCountDistinctPartition
 NewSet
 AddItemToSet
 CombineSets
 CollectHashSet
 We should create UDTs for data types of these expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6200) Support dialect in SQL

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364297#comment-14364297
 ] 

Michael Armbrust commented on SPARK-6200:
-

I'll add that this seems to be mostly implemented already here: 
https://github.com/apache/spark/pull/4015

 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager, supporting a dialect command and adding new dialects 
 via SQL statements, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5563) LDA with online variational inference

2015-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364350#comment-14364350
 ] 

yuhao yang commented on SPARK-5563:
---

Matthew Willson. Thanks for the attention and idea. Apart from Gensim, 
vowpal-wabbit also has a distributed implementation provided by Matthew D. 
Hoffman, which seems to be amazingly fast. I'll refer to those libraries as 
much as possible. And suggestions are always welcome.

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [jira] [Created] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-16 Thread Sean Owen
What's the bug? Each element is sampled with probability 0.5. I think the
expected size is 14 but not all samples would be that size.
On Mar 17, 2015 12:12 AM, Marko Bonaci (JIRA) j...@apache.org wrote:

 Marko Bonaci created SPARK-6370:
 ---

  Summary: RDD sampling with replacement intermittently yields
 incorrect number of samples
  Key: SPARK-6370
  URL: https://issues.apache.org/jira/browse/SPARK-6370
  Project: Spark
   Issue Type: Bug
   Components: Spark Core
 Affects Versions: 1.2.1, 1.3.0
  Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
 Reporter: Marko Bonaci


 Here's the repl output:

 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46,
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)

 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at
 sample at <console>:27

 scala> swr.count
 res17: Long = 16

 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at
 sample at <console>:27

 scala> swr.count
 res18: Long = 8

 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at
 sample at <console>:27

 scala> swr.count
 res19: Long = 18

 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at
 sample at <console>:27

 scala> swr.count
 res20: Long = 15

 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at
 sample at <console>:27

 scala> swr.count
 res21: Long = 11

 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at
 sample at <console>:27

 scala> swr.count
 res22: Long = 10
 {code}



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)

 -
 To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
 For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Resolved] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6250.
-
Resolution: Won't Fix

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364353#comment-14364353
 ] 

Michael Armbrust commented on SPARK-6250:
-

We have confirmed that this does work if you escape the field names using 
backticks.  Since this is pretty standard, I'm going to close this as Won't Fix.  
If there is some case where this is not possible, please reopen with details.
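For readers hitting the same error, a small illustrative example of the escaping being 
described ({{events}} and its columns are made-up names):
{code}
// A table whose column is literally named after a type keyword.
// Backtick-escaping the identifier is what lets the parser accept it.
val df = sqlContext.sql("SELECT `timestamp`, name FROM events")
df.show()
{code}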

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Nitay Joffe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364417#comment-14364417
 ] 

Nitay Joffe commented on SPARK-6250:


The error is always the same: https://gist.github.com/nitay/8ba0efd739cf2e22ad23

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6200) Support dialect in SQL

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364264#comment-14364264
 ] 

Michael Armbrust commented on SPARK-6200:
-

Thank you for working on this.  I would like to support plug-able dialects, but 
I'm not sure I fully agree with the implementation.  In general it would be 
good to post a design on the JIRA and get agreement before doing too much 
implementation.

At a high level, I wonder if something much simpler would be sufficient.  I 
don't expect that users will spend a lot of time switching between dialects.  
Probably they will configure their preferred one in spark.defaults and never 
think about it again.  Additionally, we will still need {{SET 
spark.sql.dialect=}} as this is public API, so why not just extend that?

Basically I would propose we do the following.  Add a simple interface 
{{Dialect}} that takes a {{String}} and returns a {{LogicalPlan}} as you have 
done.  For the built-in ones you just say {{SET spark.sql.dialect=sql}} or 
{{SET spark.sql.dialect=hiveql}}.  For external ones you simply provide the 
fully qualified class name.  It would also be good to be clear in the interface 
what the contract is for DDL.  I would suggest that Spark SQL always parses its 
own DDL first and only defers to the dialect when the built-in DDL parser does 
not handle the given string.
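A rough sketch of the shape being proposed (the trait and method names here are 
illustrative only, not an agreed design):
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Contract: a dialect turns a SQL string into a logical plan.
trait Dialect {
  def parse(sql: String): LogicalPlan
}

// Hypothetical external dialect, selected via the existing public config, e.g.
//   SET spark.sql.dialect=sql                     (built-in Spark SQL parser)
//   SET spark.sql.dialect=hiveql                  (built-in HiveQL parser)
//   SET spark.sql.dialect=com.example.MyDialect   (fully qualified class name)
class MyDialect extends Dialect {
  // Per the suggested DDL contract, Spark SQL would try its own DDL parser first
  // and only hand strings to this method when that parser does not accept them.
  override def parse(sql: String): LogicalPlan =
    throw new UnsupportedOperationException("illustrative stub only")
}
{code}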

 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager, supporting a dialect command and adding new dialects 
 via SQL statements, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6109) Unit tests fail when compiled against Hive 0.12.0

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6109:

Target Version/s: 1.4.0  (was: 1.3.0)

 Unit tests fail when compiled against Hive 0.12.0
 -

 Key: SPARK-6109
 URL: https://issues.apache.org/jira/browse/SPARK-6109
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 Currently, Jenkins doesn't run unit tests against Hive 0.12.0, and several 
 Hive 0.13.1 specific test cases always fail against Hive 0.12.0. Need to 
 blacklist them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6199) Support CTE

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6199:

Target Version/s: 1.4.0  (was: 1.3.0)

 Support CTE
 ---

 Key: SPARK-6199
 URL: https://issues.apache.org/jira/browse/SPARK-6199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Support CTE in SQLContext and HiveContext



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5911:

Target Version/s: 1.4.0  (was: 1.3.0)

 Make Column.cast(to: String) support fixed precision and scale decimal type
 ---

 Key: SPARK-5911
 URL: https://issues.apache.org/jira/browse/SPARK-5911
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6366:

Priority: Blocker  (was: Major)

 In Python API, the default save mode for save and saveAsTable should be 
 error instead of append.
 

 Key: SPARK-6366
 URL: https://issues.apache.org/jira/browse/SPARK-6366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker

 If a user wants to append data, he/she should explicitly specify the save 
 mode. Also, in Scala and Java, the default save mode is ErrorIfExists.
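 A minimal Scala-side sketch for contrast (assuming the 1.3-era {{DataFrame.save}} 
 overloads; {{df}} is an existing DataFrame and the path is made up), showing the mode 
 being made explicit rather than relied on as a default:
 {code}
 import org.apache.spark.sql.SaveMode

 // Fail fast if the output already exists (the Scala/Java default behaviour).
 df.save("/tmp/out", "parquet", SaveMode.ErrorIfExists)

 // Appending only happens when the caller asks for it explicitly.
 df.save("/tmp/out", "parquet", SaveMode.Append)
 {code}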



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6247) Certain self joins cannot be analyzed

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-6247:
---

Assignee: Michael Armbrust

 Certain self joins cannot be analyzed
 -

 Key: SPARK-6247
 URL: https://issues.apache.org/jira/browse/SPARK-6247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Critical

 When you try the following code
 {code}
 val df =
   (1 to 10)
     .map(i => (i, i.toDouble, i.toLong, i.toString, i.toString))
     .toDF("intCol", "doubleCol", "longCol", "stringCol1", "stringCol2")
 df.registerTempTable("test")
 sql(
   """
     |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol)
     |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1)
     |GROUP BY x.stringCol2
   """.stripMargin).explain()
 {code}
 The following exception will be thrown.
 {code}
 [info]   java.util.NoSuchElementException: next on empty iterator
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
 [info]   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
 [info]   at 
 scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
 [info]   at scala.collection.IterableLike$class.head(IterableLike.scala:91)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
 [info]   at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
 [info]   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 [info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 [info]   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 [info]   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 [info]   at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 [info]   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 [info]   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 [info]   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 [info]   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197)
 [info]   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 [info]   at 
 scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
 [info]   at scala.collection.immutable.List.foldLeft(List.scala:84)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 [info]   at scala.collection.immutable.List.foreach(List.scala:318)
 [info]   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:1071)
 [info]   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
 [info]   at 

[jira] [Updated] (SPARK-6231) Join on two tables (generated from same one) is broken

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6231:

Priority: Blocker  (was: Critical)

 Join on two tables (generated from same one) is broken
 --

 Key: SPARK-6231
 URL: https://issues.apache.org/jira/browse/SPARK-6231
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Blocker
  Labels: DataFrame

 If the two columns used in the joinExpr come from the same table, they have the 
 same id, and the joinExpr is then resolved in the wrong way.
 {code}
 val df = sqlContext.load(path, "parquet")
 val txns = df.groupBy("cust_id").agg($"cust_id",
   countDistinct($"day_num").as("txns"))
 val spend = df.groupBy("cust_id").agg($"cust_id",
   sum($"extended_price").as("spend"))
 val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
 scala> rmJoin.explain
 == Physical Plan ==
 CartesianProduct
  Filter (cust_id#0 = cust_id#0)
   Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS 
 txns#7L]
Exchange (HashPartitioning [cust_id#0], 200)
 Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS 
 partialSets#25]
  PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at 
 newParquet.scala:542
  Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
   Exchange (HashPartitioning [cust_id#17], 200)
Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS 
 PartialSum#38]
 PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at 
 newParquet.scala:542
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6366:

Assignee: Yin Huai

 In Python API, the default save mode for save and saveAsTable should be 
 error instead of append.
 

 Key: SPARK-6366
 URL: https://issues.apache.org/jira/browse/SPARK-6366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai

 If a user wants to append data, he/she should explicitly specify the save 
 mode. Also, in Scala and Java, the default save mode is ErrorIfExists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364309#comment-14364309
 ] 

tanyinyan commented on SPARK-6348:
--

Yes, I use one-hot encoding before SVM, which is exactly what 'sparsed before SVM' 
means :)

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).
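 Until the flag is exposed, one possible workaround (a sketch only, not the proposed API 
 change; {{training}} is assumed to be an existing RDD[LabeledPoint] of one-hot encoded 
 features) is to standardize the features with MLlib's StandardScaler before calling 
 SVMWithSGD.train, and to apply the same scaler to any data you predict on:
 {code}
 import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.feature.StandardScaler
 import org.apache.spark.mllib.regression.LabeledPoint

 // withMean = false keeps the one-hot vectors sparse; withStd = true rescales them.
 val scaler = new StandardScaler(withMean = false, withStd = true)
   .fit(training.map(_.features))

 // Re-attach the labels to the scaled feature vectors.
 val scaled = training.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features))).cache()

 val model = SVMWithSGD.train(scaled, 100)

 // Remember to transform features with the same scaler before calling model.predict(...).
 {code}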



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364311#comment-14364311
 ] 

tanyinyan commented on SPARK-6348:
--

Yes, I use one-hot encoding before SVM, which is exactly what 'sparsed before SVM' 
means :)

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364313#comment-14364313
 ] 

tanyinyan commented on SPARK-6348:
--

Yes, I use one-hot encoding before SVM, which is exactly what 'sparsed before SVM' 
means :)

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364312#comment-14364312
 ] 

tanyinyan commented on SPARK-6348:
--

Yes, I use one-hot encoding before SVM, which is exactly what 'sparsed before SVM' 
means :)

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanyinyan updated SPARK-6348:
-
Comment: was deleted

(was: Yes,I use a one-hot encoding before SVM , which is the 'sparsed before 
SVM ' exactly means :))

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanyinyan updated SPARK-6348:
-
Comment: was deleted

(was: Yes,I use a one-hot encoding before SVM , which is the 'sparsed before 
SVM ' exactly means :))

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanyinyan updated SPARK-6348:
-
Comment: was deleted

(was: Yes,I use a one-hot encoding before SVM , which is the 'sparsed before 
SVM ' exactly means :))

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread tanyinyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanyinyan updated SPARK-6348:
-
Comment: was deleted

(was: Yes,I use a one-hot encoding before SVM , which is the 'sparsed before 
SVM ' exactly means :))

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided in the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a 
 dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the 
 first day's data (ignoring the fields id/device_id/device_ip; all remaining fields 
 are considered categorical variables, and sparsed before SVM) and predicting 
 on the same data with the threshold cleared; the predicted results are all 
 negative. When I set useFeatureScaling to true, the predictions are 
 normal (including both negative and positive results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference

2015-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364350#comment-14364350
 ] 

yuhao yang edited comment on SPARK-5563 at 3/17/15 1:13 AM:


Matthew Willson. Thanks for the attention and idea. Apart from Gensim, 
vowpal-wabbit also has a distributed implementation (C++) provided by Matthew 
D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as 
much as possible. And suggestions are always welcome.


was (Author: yuhaoyan):
Matthew Willson. Thanks for the attention and idea. Apart from Gensim, 
vowpal-wabbit also has a distributed implementation provided by Matthew D. 
Hoffman, which seems to be amazingly fast. I'll refer to those libraries as 
much as possible. And suggestions are always welcome.

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach. “Online Learning for 
 Latent Dirichlet Allocation.”  NIPS, 2010.  This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting Streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking in to the existing mllib.optimization frameworks.
 This will require some discussion about whether batch versions of online 
 variational inference should be supported, as well as what variational 
 approximation should be used now or in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Nitay Joffe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364414#comment-14364414
 ] 

Nitay Joffe commented on SPARK-6250:


Backticks don't work for me on existing data. For example, I work with a table 
that is structured like:

...
foo: struct<timestamp:bigint,timezone:string>
...

Selecting *, `foo`, `foo.timestamp`, foo.`timestamp` all don't work.
What am I doing wrong?




 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6375) Bad formatting in analysis errors

2015-03-16 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6375:
---

 Summary: Bad formatting in analysis errors
 Key: SPARK-6375
 URL: https://issues.apache.org/jira/browse/SPARK-6375
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical


{code}
[info]   org.apache.spark.sql.AnalysisException: Ambiguous references to str: 
(str#3,List()),(str#5,List());
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6377) Set the number of shuffle partitions automatically based on the size of input tables and the reduce-side operation.

2015-03-16 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6377:
---

 Summary: Set the number of shuffle partitions automatically based 
on the size of input tables and the reduce-side operation.
 Key: SPARK-6377
 URL: https://issues.apache.org/jira/browse/SPARK-6377
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai


It would be helpful to automatically set the number of shuffle partitions for an 
Exchange operator based on the size of its input tables and the operation at the 
reduce side.
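For reference, the knob that currently has to be tuned by hand for every workload, and 
which this proposal would set automatically (the value 400 is illustrative only):
{code}
// Today every Exchange gets its reduce-side parallelism from one static setting,
// whether the shuffle input is a few MB or hundreds of GB.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
// equivalently: sqlContext.sql("SET spark.sql.shuffle.partitions=400")
{code}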



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter

2015-03-16 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6369:
---

 Summary: InsertIntoHiveTable should use logic from 
SparkHadoopWriter
 Key: SPARK-6369
 URL: https://issues.apache.org/jira/browse/SPARK-6369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical


Right now it is possible that we will corrupt the output if there is a race 
between competing speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5451) And predicates are not properly pushed down

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5451:

Target Version/s: 1.4.0  (was: 1.3.0, 1.2.2)

 And predicates are not properly pushed down
 ---

 Key: SPARK-5451
 URL: https://issues.apache.org/jira/browse/SPARK-5451
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

 This issue is actually caused by PARQUET-173.
 The following {{spark-shell}} session can be used to reproduce this bug:
 {code}
 import org.apache.spark.sql.SQLContext
 val sqlContext = new SQLContext(sc)
 import sc._
 import sqlContext._
 case class KeyValue(key: Int, value: String)
 parallelize(1 to 1024 * 1024 * 20).
   flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))).
   saveAsParquetFile("large.parquet")
 parquetFile("large.parquet").registerTempTable("large")
 hadoopConfiguration.set("parquet.task.side.metadata", "false")
 sql("SET spark.sql.parquet.filterPushdown=true")
 sql("SELECT value FROM large WHERE 1024 < value AND value < 2048").collect()
 {code}
 From the log we can find:
 {code}
 There were no row groups that could be dropped due to filter predicates
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6200) Support dialect in SQL

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6200:

Target Version/s: 1.4.0  (was: 1.3.0)

 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager, supporting a dialect command and adding new dialects 
 via SQL statements, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5183) Document data source API

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-5183:
---

Assignee: Michael Armbrust

 Document data source API
 

 Key: SPARK-5183
 URL: https://issues.apache.org/jira/browse/SPARK-5183
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.3.0


 We need to document the data types the caller needs to support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364355#comment-14364355
 ] 

Sean Owen commented on SPARK-6370:
--

What's the bug? Each element is sampled with probability 0.5. I think the
expected size is 14 but not all samples would be that size.
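
To see why the counts bounce around, here is a small local sketch (plain Scala, independent 
of Spark) of what sampling with replacement at fraction 0.5 does: each of the 29 elements is 
drawn k times with k ~ Poisson(0.5), so the total sample size has mean 29 * 0.5 = 14.5 and 
varies from run to run rather than being fixed.
{code}
import scala.util.Random

// Knuth's method for drawing a Poisson(lambda)-distributed count.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = 1.0
  do {
    k += 1
    p *= rng.nextDouble()
  } while (p > limit)
  k - 1
}

val rng = new Random()
// Total sample size for 10 simulated runs over 29 elements at fraction 0.5.
val sizes = (1 to 10).map(_ => (1 to 29).map(_ => poisson(0.5, rng)).sum)
println(sizes.mkString(", "))  // values scatter around 14-15 and differ on every run
{code}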



 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples

2015-03-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364354#comment-14364354
 ] 

Sean Owen commented on SPARK-6370:
--

What's the bug? Each element is sampled with probability 0.5. I think the
expected size is 14 but not all samples would be that size.



 RDD sampling with replacement intermittently yields incorrect number of 
 samples
 ---

 Key: SPARK-6370
 URL: https://issues.apache.org/jira/browse/SPARK-6370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4
Reporter: Marko Bonaci
  Labels: PoissonSampler, sample, sampler

 Here's the repl output:
 {code:java}
 scala> uniqueIds.collect
 res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 
 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10)
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample 
 at <console>:27
 scala> swr.count
 res17: Long = 16
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample 
 at <console>:27
 scala> swr.count
 res18: Long = 8
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample 
 at <console>:27
 scala> swr.count
 res19: Long = 18
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample 
 at <console>:27
 scala> swr.count
 res20: Long = 15
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample 
 at <console>:27
 scala> swr.count
 res21: Long = 11
 scala> val swr = uniqueIds.sample(true, 0.5)
 swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample 
 at <console>:27
 scala> swr.count
 res22: Long = 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Nitay Joffe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364367#comment-14364367
 ] 

Nitay Joffe commented on SPARK-6250:


Is it hard to fix this? It seems to me it would be a pretty straightforward thing 
to do. It's not always easy to change the underlying data model, for example when 
working with an existing Hive metastore not under your control.

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364373#comment-14364373
 ] 

Michael Armbrust commented on SPARK-6250:
-

I'm not suggesting you change your data model, just that if you are going to 
name your columns after data types you escape them with backticks (as you must 
do when using any reserved word or non-standard characters).  When interacting 
with the Hive metastore I would expect all required escaping to happen 
automatically.  Please let me know if you have a counter-example.

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Nitay Joffe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364375#comment-14364375
 ] 

Nitay Joffe commented on SPARK-6250:


How would I do a select * against existing Hive metastore tables that have types 
in their column names?

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker








[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD

2015-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364408#comment-14364408
 ] 

Apache Spark commented on SPARK-6348:
-

User 'tanyinyan' has created a pull request for this issue:
https://github.com/apache/spark/pull/5055

 Enable useFeatureScaling in SVMWithSGD
 --

 Key: SPARK-6348
 URL: https://issues.apache.org/jira/browse/SPARK-6348
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 Currently, useFeatureScaling is set to false by default in class 
 GeneralizedLinearAlgorithm, and it is only enabled in 
 LogisticRegressionWithLBFGS.
 SVMWithSGD is a private class; its train methods are provided by the SVMWithSGD 
 object. So there is no way to set useFeatureScaling when using SVM.
 I am using SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), 
 training on the first day's data (ignoring the fields id/device_id/device_ip; 
 all remaining fields are treated as categorical variables and converted to 
 sparse features before SVM) and predicting on the same data with the threshold 
 cleared. The predictions are all negative. When I set useFeatureScaling to true, 
 the predictions are normal (including both negative and positive results).
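 A possible workaround until useFeatureScaling is exposed is to standardize the 
 features manually with StandardScaler before training; a sketch, assuming an 
 RDD[LabeledPoint] named training:
 {code}
 import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.feature.StandardScaler
 import org.apache.spark.mllib.regression.LabeledPoint

 // Scale to unit standard deviation (withMean = false keeps sparse vectors sparse).
 val scaler = new StandardScaler(withMean = false, withStd = true)
   .fit(training.map(_.features))
 val scaled = training.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()
 val model = SVMWithSGD.train(scaled, 100)
 {code}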






[jira] [Created] (SPARK-6371) Update version to 1.4.0-SNAPSHOT

2015-03-16 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-6371:
-

 Summary: Update version to 1.4.0-SNAPSHOT
 Key: SPARK-6371
 URL: https://issues.apache.org/jira/browse/SPARK-6371
 Project: Spark
  Issue Type: Task
  Components: Build
Reporter: Marcelo Vanzin
Priority: Critical


See summary.






[jira] [Commented] (SPARK-6372) spark-submit --conf is not being propagated to child processes

2015-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1436#comment-1436
 ] 

Apache Spark commented on SPARK-6372:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5057

 spark-submit --conf is not being propagated to child processes
 

 Key: SPARK-6372
 URL: https://issues.apache.org/jira/browse/SPARK-6372
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
Priority: Blocker

 Thanks to [~irashid] for bringing this up. It seems that the new launcher 
 library is incorrectly handling --conf and not passing it down to the child 
 processes. Fix is simple, PR coming up.






[jira] [Commented] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm

2015-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364471#comment-14364471
 ] 

Apache Spark commented on SPARK-6374:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/5058

 Add getter for GeneralizedLinearAlgorithm
 -

 Key: SPARK-6374
 URL: https://issues.apache.org/jira/browse/SPARK-6374
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 I find it would be better to have getters for numFeatures and addIntercept 
 within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to 
 get the values through a debugger.






[jira] [Created] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm

2015-03-16 Thread yuhao yang (JIRA)
yuhao yang created SPARK-6374:
-

 Summary: Add getter for GeneralizedLinearAlgorithm
 Key: SPARK-6374
 URL: https://issues.apache.org/jira/browse/SPARK-6374
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Priority: Minor


I find it would be better to have getters for numFeatures and addIntercept 
within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to 
get the values through a debugger.






[jira] [Updated] (SPARK-6330) newParquetRelation gets incorrect FileSystem

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6330:

Target Version/s: 1.3.1

 newParquetRelation gets incorrect FileSystem
 

 Key: SPARK-6330
 URL: https://issues.apache.org/jira/browse/SPARK-6330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
 Fix For: 1.4.0, 1.3.1


 Here's a snippet from newParquet.scala:
 def refresh(): Unit = {
   val fs = FileSystem.get(sparkContext.hadoopConfiguration)
   // Support either reading a collection of raw Parquet part-files, or a
   // collection of folders containing Parquet files (e.g. partitioned Parquet table).
   val baseStatuses = paths.distinct.map { p =>
     val qualified = fs.makeQualified(new Path(p))
     if (!fs.exists(qualified) && maybeSchema.isDefined) {
       fs.mkdirs(qualified)
       prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration)
     }
     fs.getFileStatus(qualified)
   }.toArray
 If we are running this locally and path points to S3, fs would be incorrect. 
 A fix is to construct fs for each file separately.
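 A sketch of that idea (simplified, omitting the mkdirs/prepareMetadata branch): 
 resolve the FileSystem from each path instead of from the default scheme.
 {code}
 import org.apache.hadoop.fs.Path

 val hadoopConf = sparkContext.hadoopConfiguration
 val baseStatuses = paths.distinct.map { p =>
   val rawPath = new Path(p)
   val fs = rawPath.getFileSystem(hadoopConf)  // per-path FileSystem (e.g. S3 vs. HDFS)
   fs.getFileStatus(fs.makeQualified(rawPath))
 }.toArray
 {code}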






[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)

2015-03-16 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363767#comment-14363767
 ] 

Manoj Kumar edited comment on SPARK-6192 at 3/17/15 3:16 AM:
-

[~mengxr] Google Summer of Code applications are open today. I have submitted 
my proposal here, 
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/manojkumar/5654792596619264

It would be great if you could provide feedback, register as a mentor, and do 
the needful as described here 
(https://community.apache.org/mentee-ranking-process.html). Thanks!


was (Author: mechcoder):
[~mengxr] Google Summer of Code applications are open today. I have submitted 
my proposal here, 
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/manojkumar/5654792596619264

It would be great if you could register as a mentor, and do the needful as 
described here (https://community.apache.org/mentee-ranking-process.html). 
Thanks!

 Enhance MLlib's Python API (GSoC 2015)
 --

 Key: SPARK-6192
 URL: https://issues.apache.org/jira/browse/SPARK-6192
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
  Labels: gsoc, gsoc2015, mentor

 This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme 
 is to enhance MLlib's Python API, to make it on par with the Scala/Java API. 
 The main tasks are:
 1. For all models in MLlib, provide save/load method. This also
 includes save/load in Scala.
 2. Python API for evaluation metrics.
 3. Python API for streaming ML algorithms.
 4. Python API for distributed linear algebra.
 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use
 customized serialization, making MLLibPythonAPI hard to maintain. It
 would be nice to use the DataFrames for serialization.
 I'll link the JIRAs for each of the tasks.
 Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. 
 The TODO list will be dynamic based on the backlog.






[jira] [Commented] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364490#comment-14364490
 ] 

Apache Spark commented on SPARK-5068:
-

User 'lazyman500' has created a pull request for this issue:
https://github.com/apache/spark/pull/5059

 When the path not found in the hdfs,we can't get the result
 ---

 Key: SPARK-5068
 URL: https://issues.apache.org/jira/browse/SPARK-5068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn

 when a partition path is found in the metastore but not found in HDFS, it 
 will cause some problems, as follows:
 {noformat}
 hive> show partitions partition_test;
 OK
 dt=1
 dt=2
 dt=3
 dt=4
 Time taken: 0.168 seconds, Fetched: 4 row(s)
 {noformat}
 {noformat}
 hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
 Found 3 items
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=1
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=3
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
 /user/jeanlyn/warehouse/partition_test/dt=4
 {noformat}
 when I run the sql
 {noformat}
 select * from partition_test limit 10
 {noformat} in *hive*, I get no problem, but when I run it in *spark-sql* I get 
 the error as follows:
 {noformat}
 Exception in thread main org.apache.hadoop.mapred.InvalidInputException: 
 Input path does not exist: 
 hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
 at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
 at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
 at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
 at org.apache.spark.sql.hive.testpartition.main(test.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}
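 One way to sidestep the exception (a sketch of the idea only, not necessarily 
 the fix in the pull request) is to drop partition locations that no longer 
 exist on HDFS before building the RDD; partitionPaths and hadoopConf are 
 hypothetical names:
 {code}
 import org.apache.hadoop.fs.{FileSystem, Path}

 // Keep only the partition directories that actually exist in the file system.
 val fs = FileSystem.get(hadoopConf)
 val existingPaths = partitionPaths.filter(p => fs.exists(new Path(p)))
 {code}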




[jira] [Comment Edited] (SPARK-6340) mllib.IDF for LabelPoints

2015-03-16 Thread Kian Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364246#comment-14364246
 ] 

Kian Ho edited comment on SPARK-6340 at 3/17/15 12:01 AM:
--

Hi Joseph,

I initially considered that as a solution, however it was my understanding that 
you couldn't guarantee the same ordering between the instances pre- and post- 
transformations (since the transformations will be distributed across worker 
nodes). Hence, you may end up with features that will be zipped with labels 
they weren't originally assigned. Is this correct? This question was also 
mentioned by a couple of users in that thread.

Thanks


was (Author: kian.ho):
Hi Joseph,

I initially considered that as a solution, however it was my understanding that 
you couldn't guarantee the same ordering between the instances pre- and post- 
transformations (since the transformations will be distributed across worker 
nodes). Is this correct? This question was also mentioned by a couple of users 
in that thread.

Thanks

 mllib.IDF for LabelPoints
 -

 Key: SPARK-6340
 URL: https://issues.apache.org/jira/browse/SPARK-6340
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
 Environment: python 2.7.8
 pyspark
 OS: Linux Mint 17 Qiana (Cinnamon 64-bit)
Reporter: Kian Ho
Priority: Minor
  Labels: feature

 as per: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528
 Having the IDF.fit accept LabelPoints would be useful since, correct me if 
 i'm wrong, there currently isn't a way of keeping track of which labels 
 belong to which documents if one needs to apply a conventional tf-idf 
 transformation on labelled text data.
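 The workaround discussed in that thread is to compute TF-IDF separately and zip 
 the result back with the labels; a sketch, assuming an RDD of 
 (label, tokenizedDocument) pairs named docs (whether zip preserves the original 
 pairing is exactly the ordering question raised above):
 {code}
 import org.apache.spark.mllib.feature.{HashingTF, IDF}
 import org.apache.spark.mllib.regression.LabeledPoint

 val labels = docs.map(_._1)
 val tf = new HashingTF().transform(docs.map(_._2))
 tf.cache()
 val tfidf = new IDF().fit(tf).transform(tf)
 // Relies on both RDDs having the same partitioning and per-partition order.
 val labeled = labels.zip(tfidf).map { case (label, vector) => LabeledPoint(label, vector) }
 {code}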






[jira] [Commented] (SPARK-6340) mllib.IDF for LabelPoints

2015-03-16 Thread Kian Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364321#comment-14364321
 ] 

Kian Ho commented on SPARK-6340:


I appreciate the swift response! happy to keep this issue closed.

 mllib.IDF for LabelPoints
 -

 Key: SPARK-6340
 URL: https://issues.apache.org/jira/browse/SPARK-6340
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
 Environment: python 2.7.8
 pyspark
 OS: Linux Mint 17 Qiana (Cinnamon 64-bit)
Reporter: Kian Ho
Priority: Minor
  Labels: feature

 as per: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528
 Having the IDF.fit accept LabelPoints would be useful since, correct me if 
 i'm wrong, there currently isn't a way of keeping track of which labels 
 belong to which documents if one needs to apply a conventional tf-idf 
 transformation on labelled text data.






[jira] [Updated] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-03-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6368:

Assignee: Yin Huai

 Build a specialized serializer for Exchange operator. 
 --

 Key: SPARK-6368
 URL: https://issues.apache.org/jira/browse/SPARK-6368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical

 Kryo is still pretty slow because it works on individual objects and is 
 relatively expensive to allocate. For the Exchange operator, because the 
 schemas for key and value are already defined, we can create a specialized 
 serializer to handle those specific schemas.






[jira] [Commented] (SPARK-6293) SQLContext.implicits should provide automatic conversion for RDD[Row]

2015-03-16 Thread Chen Song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364340#comment-14364340
 ] 

Chen Song commented on SPARK-6293:
--

OK, I have created a pull request https://github.com/apache/spark/pull/5040. 
I'm the user KAKA1992.

 SQLContext.implicits should provide automatic conversion for RDD[Row]
 -

 Key: SPARK-6293
 URL: https://issues.apache.org/jira/browse/SPARK-6293
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 When a DataFrame is converted to an RDD[Row], it should be easier to convert 
 it back to a DataFrame via toDF.  E.g.:
 {code}
 val df: DataFrame = myRDD.toDF("col1", "col2")  // This works for types like 
 RDD[scala.Tuple2[...]]
 val splits = df.rdd.randomSplit(...)
 val split0: RDD[Row] = splits(0)
 val df0 = split0.toDF("col1", "col2") // This fails
 {code}
 The failure happens because SQLContext.implicits does not provide an 
 automatic conversion for Rows.  (It does handle Products, but Row does not 
 implement Product.)
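 As a workaround until such an implicit exists, the DataFrame can be rebuilt 
 from the RDD[Row] plus the original schema; a sketch using the names from the 
 snippet above (plus a SQLContext named sqlContext):
 {code}
 val df0 = sqlContext.createDataFrame(split0, df.schema)  // RDD[Row] + StructType => DataFrame
 {code}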






[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.

2015-03-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364382#comment-14364382
 ] 

Yin Huai commented on SPARK-6250:
-

[~nitay] Have you tried backticks? Does it work?

 Types are now reserved words in DDL parser.
 ---

 Key: SPARK-6250
 URL: https://issues.apache.org/jira/browse/SPARK-6250
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker








[jira] [Commented] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file

2015-03-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363232#comment-14363232
 ] 

Sean Owen commented on SPARK-6355:
--

Oh, I learned something then. Yeah that looks like the intended behavior and 
this should work. Maybe it is just not applied in standalone mode for the app 
jar.

 Spark standalone cluster does not support local:/ url for jar file
 --

 Key: SPARK-6355
 URL: https://issues.apache.org/jira/browse/SPARK-6355
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
Reporter: Jesper Lundgren

 Submitting a new spark application to a standalone cluster with local:/path 
 will result in an exception.
 Driver successfully submitted as driver-20150316171157-0004
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20150316171157-0004 is ERROR
 Exception from cluster was: java.io.IOException: No FileSystem for scheme: 
 local
 java.io.IOException: No FileSystem for scheme: local
   at 
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141)
   at 
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)






[jira] [Comment Edited] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file

2015-03-16 Thread Jesper Lundgren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363227#comment-14363227
 ] 

Jesper Lundgren edited comment on SPARK-6355 at 3/16/15 2:14 PM:
-

[~srowen] Thank you for your reply. 
I use spark-submit --class class.Main local:/application.jar . 
https://spark.apache.org/docs/1.2.1/submitting-applications.html under 
Advanced Dependency Management mentions local:/ can be used when a jar is 
pre-distributed instead of uploading using the built in file server. Maybe I am 
misunderstanding but I believe it is meant to work for the main application jar 
as well as for --jars config option.

I am running a standalone cluster with ZooKeeper HA and have on occasion had 
crashes on restart due to the Spark file server being unavailable to 
distribute the jar to the worker nodes (I can't reliably reproduce this yet). I 
intended to use local:/ as a fix, but it seems this option does not work in a 
standalone cluster.


was (Author: koudelka):
[~srowen] spark-submit --class class.Main local:/application.jar . 
https://spark.apache.org/docs/1.2.1/submitting-applications.html under 
Advanced Dependency Management mentions local:/ can be used when a jar is 
pre-distributed instead of uploading using the built in file server. Maybe I am 
misunderstanding but I believe it is meant to work for the main application jar 
as well as for --jars config option.

I am running standalone cluster with Zookeeper HA and have on occasion had 
problem crashing on restart due to the spark fileserver being unavailable to 
distribute the jar to the worker nodes (I can't reliably reproduce this yet). I 
intended to use local:/ as a fix but seems this option does not work in 
standalone cluster.

 Spark standalone cluster does not support local:/ url for jar file
 --

 Key: SPARK-6355
 URL: https://issues.apache.org/jira/browse/SPARK-6355
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
Reporter: Jesper Lundgren

 Submitting a new spark application to a standalone cluster with local:/path 
 will result in an exception.
 Driver successfully submitted as driver-20150316171157-0004
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20150316171157-0004 is ERROR
 Exception from cluster was: java.io.IOException: No FileSystem for scheme: 
 local
 java.io.IOException: No FileSystem for scheme: local
   at 
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141)
   at 
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)






[jira] [Comment Edited] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file

2015-03-16 Thread Jesper Lundgren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363242#comment-14363242
 ] 

Jesper Lundgren edited comment on SPARK-6355 at 3/16/15 2:23 PM:
-

I haven't tried to see if --jars is working properly. There is (or at least 
used to be) a similar issue with the spark-submit --master option. The Spark 
standalone cluster documentation says that multiple master hosts can be 
provided as a comma-separated list like spark://host1:port1,host2:port2, but 
it did not work (at least in 1.2.0). It does work when setting the 
master url within the driver code while starting the SparkContext. I am 
wondering if I can do a similar workaround for this issue.


was (Author: koudelka):
I haven't tried to see if --jars is working properly. There is (or at least 
used to be) an similar issue with spark-submit --master option. The spark 
standalone cluster documentation says that multiple master hosts can be 
provided as a comma separated list like spark://host1:port1,host2:port2 but 
it did not work (at least in  1.2.0 ). But it does work when setting the 
master url within the driver code starting the SparkContext. I am wondering if 
I can do a similar workaround for this issue.

 Spark standalone cluster does not support local:/ url for jar file
 --

 Key: SPARK-6355
 URL: https://issues.apache.org/jira/browse/SPARK-6355
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
Reporter: Jesper Lundgren

 Submitting a new spark application to a standalone cluster with local:/path 
 will result in an exception.
 Driver successfully submitted as driver-20150316171157-0004
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20150316171157-0004 is ERROR
 Exception from cluster was: java.io.IOException: No FileSystem for scheme: 
 local
 java.io.IOException: No FileSystem for scheme: local
   at 
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141)
   at 
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)






[jira] [Commented] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file

2015-03-16 Thread Jesper Lundgren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363227#comment-14363227
 ] 

Jesper Lundgren commented on SPARK-6355:


[~srowen] spark-submit --class class.Main local:/application.jar . 
https://spark.apache.org/docs/1.2.1/submitting-applications.html under 
Advanced Dependency Management mentions local:/ can be used when a jar is 
pre-distributed instead of uploading using the built in file server. Maybe I am 
misunderstanding but I believe it is meant to work for the main application jar 
as well as for --jars config option.

I am running a standalone cluster with ZooKeeper HA and have on occasion had 
crashes on restart due to the Spark file server being unavailable to 
distribute the jar to the worker nodes (I can't reliably reproduce this yet). I 
intended to use local:/ as a fix, but it seems this option does not work in a 
standalone cluster.

 Spark standalone cluster does not support local:/ url for jar file
 --

 Key: SPARK-6355
 URL: https://issues.apache.org/jira/browse/SPARK-6355
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
Reporter: Jesper Lundgren

 Submitting a new spark application to a standalone cluster with local:/path 
 will result in an exception.
 Driver successfully submitted as driver-20150316171157-0004
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20150316171157-0004 is ERROR
 Exception from cluster was: java.io.IOException: No FileSystem for scheme: 
 local
 java.io.IOException: No FileSystem for scheme: local
   at 
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141)
   at 
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)






[jira] [Commented] (SPARK-3278) Isotonic regression

2015-03-16 Thread Martin Zapletal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362986#comment-14362986
 ] 

Martin Zapletal commented on SPARK-3278:


Vladimir,

just to update you on the progress. I was able to complete the isotonic 
regression with 100M records, but failed with an insufficient-memory error at 
150M records on my machine. You may be able to run with larger amounts of data 
on better machines.

The Pool Adjacent Violators algorithm can theoretically have linear time 
complexity, but although I have used the best algorithm I could find, I am not 
convinced it reaches this efficiency. I will work on providing evidence.

The biggest issue with the current algorithm, however, is the parallelization 
approach. Its properties are unfortunately nowhere near linear scalability 
(linear solution-time increase with linear parallelism increase, or constant 
solution time with linear parallelism increase and linear problem-size 
increase). This was expected and is caused by the algorithm itself, for the 
following reasons:

1) The algorithm works in two steps. First the computation is distributed to 
all partitions, but the results are then gathered and the algorithm is run 
again on the whole data set. This approach may leave most of the work for the 
last sequential step, thus gaining very little compared to a purely sequential 
implementation, or even performing worse. That can happen in cases where the 
parallel isotonic regressions return a locally optimal solution that 
nevertheless has to change for the global solution in the last step. Another 
performance drawback in comparison to sequential processing is the potential 
need to copy data to each process.
2) It requires the whole dataset to fit into one process's memory in the last 
step (or repeated disk access).

I started looking into the issue and was able to design an iterative algorithm 
that addressed both of the above issues and performed very close to linear 
scalability. However, it still has correctness (rounding) issues and will 
require further research.

Let me know if that helped. In the meantime I will continue working on 
benchmarks and performance quantification of the current algorithm as well as 
on research for potentially more efficient solutions.
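For readers unfamiliar with the structure being described, here is a minimal 
sketch of the two-step scheme (not the actual MLlib implementation), assuming a 
hypothetical input: RDD[(Double, Double, Double)] of (label, feature, weight) 
triples:

{code}
// Pool Adjacent Violators: merge adjacent blocks whose weighted means violate monotonicity.
// Returns one (pooledValue, startFeature, totalWeight) entry per pooled block.
def pav(points: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
  val blocks = scala.collection.mutable.ArrayBuffer.empty[(Double, Double, Double)]
  for ((y, x, w) <- points) {
    blocks += ((y, x, w))
    while (blocks.length > 1 && blocks(blocks.length - 2)._1 > blocks.last._1) {
      val (y2, _, w2) = blocks.remove(blocks.length - 1)
      val (y1, x1, w1) = blocks.remove(blocks.length - 1)
      blocks += (((y1 * w1 + y2 * w2) / (w1 + w2), x1, w1 + w2))
    }
  }
  blocks.toArray
}

// Step 1: run PAV independently inside each partition (sorted by feature).
// Step 2: gather the partial blocks and run PAV once more sequentially on the driver,
// which is the step whose cost and memory footprint are discussed above.
val partial = input.sortBy(_._2)
  .mapPartitions(it => pav(it.toArray).iterator)
  .collect()
  .sortBy(_._2)
val finalBlocks = pav(partial)
{code}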

 Isotonic regression
 ---

 Key: SPARK-3278
 URL: https://issues.apache.org/jira/browse/SPARK-3278
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Martin Zapletal
 Fix For: 1.3.0


 Add isotonic regression for score calibration.






[jira] [Commented] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.

2015-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363239#comment-14363239
 ] 

Apache Spark commented on SPARK-6299:
-

User 'swkimme' has created a pull request for this issue:
https://github.com/apache/spark/pull/5046

 ClassNotFoundException in standalone mode when running groupByKey with class 
 defined in REPL.
 -

 Key: SPARK-6299
 URL: https://issues.apache.org/jira/browse/SPARK-6299
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0, 1.2.1
Reporter: Kevin (Sangwoo) Kim

 Anyone can reproduce this issue with the code below
 (it runs well in local mode, but throws an exception on clusters;
 it runs well in Spark 1.1.1)
 case class ClassA(value: String)
 val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2"))))
 rdd.groupByKey.collect
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 
 in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 
 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
 java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:274)
 at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
 at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
 at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 

[jira] [Comment Edited] (SPARK-3278) Isotonic regression

2015-03-16 Thread Martin Zapletal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362986#comment-14362986
 ] 

Martin Zapletal edited comment on SPARK-3278 at 3/16/15 9:37 AM:
-

Vladimir,

just to update you on the progress. I was able to complete the isotonic 
regression with 100M records (similar data to what you requested, using NASDAQ 
index prices), but failed with an insufficient-memory error at 150M records on 
my machine. You may be able to run with larger amounts of data on better 
machines.

The Pool Adjacent Violators algorithm can theoretically have linear time 
complexity, but although I have used the best algorithm I could find, I am not 
convinced it reaches this efficiency. I will work on providing evidence.

The biggest issue with the current algorithm, however, is the parallelization 
approach. Its properties are unfortunately nowhere near linear scalability 
(linear solution-time increase with linear parallelism increase, or constant 
solution time with linear parallelism increase and linear problem-size 
increase). This was expected and is caused by the algorithm itself, for the 
following reasons:

1) The algorithm works in two steps. First the computation is distributed to 
all partitions, but the results are then gathered and the algorithm is run 
again on the whole data set. This approach may leave most of the work for the 
last sequential step, thus gaining very little compared to a purely sequential 
implementation, or even performing worse. That can happen in cases where the 
parallel isotonic regressions return a locally optimal solution that 
nevertheless has to change for the global solution in the last step. Another 
performance drawback in comparison to sequential processing is the potential 
need to copy data to each process.
2) It requires the whole dataset to fit into one process's memory in the last 
step (or repeated disk access).

I started looking into the issue and was able to design an iterative algorithm 
that addressed both of the above issues and performed very close to linear 
scalability. However, it still has correctness (rounding) issues and will 
require further research.

Let me know if that helped. In the meantime I will continue working on 
benchmarks and performance quantification of the current algorithm as well as 
on research for potentially more efficient solutions.


was (Author: zapletal-martin):
Vladimir,

just to update you on the progress. I was able to complete the isotonic 
regression with 100M records, but failed with insufficient memory error with 
150M records on my machine. You may be able to run with larger amounts of data 
on better machines. 

Pool adjacent violators algorithm can theoretically have linear time 
complexity, but although I have used the best algorithm I could find I am not 
convinced it reaches this efficiency. I will work on providing evidence.

The biggest issue with the current algorithm is however with the 
parallelization approach. Its properties are unfortunately nowhere near linear 
scalability (linear solution time increase with linear parallelism increase or 
constant solution time with linear parallelism increase and linear problem size 
increase). This was expected and is caused by the algorithm itself for the 
following reasons

1) The algorithm works in two steps. First the computation is distributed to 
all partitions, but then gathered and the algorithm is run again on the whole 
data set. This approach may leave most of work for the last sequential step and 
thus gaining very little compared to purely sequential implementation or even 
performing worse. That can happen in case where parallel isotonic regressions 
return a locally optimal solution that will however have to change for a global 
solution in the last step. Another performance drawback in comparison to 
sequential processing is the potential need to copy data to each process.
2) It requires the whole dataset to fit into one process’ memory in the last 
step (or repeated disk access).

I started looking into the issue and was able to design an iterative algorithm 
that adressed both the above issues and performed very close to linear 
scalability. It however still has correctness (rounding) issues and will 
require further research.

Let me know if that helped. In the meantime I will continue working on 
benchmarks and performance quantification of the current algorithm as well as 
on research for potentially more efficient solutions.

 Isotonic regression
 ---

 Key: SPARK-3278
 URL: https://issues.apache.org/jira/browse/SPARK-3278
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Martin Zapletal
 Fix For: 1.3.0


 Add isotonic regression for score calibration.




[jira] [Resolved] (SPARK-6300) sc.addFile(path) does not support the relative path.

2015-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6300.
--
   Resolution: Fixed
Fix Version/s: 1.3.1
   1.4.0

Issue resolved by pull request 4993
[https://github.com/apache/spark/pull/4993]

 sc.addFile(path) does not support the relative path.
 

 Key: SPARK-6300
 URL: https://issues.apache.org/jira/browse/SPARK-6300
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1
Reporter: DoingDone9
Assignee: DoingDone9
Priority: Critical
 Fix For: 1.4.0, 1.3.1


 when I run a command like sc.addFile("../test.txt"), it does not work and 
 throws an exception
 java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
 path in absolute URI: file:../test.txt
 at org.apache.hadoop.fs.Path.initialize(Path.java:206)
 at org.apache.hadoop.fs.Path.<init>(Path.java:172) 
 
 ...
 Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
 file:../test.txt
 at java.net.URI.checkPath(URI.java:1804)
 at java.net.URI.<init>(URI.java:752)
 at org.apache.hadoop.fs.Path.initialize(Path.java:203)






[jira] [Commented] (SPARK-6316) add a parameter for SparkContext(conf).textFile() method , support for multi-language hdfs file , e.g. gbk

2015-03-16 Thread yunzhi.lyz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363161#comment-14363161
 ] 

yunzhi.lyz commented on SPARK-6316:
---

I gave it a try.
Code example for reading a file with a non-UTF-8 encoding:

sc.hadoopFile("/inputdir", classOf[TextInputFormat], classOf[LongWritable], 
classOf[Text], 5).map(pair => new String(pair._2.getBytes(), 0, 
pair._2.getLength(), "gbk"))

Code example for writing a file with a non-UTF-8 encoding:

file.map(x => (NullWritable.get(), new 
Text(String.valueOf(x).getBytes("gbk")))).saveAsHadoopFile[TextOutputFormat[NullWritable,
 Text]]("/output")

This resolves the question of sc.textFile and rdd.saveAsTextFile not supporting 
non-UTF-8 encodings.


 add a parameter for  SparkContext(conf).textFile() method , support for 
 multi-language  hdfs file ,   e.g. gbk
 

 Key: SPARK-6316
 URL: https://issues.apache.org/jira/browse/SPARK-6316
 Project: Spark
  Issue Type: New Feature
 Environment: linux   
 LANG=en_US.UTF-8
Reporter: yunzhi.lyz

 Add a parameter to the SparkContext(conf).textFile() method to support HDFS 
 files in other encodings, e.g. gbk.
 e.g. val file = new SparkContext(conf).textFile(args(0), 10, "gbk")
 Modify the code in org.apache.spark.SparkContext:
 +  def defaultEncoding: String = "utf-8"
 --  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
       hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
         minPartitions).map(pair => pair._2.toString).setName(path)
     }
 ++  def textFile(path: String, minPartitions: Int = defaultMinPartitions,
       encoding: String = defaultEncoding): RDD[String] = {
       hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
         minPartitions).map(pair => new String(pair._2.getBytes(), 0,
         pair._2.getLength(), encoding)).setName(path)
     }


 






[jira] [Created] (SPARK-6356) Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext

2015-03-16 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-6356:


 Summary: Support the ROLLUP/CUBE/GROUPING SETS/grouping() in 
SQLContext
 Key: SPARK-6356
 URL: https://issues.apache.org/jira/browse/SPARK-6356
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yadong Qi


Support for the expression below:
```
GROUP BY expression list WITH ROLLUP
GROUP BY expression list WITH CUBE
GROUP BY expression list GROUPING SETS (expression list2)
```
And
```
GROUP BY ROLLUP(expression list)
GROUP BY CUBE(expression list)
GROUP BY expression list GROUPING SETS(expression list2)
```
And
```
GROUPING (expression list)
```







[jira] [Updated] (SPARK-6356) Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext

2015-03-16 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-6356:
-
Description: 
Support for the expression below:
```
GROUP BY expression list WITH ROLLUP
GROUP BY expression list WITH CUBE
GROUP BY expression list GROUPING SETS (expression list2)
```
And
```
GROUP BY ROLLUP(expression list)
GROUP BY CUBE(expression list)
GROUP BY expression list GROUPING SETS(expression list2)
```
And
```
GROUPING (expression list)
```


  was:
Support for the expression below:
`
GROUP BY expression list WITH ROLLUP
GROUP BY expression list WITH CUBE
GROUP BY expression list GROUPING SETS (expression list2)
`
And
```
GROUP BY ROLLUP(expression list)
GROUP BY CUBE(expression list)
GROUP BY expression list GROUPING SETS(expression list2)
```
And
```
GROUPING (expression list)
```



 Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext
 --

 Key: SPARK-6356
 URL: https://issues.apache.org/jira/browse/SPARK-6356
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yadong Qi

 Support for the expression below:
 ```
 GROUP BY expression list WITH ROLLUP
 GROUP BY expression list WITH CUBE
 GROUP BY expression list GROUPING SETS (expression list2)
 ```
 And
 ```
 GROUP BY ROLLUP(expression list)
 GROUP BY CUBE(expression list)
 GROUP BY expression list GROUPING SETS(expression list2)
 ```
 And
 ```
 GROUPING (expression list)
 ```






[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2015-03-16 Thread Lior Chaga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362996#comment-14362996
 ] 

Lior Chaga commented on SPARK-6305:
---

This works by adding the log4j 2.x jars together with the log4j-1.2-api bridge 
to the classpath via SPARK_CLASSPATH. 
There is no need to change the Spark distribution. Closed the pull request.

 Add support for log4j 2.x to Spark
 --

 Key: SPARK-6305
 URL: https://issues.apache.org/jira/browse/SPARK-6305
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Tal Sliwowicz

 log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
 classpath. Since there are shaded jars, it must be done during the build.





