[jira] [Commented] (SPARK-11385) Add foreach API to MLLib's vector API

2015-10-30 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982070#comment-14982070
 ] 

DB Tsai commented on SPARK-11385:
-

++Sean.

I think `foreach` doesn't make sense for sparse vectors. In fact, I thought 
we were going to make `foreachActive` a public API in this PR, since it has 
been widely used and is stable.
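
For reference, a minimal sketch of what a public {{foreachActive}} looks like from 
user code, assuming Spark's MLlib Vectors API (an illustration, not code from the PR):

{code}
import org.apache.spark.mllib.linalg.Vectors

// A sparse vector of size 5 with non-zero entries at indices 1 and 3.
val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))

// foreachActive visits only the explicitly stored (active) entries, which is
// why a plain foreach over every index is a poor fit for sparse vectors.
v.foreachActive { (index, value) =>
  println(s"index=$index value=$value")
}
{code}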

> Add foreach API to MLLib's vector API
> -
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Add a foreach API to MLLib's vector.






[jira] [Commented] (SPARK-11138) Flaky pyspark test: test_add_py_file

2015-10-30 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982135#comment-14982135
 ] 

holdenk commented on SPARK-11138:
-

Makes sense, I guess I'll wait for it to fail again and grab the build logs 
before they get cleaned up :)

> Flaky pyspark test: test_add_py_file
> 
>
> Key: SPARK-11138
> URL: https://issues.apache.org/jira/browse/SPARK-11138
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>  Labels: flaky-test
>
> This test fails pretty often when running PR tests. For example:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console
> {noformat}
> ==
> ERROR: test_add_py_file (__main__.AddFileTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 396, in test_add_py_file
> res = self.sc.parallelize(range(2)).map(func).first()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1315, in first
> rs = self.take(1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1297, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py",
>  line 923, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
> self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> format(target_id, '.', name), value)
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 
> in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 
> (TID 7, localhost): org.apache.spark.api.python.PythonException: Traceback 
> (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 111, in main
> process()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 263, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1293, in takeUpToNumLeft
> yield next(iterator)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 388, in func
> from userlibrary import UserClass
> ImportError: cannot import name UserClass
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1414)
>   at 
> 

[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-30 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981981#comment-14981981
 ] 

Saif Addin Ellafi commented on SPARK-11330:
---

I can no longer reproduce the issue on 1.5.2-RC1 with the same steps. I compared 
against 1.5.1 back and forth in the same local environment. I could not test 
cluster mode yet, but will try if possible.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used the KRYO serializer when storing this parquet.
> NOTE2: If persist is applied after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in a database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation; the result 
> was that the data was inconsistent, returning different results on every call.
> Reducing the operation step by step, I narrowed it down to this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103






[jira] [Resolved] (SPARK-10342) Cooperative memory management

2015-10-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10342.

Resolution: Fixed

> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> We have had memory starvation problems for a long time; they became worse in 
> 1.5 since we use larger pages.
> In order to increase memory utilization (reduce unnecessary spilling) and also 
> reduce the risk of OOM, we should manage memory in a cooperative way: every 
> memory consumer should also be responsive to requests from others to release 
> memory (by spilling).
> Memory requests can differ: hard requirements (the task will crash if the 
> memory is not allocated) or soft requirements (performance degrades if it is 
> not allocated). The costs of spilling also differ. We could introduce some 
> kind of priority to make the consumers work together better.
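
For illustration only, a toy sketch of the cooperative idea described above 
(hypothetical trait and class names, not Spark's actual memory-management code): 
every consumer can be asked to spill when someone else needs memory.

{code}
// Hypothetical sketch of cooperative memory management: each consumer reports
// how much memory it holds and can be asked to release some of it by spilling.
trait CooperativeConsumer {
  // Bytes this consumer currently holds.
  def memoryUsed: Long
  // Release up to numBytes by spilling to disk; returns bytes actually freed.
  def spill(numBytes: Long): Long
}

class CooperativeMemoryManager(var free: Long) {
  private val consumers = scala.collection.mutable.ArrayBuffer.empty[CooperativeConsumer]

  def register(c: CooperativeConsumer): Unit = consumers += c

  // Try to satisfy a request; if there is not enough free memory, ask the
  // other consumers (largest first) to spill until the request can be served.
  def acquire(numBytes: Long, requester: CooperativeConsumer): Long = {
    var needed = numBytes - free
    val others = consumers.filter(_ ne requester).sortBy(-_.memoryUsed)
    for (c <- others if needed > 0) {
      val released = c.spill(needed)
      free += released
      needed -= released
    }
    val granted = math.min(numBytes, free)
    free -= granted
    granted
  }
}
{code}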






[jira] [Resolved] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-10-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10309.

Resolution: Fixed

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> *=== Update ===*
> This is caused by a mismatch between 
> `Runtime.getRuntime.availableProcessors()` and the number of active tasks in 
> `ShuffleMemoryManager`. A quick reproduction is the following:
> {code}
> // My machine only has 8 cores
> $ bin/spark-shell --master local[32]
> scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
> scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()
> Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:120)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50)
> {code}
> *=== Original ===*
> While running Q53 of TPCDS (scale = 1500) on a 24-node cluster (12 GB memory 
> per executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.






[jira] [Comment Edited] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-30 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982030#comment-14982030
 ] 

Jacek Lewandowski edited comment on SPARK-11402 at 10/30/15 7:18 AM:
-

That's right, this is not only the introduction of pluggable executor and driver 
runners. However, at least some of the changes are related to coding practices 
which I believe are good:
- {{WorkerPage}} uses neither {{ExecutorRunner}} nor {{DriverRunner}}, but 
traits which expose only the methods needed to read information
- There is functionality related to executor and driver runners which is 
common to both of them. By introducing interfaces I made this explicit: from a 
scheduling point of view, both executors and drivers are services which 
consume resources, so as far as scheduling is concerned they should be 
generalised and handled uniformly
- The number of arguments passed to the runner constructors had grown to 
ridiculous numbers, so the related settings were collected in dedicated case 
classes: those which represent the worker configuration and those which 
represent the process configuration
- Fixed an inconsistency: the driver's main directory was created inside the 
driver runner while the executor directory was created in the worker; this was 
unified

Do you think the aforementioned changes are wrong? I don't believe they 
increase the maintenance and support effort, because the refactoring I proposed 
is towards better code reuse, separation of concerns and encapsulation. The 
ability to plug in a custom runner implementation is just a small part of this 
whole patch (and AFAIK using factories is also a good practice).



was (Author: jlewandowski):
That's right, this is not only introduction of pluggable executor and driver 
runners. However, at least some of the changes are related to coding practises, 
which i believe are good:
- {{WorkerPage}} does use neither {{ExecutorRunner}} nor {{DriverRunner}}, but 
traits which expose only the methods to read information
- There is a functionality related to executor and driver runners which is 
common to both of them. By introducing interfaces I made this explicit - from 
scheduling point of view, either executor or driver are some services which 
consume resources so as far as scheduling is concerned, they should be 
generalised and handled in uniformly
- the number of arguments which were passed to runner constructors grew to 
ridiculous numbers so the related settings were collected in dedicates case 
classes - those which represent the worker configuration and those which 
represent the process configuration
- fixed some inconsistency - the driver main directory was created inside the 
driver runner while executor directory was created in the worker - this was 
unified

Do you think the aforementioned changes are wrong? I don't believe they 
increase the maintenance and support effort because the refactoring i proposed 
is towards better code reuse, separation of concerns and encapsulation. The 
ability to plug a custom runner implementation is just a small part of this 
whole patch (and afaik using factories is also a good practice).


> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 
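
A minimal sketch of that idea; the trait, class, and configuration key below are 
hypothetical illustrations, not the names used in the actual patch:

{code}
// Hypothetical factory abstraction: the Worker reads an implementation class
// name from its configuration and instantiates it via reflection, falling
// back to a default runner factory.
trait ExecutorRunnerFactory {
  def create(appId: String, execId: Int, command: Seq[String]): Runnable
}

class DefaultExecutorRunnerFactory extends ExecutorRunnerFactory {
  override def create(appId: String, execId: Int, command: Seq[String]): Runnable =
    new Runnable {
      def run(): Unit =
        println(s"launching executor $execId for $appId: ${command.mkString(" ")}")
    }
}

// Inside the Worker, something along these lines:
val factoryClassName = sys.props.getOrElse(
  "spark.worker.executorRunnerFactory",            // hypothetical config key
  classOf[DefaultExecutorRunnerFactory].getName)
val factory = Class.forName(factoryClassName)
  .newInstance().asInstanceOf[ExecutorRunnerFactory]
{code}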






[jira] [Updated] (SPARK-11385) Add foreach API to MLLib's vector API

2015-10-30 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-11385:

Assignee: Nakul Jindal

> Add foreach API to MLLib's vector API
> -
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: Nakul Jindal
>Priority: Minor
>
> Add a foreach API to MLLib's vector.






[jira] [Commented] (SPARK-11385) Make foreachActive public in MLLib's vector API

2015-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982090#comment-14982090
 ] 

Sean Owen commented on SPARK-11385:
---

OK, the new description makes more sense, yes.

> Make foreachActive public in MLLib's vector API
> ---
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: Nakul Jindal
>Priority: Minor
>
> Make foreachActive public in MLLib's vector API






[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2015-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982145#comment-14982145
 ] 

Sean Owen commented on SPARK-11416:
---

That's good, but it doesn't mean it doesn't break other things. I assume tests 
pass, and it may happen to continue to work for you. But an upgrade this 
significant needs more study: what dependencies changed, if any? What behavior 
changed? That is, have a look through the release notes and make that argument.

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Priority: Trivial
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  






[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2015-10-30 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981966#comment-14981966
 ] 

Alan Braithwaite commented on SPARK-3630:
-

Of course!  I can provide more information tomorrow, but my experience is mostly 
anecdotal.  That is, I was using the default partitioner when I encountered this 
issue, and when I switched to the hash partitioner it went away.

This is what I remember off the top of my head:

Spark 1.5
Mesos 0.23.1

I don't have the stacktrace on me, but I remember it started the same as this 
above:

{code}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
{code}

I'll set a reminder to get the rest of this to you tomorrow!
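
For reference, assuming "default partitioner (sort)" refers to the sort-based 
shuffle manager, the workaround being described is a one-line configuration 
change. A sketch for Spark 1.x, where the hash-based shuffle manager still exists:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch: switch from the default sort-based shuffle manager to the
// hash-based one (Spark 1.x only), keeping Kryo serialization enabled.
val conf = new SparkConf()
  .setAppName("kryo-snappy-workaround")
  .setMaster("local[*]")                   // or whatever master you use
  .set("spark.shuffle.manager", "hash")    // default in Spark 1.5 is "sort"
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
{code}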

> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.






[jira] [Comment Edited] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2015-10-30 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981966#comment-14981966
 ] 

Alan Braithwaite edited comment on SPARK-3630 at 10/30/15 6:08 AM:
---

Of course!  I can provide more information tomorrow, but my experience is mostly 
anecdotal.  That is, I was using the default partitioner (sort) when I 
encountered this issue, and when I switched to the hash partitioner it went 
away.  Snappy was used in both cases, AFAICT.

This is what I remember off the top of my head:

Spark 1.5
Mesos 0.23.1

I don't have the stacktrace on me, but I remember it started the same as this 
above:

{code}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
{code}

I'll set a reminder to get the rest of this to you tomorrow!


was (Author: abraithwaite):
Of course!  I can provide more information tomorrow but my experience is mostly 
anecdotal.  That is, I was using the default partitioner and encountered this 
issue and when I switched to the hash partitioner it went away.

This is what I remember off the top of my head:

Spark 1.5
Mesos 0.23.1

I don't have the stacktrace on me, but I remember it started the same as this 
above:

{code}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
uncompress the chunk: PARSING_ERROR(2)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
{code}

I'll set a reminder to get the rest of this to you tomorrow!

> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.






[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981968#comment-14981968
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

Some tests are now broken; investigating.

The fix in SPARK-6785 seems off to me.

With it, 1 second before epoch and 1 second after epoch are 1 day apart. This 
should not be true: they should both be equivalently far (in days) from epoch 0.
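
To make the boundary case concrete, a small standalone sketch (a local 
re-implementation of the quoted {{millisToDays}} for illustration, not a call 
into Spark) showing how one second before and after epoch land on different days 
under UTC but on the same day in a zone west of UTC:

{code}
import java.util.TimeZone

val MILLIS_PER_DAY = 86400000L

// Local re-implementation of the quoted millisToDays, for illustration only.
def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
  val millisLocal = millisUtc + tz.getOffset(millisUtc)
  math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
}

val utc = TimeZone.getTimeZone("UTC")
millisToDays(-1000L, utc)  // -1: one second before epoch is day -1
millisToDays(1000L, utc)   //  0: one second after epoch is day 0

val pst = TimeZone.getTimeZone("America/Los_Angeles")
millisToDays(-1000L, pst)  // -1
millisToDays(1000L, pst)   // -1: same day once the -8h offset is applied
{code}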

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one-day errors and could cause significant shifts in 
> data if the underlying data is worked on in timezones other than UTC.






[jira] [Updated] (SPARK-11416) Upgrade kryo package to version 3.0

2015-10-30 Thread Hitoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitoshi Ozawa updated SPARK-11416:
--
Description: Would like to have Apache Spark upgrade kryo package from 2.x 
(current) to 3.x.(was: Would like to have Apache Spark upgrade kyro package 
from 2.x (current) to 3.x.  )
Summary: Upgrade kryo package to version 3.0  (was: Upgrade kyro 
package to version 3.0)

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Priority: Trivial
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  






[jira] [Closed] (SPARK-11386) Refactor appropriate uses of Vector to use the new foreach API

2015-10-30 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-11386.
---
Resolution: Fixed

> Refactor appropriate uses of Vector to use the new foreach API
> --
>
> Key: SPARK-11386
> URL: https://issues.apache.org/jira/browse/SPARK-11386
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Once SPARK-11385 (Add foreach API to MLLib's vector API) is in, look for 
> places where it should be used internally.






[jira] [Assigned] (SPARK-11419) WriteAheadLog recovery improvements for when closeFileAfterWrite is enabled

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11419:


Assignee: Apache Spark

> WriteAheadLog recovery improvements for when closeFileAfterWrite is enabled
> ---
>
> Key: SPARK-11419
> URL: https://issues.apache.org/jira/browse/SPARK-11419
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Support for closing WriteAheadLog files after writes was just merged in. 
> Closing every file after a write is a very expensive operation, as it creates 
> many small files on S3. It's not necessary to enable it on HDFS anyway.
> However, when you have many small files on S3, recovery takes a very long 
> time. We can parallelize the recovery process.
> In addition, files start stacking up pretty quickly and deletes may not be 
> able to keep up, so we should add support for parallelizing the deletes as well.
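
A minimal sketch of the parallel-recovery part (a hypothetical helper, not the 
actual patch): read the many small segment files concurrently instead of one by one.

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical: recover many small write-ahead-log segments in parallel, since
// sequential reads of thousands of tiny S3 objects dominate recovery time.
def recoverSegments(paths: Seq[String], parallelism: Int = 16): Seq[Array[Byte]] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = paths.map { p =>
      Future {
        // Placeholder for the real "read one segment" call (an HDFS/S3 read in Spark).
        java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(p))
      }
    }
    Await.result(Future.sequence(futures), 10.minutes)
  } finally {
    pool.shutdown()
  }
}
{code}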






[jira] [Assigned] (SPARK-11419) WriteAheadLog recovery improvements for when closeFileAfterWrite is enabled

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11419:


Assignee: (was: Apache Spark)

> WriteAheadLog recovery improvements for when closeFileAfterWrite is enabled
> ---
>
> Key: SPARK-11419
> URL: https://issues.apache.org/jira/browse/SPARK-11419
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> Support for closing WriteAheadLog files after writes was just merged in. 
> Closing every file after a write is a very expensive operation, as it creates 
> many small files on S3. It's not necessary to enable it on HDFS anyway.
> However, when you have many small files on S3, recovery takes a very long 
> time. We can parallelize the recovery process.
> In addition, files start stacking up pretty quickly and deletes may not be 
> able to keep up, so we should add support for parallelizing the deletes as well.






[jira] [Commented] (SPARK-11419) WriteAheadLog recovery improvements for when closeFileAfterWrite is enabled

2015-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981975#comment-14981975
 ] 

Apache Spark commented on SPARK-11419:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/9373

> WriteAheadLog recovery improvements for when closeFileAfterWrite is enabled
> ---
>
> Key: SPARK-11419
> URL: https://issues.apache.org/jira/browse/SPARK-11419
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> Support for closing WriteAheadLog files after writes was just merged in. 
> Closing every file after a write is a very expensive operation, as it creates 
> many small files on S3. It's not necessary to enable it on HDFS anyway.
> However, when you have many small files on S3, recovery takes a very long 
> time. We can parallelize the recovery process.
> In addition, files start stacking up pretty quickly and deletes may not be 
> able to keep up, so we should add support for parallelizing the deletes as well.






[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981968#comment-14981968
 ] 

Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 6:20 AM:
-

Some tests are now broken; investigating.

The fix in SPARK-6785 seems off to me.

With it, 1 second before epoch and 1 second after epoch are 1 day apart. This 
should not be true: they should both be equivalently far (in days) from epoch 0.

Actually, I'm not sure about this now...


was (Author: rspitzer):
Some tests are now, broken investigating 

The fix in SPARK-6785 seems to be off to me

In it 1 second before epoch and 1 second after epoch are 1 Day apart. This 
should not be true. They should both be equivelently far (in days) from epoch 0

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one-day errors and could cause significant shifts in 
> data if the underlying data is worked on in timezones other than UTC.






[jira] [Commented] (SPARK-11385) Add foreach API to MLLib's vector API

2015-10-30 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982073#comment-14982073
 ] 

holdenk commented on SPARK-11385:
-

Ah, my misunderstanding then.

> Add foreach API to MLLib's vector API
> -
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: Nakul Jindal
>Priority: Minor
>
> Add a foreach API to MLLib's vector.






[jira] [Updated] (SPARK-11385) Make foreachActive public in MLLib's vector API

2015-10-30 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-11385:

Description: Make foreachActive public in MLLib's vector API  (was: Add a 
foreach API to MLLib's vector.)

> Make foreachActive public in MLLib's vector API
> ---
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: Nakul Jindal
>Priority: Minor
>
> Make foreachActive public in MLLib's vector API






[jira] [Commented] (SPARK-11138) Flaky pyspark test: test_add_py_file

2015-10-30 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982103#comment-14982103
 ] 

holdenk commented on SPARK-11138:
-

So with a non-logged-in view I don't see it on 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/ (for 
example). I remember there being build artifacts at one point, but I'm not 
sure whether they aren't public anymore or I just don't remember where to look 
for them.

> Flaky pyspark test: test_add_py_file
> 
>
> Key: SPARK-11138
> URL: https://issues.apache.org/jira/browse/SPARK-11138
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>  Labels: flaky-test
>
> This test fails pretty often when running PR tests. For example:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console
> {noformat}
> ==
> ERROR: test_add_py_file (__main__.AddFileTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 396, in test_add_py_file
> res = self.sc.parallelize(range(2)).map(func).first()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1315, in first
> rs = self.take(1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1297, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py",
>  line 923, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
> self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> format(target_id, '.', name), value)
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 
> in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 
> (TID 7, localhost): org.apache.spark.api.python.PythonException: Traceback 
> (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 111, in main
> process()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 263, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1293, in takeUpToNumLeft
> yield next(iterator)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 388, in func
> from userlibrary import UserClass
> ImportError: cannot import name UserClass
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at 

[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2015-10-30 Thread Hitoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982171#comment-14982171
 ] 

Hitoshi Ozawa commented on SPARK-11416:
---

OK. Will do more checks and come back when we have more information to provide.

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Priority: Trivial
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  






[jira] [Commented] (SPARK-11421) Add the ability to add a jar to the current class loader

2015-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982037#comment-14982037
 ] 

Apache Spark commented on SPARK-11421:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/9313

> Add the ability to add a jar to the current class loader
> 
>
> Key: SPARK-11421
> URL: https://issues.apache.org/jira/browse/SPARK-11421
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> addJar adds jars for future operations, but it could also add them to the 
> current class loader. This would be really useful in Python & R especially, 
> where some included Python code may wish to add some jars.
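
For illustration, a small sketch of the generic JVM technique this would rely on 
(not Spark's implementation): making a jar visible to the current class loader 
via reflection.

{code}
import java.net.{URL, URLClassLoader}

// Generic JVM trick: URLClassLoader.addURL is protected, so call it via
// reflection to make a jar added at runtime visible to the running driver.
def addJarToClassLoader(jarPath: String, loader: URLClassLoader): Unit = {
  val addUrl = classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
  addUrl.setAccessible(true)
  addUrl.invoke(loader, new java.io.File(jarPath).toURI.toURL)
}
{code}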






[jira] [Assigned] (SPARK-11421) Add the ability to add a jar to the current class loader

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11421:


Assignee: (was: Apache Spark)

> Add the ability to add a jar to the current class loader
> 
>
> Key: SPARK-11421
> URL: https://issues.apache.org/jira/browse/SPARK-11421
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> addJar adds jars for future operations, but it could also add them to the 
> current class loader. This would be really useful in Python & R especially, 
> where some included Python code may wish to add some jars.






[jira] [Assigned] (SPARK-11421) Add the ability to add a jar to the current class loader

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11421:


Assignee: Apache Spark

> Add the ability to add a jar to the current class loader
> 
>
> Key: SPARK-11421
> URL: https://issues.apache.org/jira/browse/SPARK-11421
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>
> addJar adds jars for future operations, but it could also add them to the 
> current class loader. This would be really useful in Python & R especially, 
> where some included Python code may wish to add some jars.






[jira] [Commented] (SPARK-10962) DataFrame "except" method...

2015-10-30 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982160#comment-14982160
 ] 

Herman van Hovell commented on SPARK-10962:
---

Do you want to know which rows have duplicates? If you do, you can also use a 
window function for this. For example:
{noformat}
// Assumes a spark-shell style environment; otherwise also:
// import sqlContext.implicits._
import org.apache.spark.sql.functions.{count, lit, rowNumber}
import org.apache.spark.sql.expressions.Window

// Create a dataset.
val duplicateDf = sqlContext.range(1 << 20)
  .select(
    ($"id" - ($"id" % 2)).as("grp1"),
    ($"id" - ($"id" % 3)).as("grp2"))

// Count unique records.
duplicateDf.distinct.count // res1: Long = 699051

// Count using window functions: keep one row per (grp1, grp2) group and
// carry the group size along in a "count" column.
val window = Window.partitionBy($"grp1", $"grp2").orderBy($"grp1", $"grp2")
val deDuplicatedDf = duplicateDf
  .select(
    $"*",
    rowNumber().over(window).as("selector"),
    count(lit(1)).over(window).as("count"))
  .filter($"selector" === lit(1))

// Count unique records with the window function approach.
deDuplicatedDf.count // res2: Long = 699051
{noformat}





> DataFrame "except" method...
> 
>
> Key: SPARK-10962
> URL: https://issues.apache.org/jira/browse/SPARK-10962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Abhijit Deb
>Priority: Critical
>
> We are trying to find the duplicates in a DataFrame. We first get the uniques 
> and then try to get the duplicates using "except". While getting the uniques 
> is quite fast, getting the duplicates using "except" is tremendously slow. 
> What is the best way to get the duplicates? Getting just the uniques is not 
> sufficient in most use cases.
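
As a sketch of an alternative that avoids {{except}} entirely (illustrative only; 
assumes a DataFrame {{df}}, hypothetical key columns, and spark-shell style 
implicits for {{$}}):

{code}
// Group on the key columns and keep the keys that occur more than once.
// This yields the duplicated keys in a single aggregation, without diffing
// the full data against its distinct rows.
val keyCols = Seq("grp1", "grp2")            // hypothetical key columns
val duplicateKeys = df
  .groupBy(keyCols.head, keyCols.tail: _*)
  .count()
  .filter($"count" > 1)

duplicateKeys.show()
{code}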






[jira] [Created] (SPARK-11420) Changing Stddev support with Imperative Aggregate

2015-10-30 Thread Jihong MA (JIRA)
Jihong MA created SPARK-11420:
-

 Summary: Changing Stddev support with Imperative Aggregate
 Key: SPARK-11420
 URL: https://issues.apache.org/jira/browse/SPARK-11420
 Project: Spark
  Issue Type: Improvement
  Components: ML, SQL
Reporter: Jihong MA


Based on the performance comparison of Declarative vs. Imperative Aggregate 
(SPARK-10953), switch stddev to an Imperative aggregate.
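
For context, an imperative aggregate keeps a small mutable buffer that is updated 
per row and merged across partitions. A standalone sketch of that computation 
pattern for stddev, using the Welford/Chan update formulas (not the actual Spark 
ImperativeAggregate code):

{code}
// Welford's online algorithm: update a running (count, mean, M2) buffer per row,
// merge buffers across partitions, then finish with sqrt(M2 / (count - 1)).
final class StddevBuffer(var count: Long = 0L, var mean: Double = 0.0, var m2: Double = 0.0) {
  def update(x: Double): Unit = {
    count += 1
    val delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
  }
  def merge(other: StddevBuffer): Unit = {
    val delta = other.mean - mean
    val total = count + other.count
    if (total > 0) {
      mean = (count * mean + other.count * other.mean) / total
      m2 += other.m2 + delta * delta * count * other.count / total
    }
    count = total
  }
  // Sample standard deviation.
  def result: Double = if (count > 1) math.sqrt(m2 / (count - 1)) else Double.NaN
}
{code}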






[jira] [Updated] (SPARK-11420) Changing Stddev support with Imperative Aggregate

2015-10-30 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-11420:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-10384

> Changing Stddev support with Imperative Aggregate
> -
>
> Key: SPARK-11420
> URL: https://issues.apache.org/jira/browse/SPARK-11420
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Based on the performance comparison of Declarative vs. Imperative Aggregate 
> (SPARK-10953), switch stddev to an Imperative aggregate.






[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982024#comment-14982024
 ] 

Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 7:14 AM:
-

I added another commit to fix up the tests. The test code will now run 
identically no matter what time zone it happens to be run in (even PDT).


was (Author: rspitzer):
I added another commit to fix up the tests. The test code will now run 
identically no matter what time zone it happens to be run in (even UTC).

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one-day errors and could cause significant shifts in 
> data if the underlying data is worked on in timezones other than UTC.






[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982024#comment-14982024
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

I added another commit to fix up the tests. The test code will now run 
identically no matter what time zone it happens to be run in (even UTC).

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974653#comment-14974653
 ] 

Jean-Baptiste Onofré edited comment on SPARK-11193 at 10/30/15 6:13 AM:


Thanks for the update. I'm on vacation until Wednesday. I will resume my 
investigation/fix on this next Thursday. 


was (Author: jbonofre):
Hanks for thé update. I'm in vacation up to Wednesday. I will resume my 
investigation/fix on this next Thursday. 

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com
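
For context, the cast fails because a plain {{mutable.HashMap}} does not carry the 
{{SynchronizedMap}} mixin. A hedged sketch of one way to sidestep the mixin entirely 
(not necessarily the fix that was merged) is to back the receiver's bookkeeping with a 
{{java.util.concurrent}} map:

{code}
// Illustrative only; the receiver's real key/value types differ.
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

val ranges = new ConcurrentHashMap[String, String]().asScala  // thread-safe scala.collection.concurrent.Map
ranges.put("block-0", "seqNumberRange-0")
println(ranges.get("block-0"))                                // Some(seqNumberRange-0)
{code}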



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981972#comment-14981972
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

Oh OK, so it makes sense ;)

Let me fix that. Thanks, guys!

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10929) Tungsten fails to acquire memory writing to HDFS

2015-10-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10929.

  Resolution: Fixed
Assignee: Davies Liu
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0  (was: 1.5.2, 1.6.0)

> Tungsten fails to acquire memory writing to HDFS
> 
>
> Key: SPARK-10929
> URL: https://issues.apache.org/jira/browse/SPARK-10929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Naden Franciscus
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0
>
>
> We are executing 20 Spark SQL jobs in parallel using Spark Job Server and 
> hitting the following issue pretty routinely.
> 40GB heap x 6 nodes. Have tried adjusting shuffle.memoryFraction from 0.2 -> 
> 0.1 with no difference. 
> {code}
> .16): org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:250)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to acquire 16777216 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:83)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:82)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.collect(TraversableLike.scala:278)
> at scala.collection.AbstractTraversable.collect(Traversable.scala:105)
> at 
> org.apache.spark.rdd.ZippedPartitionsBaseRDD.tryPrepareParents(ZippedPartitionsRDD.scala:82)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:97)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> {code}
> I have tried setting spark.buffer.pageSize to both 1MB and 64MB (in 
> spark-defaults.conf) and it makes no difference.
> It also tries to acquire 33554432 bytes of memory in both cases.
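
For anyone tuning the same knobs, a hedged example of setting them programmatically 
(values are illustrative, not recommendations, and mirror the settings discussed above):

{code}
import org.apache.spark.SparkConf

// Equivalent to entries in spark-defaults.conf; adjust values for your workload.
val conf = new SparkConf()
  .set("spark.buffer.pageSize", "64m")
  .set("spark.shuffle.memoryFraction", "0.1")
{code}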



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-30 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982030#comment-14982030
 ] 

Jacek Lewandowski commented on SPARK-11402:
---

That's right, this is not only the introduction of pluggable executor and driver 
runners. However, at least some of the changes are related to coding practices, 
which I believe are good:
- {{WorkerPage}} uses neither {{ExecutorRunner}} nor {{DriverRunner}}, but 
traits which expose only the methods to read information
- There is functionality related to executor and driver runners which is 
common to both of them. By introducing interfaces I made this explicit - from a 
scheduling point of view, both executors and drivers are services which 
consume resources, so as far as scheduling is concerned, they should be 
generalised and handled uniformly
- The number of arguments passed to the runner constructors grew to 
ridiculous numbers, so the related settings were collected in dedicated case 
classes - those which represent the worker configuration and those which 
represent the process configuration
- Fixed some inconsistency - the driver main directory was created inside the 
driver runner while the executor directory was created in the worker; this was 
unified

Do you think the aforementioned changes are wrong? I don't believe they 
increase the maintenance and support effort, because the refactoring I proposed 
is towards better code reuse, separation of concerns and encapsulation. The 
ability to plug in a custom runner implementation is just a small part of this 
whole patch (and as far as I know, using factories is also a good practice).


> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 
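
A hedged sketch of what the proposed pluggability could look like; the names below are 
hypothetical stand-ins, not the actual patch:

{code}
// Hypothetical factory interface: the Worker would look up an implementation class name
// from its configuration and instantiate it, falling back to the default implementation.
trait DriverRunnerFactory {
  def create(workerConf: Map[String, String]): Runnable
}

class DefaultDriverRunnerFactory extends DriverRunnerFactory {
  override def create(workerConf: Map[String, String]): Runnable = new Runnable {
    override def run(): Unit = println("launching driver process")  // placeholder body
  }
}

// In a real Worker this would be created reflectively from a configured class name, e.g.
//   Class.forName(conf("spark.worker.driverRunnerFactory")).newInstance()   // hypothetical key
val factory: DriverRunnerFactory = new DefaultDriverRunnerFactory
factory.create(Map.empty).run()
{code}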



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11138) Flaky pyspark test: test_add_py_file

2015-10-30 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982087#comment-14982087
 ] 

Marcelo Vanzin commented on SPARK-11138:


The logs should be available on the "Build Artifacts" for each build; unless 
the python logs end up somewhere else and are not being collected by jenkins.

> Flaky pyspark test: test_add_py_file
> 
>
> Key: SPARK-11138
> URL: https://issues.apache.org/jira/browse/SPARK-11138
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>  Labels: flaky-test
>
> This test fails pretty often when running PR tests. For example:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console
> {noformat}
> ==
> ERROR: test_add_py_file (__main__.AddFileTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 396, in test_add_py_file
> res = self.sc.parallelize(range(2)).map(func).first()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1315, in first
> rs = self.take(1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1297, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py",
>  line 923, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
> self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> format(target_id, '.', name), value)
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 
> in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 
> (TID 7, localhost): org.apache.spark.api.python.PythonException: Traceback 
> (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 111, in main
> process()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 263, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1293, in takeUpToNumLeft
> yield next(iterator)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 388, in func
> from userlibrary import UserClass
> ImportError: cannot import name UserClass
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> 

[jira] [Commented] (SPARK-10946) JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs

2015-10-30 Thread somil deshmukh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982138#comment-14982138
 ] 

somil deshmukh commented on SPARK-10946:


I would like to work on this. Please assign it to me.


> JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate 
> for DDLs
> --
>
> Key: SPARK-10946
> URL: https://issues.apache.org/jira/browse/SPARK-10946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.1
>Reporter: Pallavi Priyadarshini
>Priority: Minor
>
> Certain DataFrame APIs invoke DDLs such as CREATE TABLE and DROP TABLE under 
> the covers. Current code in DataFrameWriter and JDBCUtils uses 
> PreparedStatement.executeUpdate to issue the DDLs to the DBs. This causes the 
> DDLs to fail against a couple of databases that do not support preparing DDLs.
> Can we use Statement.executeUpdate instead of 
> PreparedStatement.executeUpdate? DDL is not a repetitive activity, so there 
> shouldn't be a performance impact.
> I can submit a PULL request if no one has objections.
> Thanks.
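
A hedged sketch of the proposed change at a call site, assuming a plain JDBC connection 
(the URL and DDL text are illustrative only):

{code}
import java.sql.{Connection, DriverManager, Statement}

val conn: Connection = DriverManager.getConnection("jdbc:h2:mem:demo")  // hypothetical URL
try {
  // Instead of conn.prepareStatement(ddl).executeUpdate(), issue the DDL directly:
  val stmt: Statement = conn.createStatement()
  try {
    stmt.executeUpdate("CREATE TABLE demo (key INT, val VARCHAR(32))")
  } finally {
    stmt.close()
  }
} finally {
  conn.close()
}
{code}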



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10946) JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs

2015-10-30 Thread somil deshmukh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982140#comment-14982140
 ] 

somil deshmukh commented on SPARK-10946:


The pull request is:

https://github.com/somideshmukh/spark-master-1.5.1/commit/820f8e2b3c9a75b42234d7be6b799a7d8ed059de

> JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate 
> for DDLs
> --
>
> Key: SPARK-10946
> URL: https://issues.apache.org/jira/browse/SPARK-10946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.1
>Reporter: Pallavi Priyadarshini
>Priority: Minor
>
> Certain DataFrame APIs invoke DDLs such as CREATE TABLE and DROP TABLE under 
> the covers. Current code in DataFrameWriter and JDBCUtils uses 
> PreparedStatement.executeUpdate to issue the DDLs to the DBs. This causes the 
> DDLs to fail against a couple of databases that do not support preparing DDLs.
> Can we use Statement.executeUpdate instead of 
> PreparedStatement.executeUpdate? DDL is not a repetitive activity, so there 
> shouldn't be a performance impact.
> I can submit a PULL request if no one has objections.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982001#comment-14982001
 ] 

Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 6:46 AM:
-

I've been thinking about this for a while, and I think the underlying issue is 
that the conversion before storing as an Int leads to a lot of strange 
behaviors. If we are going to have the Date type represent days from epoch, we 
should most likely throw out all information below that granularity.

Adding a test of
{code} checkFromToJavaDate(new Date(0)){code}
shows the trouble of trying to take the more granular information into account.

The date will be converted to some hours before epoch by the timezone magic (if 
you live in America) and then rounded down to -1. This means it fails the check 
because
{code}[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68){code}



was (Author: rspitzer):
I've been thinking about this for a while, and I think the underlying issue is 
that the conversion before storing as an Int leads to a lot of strange 
behaviors. If we are going to have the Date type represent days from epoch we 
should most likely throw out all information outside of the granularity.

Adding a test of  
{code} checkFromToJavaDate(new Date(0)){code} 
Shows the trouble of trying to take into account the more granular information 

The date will be converted to some hours before epoch by the timezone magic (if 
you live in america) then rounded down to -1. This means it fails the check 
because
[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68)


> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982001#comment-14982001
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

I've been thinking about this for a while, and I think the underlying issue is 
that the conversion before storing as an Int leads to a lot of strange 
behaviors. If we are going to have the Date type represent days from epoch we 
should most likely throw out all information outside of the granularity.

Adding a test of  
{code} checkFromToJavaDate(new Date(0)){code} 
Shows the trouble of trying to take into account the more granular information 

The date will be converted to some hours before epoch by the timezone magic (if 
you live in america) then rounded down to -1. This means it fails the check 
because
[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68)


> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11413) Java 8 build has problem with joda-time and s3 request, should bump joda-time version

2015-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982044#comment-14982044
 ] 

Sean Owen commented on SPARK-11413:
---

How about updating Joda-Time to 2.8.1 everywhere? What are the compatibility 
issues, new dependencies, etc.?
It doesn't seem to make sense to do it only for Java 8.

> Java 8 build has problem with joda-time and s3 request, should bump joda-time 
> version
> -
>
> Key: SPARK-11413
> URL: https://issues.apache.org/jira/browse/SPARK-11413
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Yongjia Wang
>Priority: Minor
>
> Joda-Time has problems with formatting time zones starting with Java 1.8u60, 
> and this causes S3 requests to fail. It is said to have been fixed in 
> joda-time 2.8.1.
> Spark still uses joda-time 2.5 by default; if Java 8 is used to build Spark, 
> -Djoda.version=2.8.1 or above should be set.
> I was hit by this problem, and -Djoda.version=2.9 worked.
> I don't see any reason not to bump the joda-time version in pom.xml.
> Should I create a pull request for this? It is trivial.
> https://github.com/aws/aws-sdk-java/issues/484 
> https://github.com/aws/aws-sdk-java/issues/444
> http://stackoverflow.com/questions/32058431/aws-java-sdk-aws-authentication-requires-a-valid-date-or-x-amz-date-header



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2015-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982043#comment-14982043
 ] 

Sean Owen commented on SPARK-11416:
---

[~hozawa] as you may see from a number of other issues, it is not necessarily 
as simple as updating a version number, especially across a major version. You 
would need to investigate incompatible changes both for Spark and for end users. 
Is it fully backwards compatible? Does it introduce new dependencies?

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Priority: Trivial
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11385) Make foreachActive public in MLLib's vector API

2015-10-30 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-11385:

Summary: Make foreachActive public in MLLib's vector API  (was: Add foreach 
API to MLLib's vector API)

> Make foreachActive public in MLLib's vector API
> ---
>
> Key: SPARK-11385
> URL: https://issues.apache.org/jira/browse/SPARK-11385
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Assignee: Nakul Jindal
>Priority: Minor
>
> Add a foreach API to MLLib's vector.
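
A brief sketch of the {{foreachActive}} usage this issue proposes to expose publicly; it 
visits only the active (stored) entries of a vector, assuming the existing 
{{(Int, Double) => Unit}} signature is kept:

{code}
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(5, Array(1, 3), Array(10.0, 20.0))
v.foreachActive { (idx, value) =>
  println(s"index $idx -> $value")   // runs only for the stored indices 1 and 3
}
{code}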



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11416) Upgrade kryo package to version 3.0

2015-10-30 Thread Hitoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982094#comment-14982094
 ] 

Hitoshi Ozawa commented on SPARK-11416:
---

Sean, we have already tested and are using Kryo 3.0 with Spark in our environment. 
All we had to do was replace the Kryo file.

> Upgrade kryo package to version 3.0
> ---
>
> Key: SPARK-11416
> URL: https://issues.apache.org/jira/browse/SPARK-11416
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Hitoshi Ozawa
>Priority: Trivial
>
> Would like to have Apache Spark upgrade kryo package from 2.x (current) to 
> 3.x.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11402) Allow to define a custom driver runner and executor runner

2015-10-30 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982030#comment-14982030
 ] 

Jacek Lewandowski edited comment on SPARK-11402 at 10/30/15 8:04 AM:
-

That's right, this is not only the introduction of pluggable executor and driver 
runners. However, at least some of the changes are related to coding practices, 
which I believe are good:
- {{WorkerPage}} uses neither {{ExecutorRunner}} nor {{DriverRunner}}, but 
traits which expose only the methods to read information
- There is functionality related to executor and driver runners which is 
common to both of them. By introducing interfaces I made this explicit - from a 
scheduling point of view, both executors and drivers are services which 
consume resources, so as far as scheduling is concerned, they should be 
generalised and handled uniformly
- The number of arguments passed to the runner constructors grew to 
ridiculous numbers, so the related settings were collected in dedicated case 
classes - those which represent the worker configuration and those which 
represent the process configuration
- Fixed some inconsistency - the driver main directory was created inside the 
driver runner while the executor directory was created in the worker; this was 
unified

Do you think the aforementioned changes are wrong? I don't believe they 
increase the maintenance and support effort, because the refactoring I proposed 
is towards better code reuse, separation of concerns and encapsulation. The 
ability to plug in a custom runner implementation is just a small part of this 
whole patch (and as far as I know, using factories is also a good practice - in 
this case it will also improve the testability of Worker).



was (Author: jlewandowski):
That's right, this is not only introduction of pluggable executor and driver 
runners. However, at least some of the changes are related to coding practises, 
which i believe are good:
- {{WorkerPage}} does use neither {{ExecutorRunner}} nor {{DriverRunner}}, but 
traits which expose only the methods to read information
- There is a functionality related to executor and driver runners which is 
common to both of them. By introducing interfaces I made this explicit - from 
scheduling point of view, either executor or driver are some services which 
consume resources so as far as scheduling is concerned, they should be 
generalised and handled uniformly
- the number of arguments which were passed to runner constructors grew to 
ridiculous numbers so the related settings were collected in dedicates case 
classes - those which represent the worker configuration and those which 
represent the process configuration
- fixed some inconsistency - the driver main directory was created inside the 
driver runner while executor directory was created in the worker - this was 
unified

Do you think the aforementioned changes are wrong? I don't believe they 
increase the maintenance and support effort because the refactoring i proposed 
is towards better code reuse, separation of concerns and encapsulation. The 
ability to plug a custom runner implementation is just a small part of this 
whole patch (and afaik using factories is also a good practice).


> Allow to define a custom driver runner and executor runner
> --
>
> Key: SPARK-11402
> URL: https://issues.apache.org/jira/browse/SPARK-11402
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{DriverRunner}} and {{ExecutorRunner}} are used by Spark Worker in 
> standalone mode to spawn driver and executor processes respectively. When 
> integrating Spark with some environments, it would be useful to allow 
> providing a custom implementation of those components.
> The idea is simple - provide factory class names for driver and executor 
> runners in Worker configuration. By default, the current implementations are 
> used. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11138) Flaky pyspark test: test_add_py_file

2015-10-30 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982112#comment-14982112
 ] 

Marcelo Vanzin commented on SPARK-11138:


Hmm, interesting. I clicked on a few recent builds before I replied and I saw 
logs in those. Maybe the one above has been cleaned up or something?

> Flaky pyspark test: test_add_py_file
> 
>
> Key: SPARK-11138
> URL: https://issues.apache.org/jira/browse/SPARK-11138
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>  Labels: flaky-test
>
> This test fails pretty often when running PR tests. For example:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console
> {noformat}
> ==
> ERROR: test_add_py_file (__main__.AddFileTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 396, in test_add_py_file
> res = self.sc.parallelize(range(2)).map(func).first()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1315, in first
> rs = self.take(1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1297, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py",
>  line 923, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
> self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> format(target_id, '.', name), value)
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 
> in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 
> (TID 7, localhost): org.apache.spark.api.python.PythonException: Traceback 
> (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 111, in main
> process()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py",
>  line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 263, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", 
> line 1293, in takeUpToNumLeft
> yield next(iterator)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", 
> line 388, in func
> from userlibrary import UserClass
> ImportError: cannot import name UserClass
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> 

[jira] [Created] (SPARK-11419) WriteAheadLog recovery improvements for when closeFileAfterWrite is enabled

2015-10-30 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-11419:
---

 Summary: WriteAheadLog recovery improvements for when 
closeFileAfterWrite is enabled
 Key: SPARK-11419
 URL: https://issues.apache.org/jira/browse/SPARK-11419
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Burak Yavuz


The support for closing WriteAheadLog files after writes was just merged in. 
Closing every file after a write is a very expensive operation as it creates 
many small files on S3. It's not necessary to enable it on HDFS anyway.

However, when you have many small files on S3, recovery takes a very long time. We can 
parallelize the recovery process.

In addition, files start stacking up pretty quickly, and deletes may not be able to 
keep up; therefore, we should add support for that as well.
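
A rough sketch of parallelizing the read-back step, assuming independent segment files; 
the file listing and per-segment reader below are hypothetical stand-ins, not the actual 
WriteAheadLog API:

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helpers standing in for the real WAL file listing and segment reader.
def listSegmentFiles(): Seq[String] = Seq("s3://bucket/wal/segment-0", "s3://bucket/wal/segment-1")
def readSegment(path: String): Seq[Array[Byte]] = Seq.empty

// Read all segments concurrently instead of one by one.
val recovered: Seq[Array[Byte]] =
  Await.result(Future.traverse(listSegmentFiles())(p => Future(readSegment(p))), 10.minutes).flatten
{code}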



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11421) Add the ability to add a jar to the current class loader

2015-10-30 Thread holdenk (JIRA)
holdenk created SPARK-11421:
---

 Summary: Add the ability to add a jar to the current class loader
 Key: SPARK-11421
 URL: https://issues.apache.org/jira/browse/SPARK-11421
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk
Priority: Minor


addJar adds jars for future operations, but it could also add them to the current 
class loader. This would be really useful in Python and R, where some included 
Python code may wish to add jars.
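
A hedged sketch of the underlying mechanism (a class loader that can absorb jar URLs at 
runtime); this is illustrative only, not the API that would actually be exposed:

{code}
import java.io.File
import java.net.{URL, URLClassLoader}

// Illustrative only: URLClassLoader.addURL is protected, so a small subclass exposes it.
class MutableURLClassLoader(parent: ClassLoader)
  extends URLClassLoader(Array.empty[URL], parent) {
  def addJar(path: String): Unit = addURL(new File(path).toURI.toURL)
}

val loader = new MutableURLClassLoader(getClass.getClassLoader)
loader.addJar("/tmp/extra-udfs.jar")   // hypothetical jar path
{code}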



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11421) Add the ability to add a jar to the current class loader

2015-10-30 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-11421:

Component/s: Spark Core

> Add the ability to add a jar to the current class loader
> 
>
> Key: SPARK-11421
> URL: https://issues.apache.org/jira/browse/SPARK-11421
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> addJar adds jars for future operations, but it could also add them to the current 
> class loader. This would be really useful in Python and R, where some included 
> Python code may wish to add jars.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11417) @Override is not supported by older version of Janino

2015-10-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11417.

   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9372
[https://github.com/apache/spark/pull/9372]

> @Override is not supported by older version of Janino
> -
>
> Key: SPARK-11417
> URL: https://issues.apache.org/jira/browse/SPARK-11417
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0, 1.5.2
>
>
> {code}
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: 
> Invalid character input "@" (character code 64)
> public SpecificOrdering 
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
>   return new SpecificOrdering(expr);
> }
> class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
>   
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   
>   
>   
>   public 
> SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) 
> {
> expressions = expr;
> 
>   }
>   
>   @Override
>   public int compare(InternalRow a, InternalRow b) {
> InternalRow i = null;  // Holds current row being evaluated.
> 
> i = a;
> boolean isNullA2;
> long primitiveA3;
> {
>   /* input[2, LongType] */
>   
>   boolean isNull0 = i.isNullAt(2);
>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>   
>   isNullA2 = isNull0;
>   primitiveA3 = primitive1;
> }
> i = b;
> boolean isNullB4;
> long primitiveB5;
> {
>   /* input[2, LongType] */
>   
>   boolean isNull0 = i.isNullAt(2);
>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>   
>   isNullB4 = isNull0;
>   primitiveB5 = primitive1;
> }
> if (isNullA2 && isNullB4) {
>   // Nothing
> } else if (isNullA2) {
>   return 1;
> } else if (isNullB4) {
>   return -1;
> } else {
>   int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
> -1 : 0);
>   if (comp != 0) {
> return -comp;
>   }
> }
> 
> return 0;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11386) Refactor appropriate uses of Vector to use the new foreach API

2015-10-30 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reopened SPARK-11386:
-

Closed with the wrong status.

> Refactor appropriate uses of Vector to use the new foreach API
> --
>
> Key: SPARK-11386
> URL: https://issues.apache.org/jira/browse/SPARK-11386
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Once SPARK-11385 (Add foreach API to MLLib's vector API) is in, look for 
> places where it should be used internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11386) Refactor appropriate uses of Vector to use the new foreach API

2015-10-30 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-11386.
---
Resolution: Invalid

> Refactor appropriate uses of Vector to use the new foreach API
> --
>
> Key: SPARK-11386
> URL: https://issues.apache.org/jira/browse/SPARK-11386
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> Once SPARK-11385 (Add foreach API to MLLib's vector API) is in, look for 
> places where it should be used internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11426) JDBCRDD should use JdbcDialect to probe for the existence of a table

2015-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11426:
--
Priority: Minor  (was: Major)

Please set the component. The specific change you are proposing just fixes a 
dialect correctness problem and doesn't change whether the table name is 
escaped.

> JDBCRDD should use JdbcDialect to probe for the existence of a table
> 
>
> Key: SPARK-11426
> URL: https://issues.apache.org/jira/browse/SPARK-11426
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Rick Hillegas
>Priority: Minor
>
> JdbcDialect.getTableExistsQuery() lets us specify which query to use in order 
> to probe for the existence of a table. Some JdbcDialects (Postgres and MySQL) 
> override the default query used for this purpose. Presumably, we should use 
> those preferred overrides every time that Spark probes for the existence of a 
> table.
> However, JDBCRDD.resolveTable() has its own, hard-coded query which probes 
> for the existence of a table. We should make JDBCRDD.resolveTable() call 
> JdbcDialect.getTableExistsQuery() instead.
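
A hedged sketch of the proposed call site, using the JdbcDialects API the description 
refers to (connection handling simplified for illustration):

{code}
import java.sql.{Connection, SQLException}
import org.apache.spark.sql.jdbc.JdbcDialects

// Ask the dialect for its table-existence probe instead of hard-coding a query.
def tableExists(conn: Connection, url: String, table: String): Boolean = {
  val dialect = JdbcDialects.get(url)
  try {
    val stmt = conn.prepareStatement(dialect.getTableExistsQuery(table))
    try { stmt.executeQuery(); true } finally { stmt.close() }
  } catch {
    case _: SQLException => false
  }
}
{code}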



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8586) SQL add jar command does not work well with Scala REPL

2015-10-30 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983278#comment-14983278
 ] 

holdenk commented on SPARK-8586:


I've got a related fix in https://issues.apache.org/jira/browse/SPARK-11421; we 
could maybe try to use it here (if that gets merged in).

> SQL add jar command does not work well with Scala REPL
> --
>
> Key: SPARK-8586
> URL: https://issues.apache.org/jira/browse/SPARK-8586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> It seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. 
> So, a SerDe added through the add jar command may not be loaded in the context 
> class loader when we look up the table.
> For example, the following code will fail when we try to show the table. 
> {code}
> hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar")
> hive.sql("drop table if exists jsonTable")
> hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe'")
> hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", 
> "val").insertInto("jsonTable")
> hive.table("jsonTable").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11430) DataFrame's except method does not work, returns 0

2015-10-30 Thread Ram Kandasamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983437#comment-14983437
 ] 

Ram Kandasamy commented on SPARK-11430:
---

Thanks Sean, will do in the future

> DataFrame's except method does not work, returns 0
> --
>
> Key: SPARK-11430
> URL: https://issues.apache.org/jira/browse/SPARK-11430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ram Kandasamy
>
> This may or may not be related to this bug here: 
> https://issues.apache.org/jira/browse/SPARK-11427
> But basically, the except method in DataFrames should mirror the 
> functionality of the subtract method in RDDs, but it is not doing so.
> Here is an example:
> scala> val firstFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
> firstFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> val secondFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-10-23/*").select("id").distinct
> secondFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> firstFile.count
> res1: Long = 1072046
> scala> secondFile.count
> res2: Long = 3569941
> scala> firstFile.except(secondFile).count
> res3: Long = 0
> scala> firstFile.rdd.subtract(secondFile.rdd).count
> res4: Long = 1072046
> Can anyone help out here? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11392) GroupedIterator's hasNext is not idempotent

2015-10-30 Thread Nakul Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983487#comment-14983487
 ] 

Nakul Jindal commented on SPARK-11392:
--

[~yhuai], [~cloud_fan] - SPARK-11393 works around the problem mentioned in 
this JIRA. Would we need to revert the changes made by the associated PR if 
this JIRA were to be resolved?

> GroupedIterator's hasNext is not idempotent
> ---
>
> Key: SPARK-11392
> URL: https://issues.apache.org/jira/browse/SPARK-11392
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> If we call 
> [GroupedIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/GroupedIterator.scala]'s
>  {{hasNext}} immediately after its {{next}}, we will generate an extra group 
> ([CoGroupedIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CoGroupedIterator.scala]
>  has this behavior). 
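For context, the usual way to make {{hasNext}} idempotent is to buffer a single looked-ahead element so that repeated {{hasNext}} calls never advance the underlying iterator. A minimal sketch in plain Scala (this is only an illustration of the pattern, not Spark's actual GroupedIterator):

{code}
// Illustrative only: hasNext caches one looked-ahead element and is side-effect
// free on repeated calls; next() consumes the cached element.
class LookaheadIterator[A](underlying: Iterator[A]) extends Iterator[A] {
  private var buffered: Option[A] = None

  override def hasNext: Boolean = {
    if (buffered.isEmpty && underlying.hasNext) {
      buffered = Some(underlying.next())  // advance at most once per element
    }
    buffered.isDefined
  }

  override def next(): A = {
    if (!hasNext) throw new NoSuchElementException("next on empty iterator")
    val result = buffered.get
    buffered = None
    result
  }
}
{code}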



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11010) Fixes and enhancements addressing UDTs' api and several usability concerns

2015-10-30 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983237#comment-14983237
 ] 

Rick Hillegas commented on SPARK-11010:
---

Linking this issue to SPARK-10855. The API for declaring UDTs should make it 
possible to bind a Spark-side class or wrapper to a user-defined Java class 
which is stored in an external RDBMS and whose JDBC type is represented as 
java.sql.Types.JAVA_OBJECT. 

> Fixes and enhancements addressing UDTs' api and several usability concerns
> --
>
> Key: SPARK-11010
> URL: https://issues.apache.org/jira/browse/SPARK-11010
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: John Muller
>  Labels: UDT
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Catalyst SQL types allow for UserDefinedTypes, but do not allow for easy 
> extension of 3rd party types or extensions to built-in types like DecimalType 
> or StringType (private classes).
> Additionally, the API can infer much more of what's needed from the type 
> parameter than it currently does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11423) Remove PrepareRDD

2015-10-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11423.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9381
[https://github.com/apache/spark/pull/9381]

> Remove PrepareRDD 
> --
>
> Key: SPARK-11423
> URL: https://issues.apache.org/jira/browse/SPARK-11423
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> Since SPARK-10342 is resolved, MapPartitionWithPrepare is not needed anymore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11432) Personalized PageRank shouldn't use uniform initialization

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11432:


Assignee: Apache Spark

> Personalized PageRank shouldn't use uniform initialization
> --
>
> Key: SPARK-11432
> URL: https://issues.apache.org/jira/browse/SPARK-11432
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.1
>Reporter: Yves Raimond
>Assignee: Apache Spark
>Priority: Minor
>
> The current implementation of personalized pagerank in GraphX uses uniform 
> initialization over the full graph - every vertex will get initially 
> activated.
> For example:
> {code}
> import org.apache.spark._
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> val users: RDD[(VertexId, (String, String))] =
>   sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", 
> "postdoc")),
>(5L, ("franklin", "prof")), (2L, ("istoica", "prof"
> val relationships: RDD[Edge[String]] =
>   sc.parallelize(Array(Edge(3L, 7L, "collab"),Edge(5L, 3L, "advisor"),
>Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> val defaultUser = ("John Doe", "Missing")
> val graph = Graph(users, relationships, defaultUser)
> graph.staticPersonalizedPageRank(3L, 0, 
> 0.15).vertices.collect.foreach(println)
> {code}
> Leads to all vertices being set to resetProb (0.15), which is different from 
> the behavior described in SPARK-5854, where only the source node should be 
> activated. 
> The risk is that, after a few iterations, the most activated nodes are the 
> source node and the nodes that were untouched by the propagation. For example 
> in the above example the vertex 2L will always have an activation of 0.15:
> {code}
> graph.personalizedPageRank(3L, 0, 0.15).vertices.collect.foreach(println)
> {code}
> This leads to a higher score for 2L than for 7L and 5L, even though 
> there's no outbound path from 3L to 2L.
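A minimal sketch of the initialization the issue argues for, reusing {{graph}} and the 0.15 reset probability from the example above (the variable names are assumptions; this is not the proposed patch, just an illustration of source-only initialization):

{code}
// Sketch only: start with all rank mass on the source vertex instead of
// initializing every vertex to resetProb.
val source: org.apache.spark.graphx.VertexId = 3L
val resetProb = 0.15
val initialRanks = graph.mapVertices { (id, _) =>
  if (id == source) resetProb else 0.0  // non-source vertices start inactive
}
initialRanks.vertices.collect.foreach(println)
{code}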



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11432) Personalized PageRank shouldn't use uniform initialization

2015-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983544#comment-14983544
 ] 

Apache Spark commented on SPARK-11432:
--

User 'moustaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/9386

> Personalized PageRank shouldn't use uniform initialization
> --
>
> Key: SPARK-11432
> URL: https://issues.apache.org/jira/browse/SPARK-11432
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.1
>Reporter: Yves Raimond
>Priority: Minor
>
> The current implementation of personalized pagerank in GraphX uses uniform 
> initialization over the full graph - every vertex will get initially 
> activated.
> For example:
> {code}
> import org.apache.spark._
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> val users: RDD[(VertexId, (String, String))] =
>   sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", 
> "postdoc")),
>(5L, ("franklin", "prof")), (2L, ("istoica", "prof"
> val relationships: RDD[Edge[String]] =
>   sc.parallelize(Array(Edge(3L, 7L, "collab"),Edge(5L, 3L, "advisor"),
>Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> val defaultUser = ("John Doe", "Missing")
> val graph = Graph(users, relationships, defaultUser)
> graph.staticPersonalizedPageRank(3L, 0, 
> 0.15).vertices.collect.foreach(println)
> {code}
> Leads to all vertices being set to resetProb (0.15), which is different from 
> the behavior described in SPARK-5854, where only the source node should be 
> activated. 
> The risk is that, after a few iterations, the most activated nodes are the 
> source node and the nodes that were untouched by the propagation. For example 
> in the above example the vertex 2L will always have an activation of 0.15:
> {code}
> graph.personalizedPageRank(3L, 0, 0.15).vertices.collect.foreach(println)
> {code}
> This leads to a higher score for 2L than for 7L and 5L, even though 
> there's no outbound path from 3L to 2L.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11382) Replace example code in mllib-decision-tree.md/mllib-ensembles.md using include_example

2015-10-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11382:
--
Shepherd: Xusen Yin

> Replace example code in mllib-decision-tree.md/mllib-ensembles.md using 
> include_example
> ---
>
> Key: SPARK-11382
> URL: https://issues.apache.org/jira/browse/SPARK-11382
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-decision-tree.md and mllib-ensembles.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11418) Add link to the source file at the end of included example code

2015-10-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-11418.
-
Resolution: Duplicate

> Add link to the source file at the end of included example code
> ---
>
> Key: SPARK-11418
> URL: https://issues.apache.org/jira/browse/SPARK-11418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> It is useful to put a link to the source file after `{% highlight %}` code 
> block, so users can quickly locate the source file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11337) Make example code in user guide testable

2015-10-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983569#comment-14983569
 ] 

Xiangrui Meng commented on SPARK-11337:
---

I totally forgot I already made one ...

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
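To make the proposal concrete, a marked-up example file might look like the sketch below. The marker-comment syntax ($example on$ / $example off$) and the data path are placeholders only, since the description above explicitly leaves the marker syntax open for discussion:

{code}
// Hypothetical examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.clustering.KMeans

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample"))
    val sqlContext = new SQLContext(sc)
    // $example on$   <- only the region between the markers is pulled into the guide
    val dataset = sqlContext.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
    val model = new KMeans().setK(2).setSeed(1L).fit(dataset)
    // $example off$
    sc.stop()
  }
}
{code}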



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11300) Support for string length when writing to JDBC

2015-10-30 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983631#comment-14983631
 ] 

Suresh Thalamati edited comment on SPARK-11300 at 10/30/15 11:59 PM:
-

Another related issue is SPARK-10849; the fix will allow users to override the 
data type for any field. 


was (Author: tsuresh):
Another related issue is Spark-10849, Fix will  allow users to override data 
type  for any field. 

> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType field is written to JDBC as TEXT.
> I'd like to have an option to write it as VARCHAR(size).
> Maybe we could use StringType(size)?
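For what it's worth, until something like StringType(size) exists, a per-database workaround is to register a custom JdbcDialect (assuming the JdbcDialect extension point in recent Spark releases). A rough sketch; the VARCHAR length and the URL prefix are arbitrary choices for illustration:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Map every StringType column to VARCHAR(255) when writing through a jdbc:mysql URL.
object SizedStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case _          => None  // defer to the default mapping for other types
  }
}

JdbcDialects.registerDialect(SizedStringDialect)
{code}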



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-30 Thread Tycho Grouwstra (JIRA)
Tycho Grouwstra created SPARK-11431:
---

 Summary: Allow exploding arrays of structs in DataFrames
 Key: SPARK-11431
 URL: https://issues.apache.org/jira/browse/SPARK-11431
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Tycho Grouwstra


I am creating DataFrames from some [JSON 
data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
explode an array of structs (as are common in JSON) to their own rows so I 
could start analyzing the data using GraphX. I believe many others might have 
use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to 
DataFrames by [~marmbrus], which currently errors when you call it on such 
arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
function to infer column types -- this approach is insufficient in the case of 
Rows, since their type does not contain the required info. The alternative here 
would be to instead grab the schema info from the existing schema for such 
cases.

I'm trying to implement a patch that might add this functionality, so stay 
tuned until I've figured that out. I'm new here though so I'll probably have 
use for some feedback...
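For illustration, the kind of call being described is roughly the sketch below. The file path, column name, and struct fields are invented, and per the description this path currently fails because schemaFor cannot derive a schema for Row:

{code}
import org.apache.spark.sql.Row

// Assume a JSON source with a column "legs" of type array<struct<origin:string, dest:string>>.
val trips = sqlContext.read.json("trips.json")
// Desired behavior: one output row per element of the "legs" array.
val legs = trips.explode[Seq[Row], Row]("legs", "leg")(identity)
legs.select("leg.origin", "leg.dest").show()
{code}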




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11392) GroupedIterator's hasNext is not idempotent

2015-10-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983490#comment-14983490
 ] 

Yin Huai commented on SPARK-11392:
--

I think it is not necessary to revert that. But, we may want to update some 
comments if the behavior of GroupedIterator's hasNext has been changed.

> GroupedIterator's hasNext is not idempotent
> ---
>
> Key: SPARK-11392
> URL: https://issues.apache.org/jira/browse/SPARK-11392
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> If we call 
> [GroupedIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/GroupedIterator.scala]'s
>  {{hasNext}} immediately after its {{next}}, we will generate an extra group 
> ([CoGroupedIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CoGroupedIterator.scala]
>  has this behavior). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11433) [SQL] Rule EliminateSubQueries does not clean the parent Project's qualifiers

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11433:


Assignee: (was: Apache Spark)

> [SQL] Rule EliminateSubQueries does not clean the parent Project's qualifiers
> -
>
> Key: SPARK-11433
> URL: https://issues.apache.org/jira/browse/SPARK-11433
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> In the rule EliminateSubQueries, the current implementation just removes the 
> Subquery node. 
> However, the qualifiers of its parent Project node could still record the 
> subquery's alias name, although the subquery has been removed.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11433) [SQL] Rule EliminateSubQueries does not clean the parent Project's qualifiers

2015-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983489#comment-14983489
 ] 

Apache Spark commented on SPARK-11433:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/9385

> [SQL] Rule EliminateSubQueries does not clean the parent Project's qualifiers
> -
>
> Key: SPARK-11433
> URL: https://issues.apache.org/jira/browse/SPARK-11433
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> In the rule EliminateSubQueries, the current implementation just removes the 
> Subquery node. 
> However, the qualifiers of its parent Project node could still record the 
> subquery's alias name, although the subquery has been removed.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11425) Improve hybrid aggregation (sort-based after hash-based)

2015-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11425:
--
Component/s: Spark Core

> Improve hybrid aggregation (sort-based after hash-based)
> 
>
> Key: SPARK-11425
> URL: https://issues.apache.org/jira/browse/SPARK-11425
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>
> After aggregation, the dataset could be smaller than the inputs, so it's better 
> to do hash-based aggregation for all inputs first and then use sort-based 
> aggregation to merge them.
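As a toy sketch of the strategy (entirely independent of Spark's internal operators): aggregate into a bounded hash map, spill a sorted run whenever the map fills up, and merge the sorted runs with a final sort-based pass.

{code}
import scala.collection.mutable

// Toy illustration, not Spark code: sum values per key with a bounded hash map,
// spilling sorted runs, then merging the runs with a sort-based pass.
def hybridAggregate(input: Iterator[(String, Long)], maxMapSize: Int): Seq[(String, Long)] = {
  val runs = mutable.ArrayBuffer.empty[Seq[(String, Long)]]
  val map = mutable.HashMap.empty[String, Long]

  for ((k, v) <- input) {
    map(k) = map.getOrElse(k, 0L) + v
    if (map.size >= maxMapSize) {       // "memory" exhausted: spill a sorted run
      runs += map.toSeq.sortBy(_._1)
      map.clear()
    }
  }
  runs += map.toSeq.sortBy(_._1)

  // Sort-based merge (simplified: re-sort everything, then fold adjacent equal keys).
  runs.flatten.sortBy(_._1).foldLeft(List.empty[(String, Long)]) {
    case ((pk, pv) :: rest, (k, v)) if pk == k => (pk, pv + v) :: rest
    case (acc, kv)                             => kv :: acc
  }.reverse
}

// Example: hybridAggregate(Iterator("a" -> 1L, "b" -> 2L, "a" -> 3L), maxMapSize = 2)
// returns Seq(("a", 4), ("b", 2)).
{code}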



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11275) [SQL] Regression in rollup/cube

2015-10-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983324#comment-14983324
 ] 

Xiao Li commented on SPARK-11275:
-

Hi, Andrew, 

Expression is not the root cause. The implementation of cube/rollup is 
incomplete. I am trying to test all the edge cases. Also working on the fix. 
Unfortunately, when I fix this problem, I keep finding more bugs in the code 
base. 

Thanks, 

Xiao 


> [SQL] Regression in rollup/cube 
> 
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>
> Spark SQL is unable to generate a correct result when the following query 
> uses rollup. 
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUP__ID looks OK. 
> 2. The problem still exists no matter whether turning on/off CODEGEN
> 3. Rollup still works in a simple query when group-by columns have only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The test buckets in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although the buckets passed, they only 
> compare the results of SQL and DataFrame, so they cannot capture a 
> regression when both return the same wrong results.  
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11428) Schema Merging Broken for Some Queries

2015-10-30 Thread Brad Willard (JIRA)
Brad Willard created SPARK-11428:


 Summary: Schema Merging Broken for Some Queries
 Key: SPARK-11428
 URL: https://issues.apache.org/jira/browse/SPARK-11428
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.5.1
 Environment: AWS,
Reporter: Brad Willard


I have data being written into parquet format via spark streaming. The data can 
change slightly so schema merging is required. I load a dataframe like this

{code}
urls = [
"/streaming/parquet/events/key=2015-10-30*",
"/streaming/parquet/events/key=2015-10-29*"
]

sdf = sql_context.read.option("mergeSchema", "true").parquet(*urls)
sdf.registerTempTable('events')
{code}

If I print the schema, you can see the contested column:

{code}
sdf.printSchema()


root
 |-- _id: string (nullable = true)
...
 |-- d__device_s: string (nullable = true)
 |-- d__isActualPageLoad_s: string (nullable = true)
 |-- d__landing_s: string (nullable = true)
 |-- d__lang_s: string (nullable = true)
 |-- d__os_s: string (nullable = true)
 |-- d__performance_i: long (nullable = true)
 |-- d__product_s: string (nullable = true)
 |-- d__refer_s: string (nullable = true)
 |-- d__rk_i: long (nullable = true)
 |-- d__screen_s: string (nullable = true)
 |-- d__submenuName_s: string (nullable = true)
{code}

The column that's in one file but not the other is d__product_s.

So I'm able to run this query and it works fine.
{code}
sql_context.sql('''
select 
distinct(d__product_s) 
from 
events
where 
n = 'view'
''').collect()

[Row(d__product_s=u'website'),
 Row(d__product_s=u'store'),
 Row(d__product_s=None),
 Row(d__product_s=u'page')]

{code}

However, if I instead use that column in the where clause, things break.

{code}
sql_context.sql('''
select 
* 
from 
events
where 
n = 'view' and d__product_s = 'page'
''').take(1)

---
Py4JJavaError Traceback (most recent call last)
 in ()
  6 where
  7 n = 'frontsite_view' and d__product_s = 'page'
> 8 ''').take(1)

/root/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
303 with SCCallSiteSync(self._sc) as css:
304 port = 
self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
--> 305 self._jdf, num)
306 return list(_load_from_socket(port, 
BatchedSerializer(PickleSerializer(
307 

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
 34 def deco(*a, **kw):
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:
 38 s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 15.0 failed 30 times, most recent failure: Lost task 0.29 in stage 15.0 
(TID 6536, 10.149.1.168): java.lang.IllegalArgumentException: Column 
[d__product_s] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
at 
org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:131)
at 

[jira] [Created] (SPARK-11427) DataFrame's intersect method does not work, returns 1

2015-10-30 Thread Ram Kandasamy (JIRA)
Ram Kandasamy created SPARK-11427:
-

 Summary: DataFrame's intersect method does not work, returns 1
 Key: SPARK-11427
 URL: https://issues.apache.org/jira/browse/SPARK-11427
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Ram Kandasamy


Hello,

I was working with DataFrames and found that the intersect() method seems to 
always return '1'. The RDD's intersection() method does work properly.

Consider this example:
scala> val firstFile = 
sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
firstFile: org.apache.spark.sql.DataFrame = [id: string]

scala> firstFile.count
res4: Long = 1072046

scala> firstFile.intersect(firstFile).count
res5: Long = 1

scala> firstFile.rdd.intersection(firstFile.rdd).count
res6: Long = 1072046


I have tried various cases, and for some reason, the DataFrame's 
intersect method always returns 1. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11348.
---
Resolution: Duplicate

This was already covered by JIRAs like 
https://issues.apache.org/jira/browse/SPARK-10342

It's not a problem anyway. If you're going to clean up deprecations, go ahead 
and try to do all of them.

> Replace addOnCompleteCallback with addTaskCompletionListener() in 
> UnsafeExternalSorter
> --
>
> Key: SPARK-11348
> URL: https://issues.apache.org/jira/browse/SPARK-11348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11348.txt
>
>
> When running the command from SPARK-11318, I got the following:
> {code}
> [WARNING] 
> /home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
>  [deprecation]  
> addOnCompleteCallback(Function0) in TaskContext has been deprecated
> {code}
> addOnCompleteCallback should be replaced with addTaskCompletionListener()
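For reference, the mechanical replacement looks roughly like this in Scala (the actual call site is Java code in UnsafeExternalSorter; the helper below is only illustrative):

{code}
import org.apache.spark.TaskContext

// Illustrative helper: register a cleanup action to run when the task completes.
def registerCleanup(taskContext: TaskContext, cleanup: () => Unit): Unit = {
  // Deprecated style: taskContext.addOnCompleteCallback(cleanup)
  taskContext.addTaskCompletionListener { _ => cleanup() }
}
{code}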



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11430) DataFrame's except method does not work, returns 0

2015-10-30 Thread Ram Kandasamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983416#comment-14983416
 ] 

Ram Kandasamy commented on SPARK-11430:
---

https://issues.apache.org/jira/browse/SPARK-10539 looks to have resolved this 
in 1.5.1, closing this

> DataFrame's except method does not work, returns 0
> --
>
> Key: SPARK-11430
> URL: https://issues.apache.org/jira/browse/SPARK-11430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ram Kandasamy
> Fix For: 1.5.1
>
>
> This may or may not be related to this bug here: 
> https://issues.apache.org/jira/browse/SPARK-11427
> But basically, the except method in dataframes should mirror the 
> functionality of the subtract method in RDDs, but it is not doing so.
> Here is an example:
> scala> val firstFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
> firstFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> val secondFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-10-23/*").select("id").distinct
> secondFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> firstFile.count
> res1: Long = 1072046
> scala> secondFile.count
> res2: Long = 3569941
> scala> firstFile.except(secondFile).count
> res3: Long = 0
> scala> firstFile.rdd.subtract(secondFile.rdd).count
> res4: Long = 1072046
> Can anyone help out here? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11433) [SQL] Rule EliminateSubQueries does not clean the parent Project's qualifiers

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11433:


Assignee: Apache Spark

> [SQL] Rule EliminateSubQueries does not clean the parent Project's qualifiers
> -
>
> Key: SPARK-11433
> URL: https://issues.apache.org/jira/browse/SPARK-11433
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> In the rule EliminateSubQueries, the current implementation just removes the 
> Subquery node. 
> However, the qualifiers of its parent Project node could still record the 
> subquery's alias name, although the subquery has been removed.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames

2015-10-30 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983488#comment-14983488
 ] 

Tycho Grouwstra commented on SPARK-11431:
-

One might wonder if the similar-sounding issue SPARK-7734 ("make explode 
support struct type") is related to this. That one concerned splitting structs 
into multiple columns though. That's relevant here as well, but the issue here 
pertains to splitting arrays over rows instead (as in the existing `explode` 
function).

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tycho Grouwstra
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON 
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to 
> explode an array of structs (as are common in JSON) to their own rows so I 
> could start analyzing the data using GraphX. I believe many others might have 
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to 
> DataFrames by [~marmbrus], which currently errors when you call it on such 
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor 
> function to infer column types -- this approach is insufficient in the case 
> of Rows, since their type does not contain the required info. The alternative 
> here would be to instead grab the schema info from the existing schema for 
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay 
> tuned until I've figured that out. I'm new here though so I'll probably have 
> use for some feedback...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11432) Personalized PageRank shouldn't use uniform initialization

2015-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11432:


Assignee: (was: Apache Spark)

> Personalized PageRank shouldn't use uniform initialization
> --
>
> Key: SPARK-11432
> URL: https://issues.apache.org/jira/browse/SPARK-11432
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.1
>Reporter: Yves Raimond
>Priority: Minor
>
> The current implementation of personalized pagerank in GraphX uses uniform 
> initialization over the full graph - every vertex will get initially 
> activated.
> For example:
> {code}
> import org.apache.spark._
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> val users: RDD[(VertexId, (String, String))] =
>   sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", 
> "postdoc")),
>(5L, ("franklin", "prof")), (2L, ("istoica", "prof"
> val relationships: RDD[Edge[String]] =
>   sc.parallelize(Array(Edge(3L, 7L, "collab"),Edge(5L, 3L, "advisor"),
>Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> val defaultUser = ("John Doe", "Missing")
> val graph = Graph(users, relationships, defaultUser)
> graph.staticPersonalizedPageRank(3L, 0, 
> 0.15).vertices.collect.foreach(println)
> {code}
> Leads to all vertices being set to resetProb (0.15), which is different from 
> the behavior described in SPARK-5854, where only the source node should be 
> activated. 
> The risk is that, after a few iterations, the most activated nodes are the 
> source node and the nodes that were untouched by the propagation. For example 
> in the above example the vertex 2L will always have an activation of 0.15:
> {code}
> graph.personalizedPageRank(3L, 0, 0.15).vertices.collect.foreach(println)
> {code}
> This leads to a higher score for 2L than for 7L and 5L, even though 
> there's no outbound path from 3L to 2L.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11420) Updating Stddev support with Imperative Aggregate

2015-10-30 Thread Jihong MA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jihong MA updated SPARK-11420:
--
Summary: Updating Stddev support with Imperative Aggregate  (was: Changing 
Stddev support with Imperative Aggregate)

> Updating Stddev support with Imperative Aggregate
> -
>
> Key: SPARK-11420
> URL: https://issues.apache.org/jira/browse/SPARK-11420
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Based on the performance comparison of Declarative vs. Imperative Aggregate 
> (SPARK-10953), switch stddev to an Imperative Aggregate. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-10-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983272#comment-14983272
 ] 

Xiangrui Meng edited comment on SPARK-9836 at 10/30/15 8:42 PM:


Yes, this JIRA is only for the normal equation solver and linear regression. We 
don't need to add all statistics in a single PR. Let's add statistics that can 
be easily derived from `diag(A^T W A)` and the residuals.


was (Author: mengxr):
Yes, this JIRA is only for the normal equation solver and linear regression. We 
don't need to add all statistics in a single PR. Let's add statistics that can 
be easily derived from `diag(A^T W A)`.

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users select 
> the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min        1Q    Median        3Q       Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11340) Support setting driver properties when starting Spark from R programmatically or from RStudio

2015-10-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-11340:
--
Assignee: Felix Cheung

> Support setting driver properties when starting Spark from R programmatically 
> or from RStudio
> -
>
> Key: SPARK-11340
> URL: https://issues.apache.org/jira/browse/SPARK-11340
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently when sparkR.init() is called in 'client' mode, it launches the JVM 
> backend but driver properties (like driver-memory) are not passed or settable 
> by the user calling sparkR.init().
> [~sunrui][~shivaram] and I discussed this offline and think we should support 
> this.
> This is the original thread:
> >> From: rui@intel.com
> >> To: dirceu.semigh...@gmail.com
> >> CC: u...@spark.apache.org
> >> Subject: RE: How to set memory for SparkR with master="local[*]"
> >> Date: Mon, 26 Oct 2015 02:24:00 +
> >>
> >> As documented in
> >> http://spark.apache.org/docs/latest/configuration.html#available-prop
> >> e
> >> rties,
> >>
> >> Note for “spark.driver.memory”:
> >>
> >> Note: In client mode, this config must not be set through the 
> >> SparkConf directly in your application, because the driver JVM has 
> >> already started at that point. Instead, please set this through the 
> >> --driver-memory command line option or in your default properties file.
> >>
> >>
> >>
> >> If you are to start a SparkR shell using bin/sparkR, then you can use 
> >> bin/sparkR –driver-memory. You have no chance to set the driver 
> >> memory size after the R shell has been launched via bin/sparkR.
> >>
> >>
> >>
> >> But if you are to start a SparkR shell manually without using 
> >> bin/sparkR (for example, in Rstudio), you can:
> >>
> >> library(SparkR)
> >>
> >> Sys.setenv("SPARKR_SUBMIT_ARGS" = "--conf spark.driver.memory=2g
> >> sparkr-shell")
> >>
> >> sc <- sparkR.init()
> >>
> >>
> >>
> >> From: Dirceu Semighini Filho [mailto:dirceu.semigh...@gmail.com]
> >> Sent: Friday, October 23, 2015 7:53 PM
> >> Cc: user
> >> Subject: Re: How to set memory for SparkR with master="local[*]"
> >>
> >>
> >>
> >> Hi Matej,
> >>
> >> I'm also using this and I'm having the same behavior here, my driver 
> >> has only 530mb which is the default value.
> >>
> >>
> >>
> >> Maybe this is a bug.
> >>
> >>
> >>
> >> 2015-10-23 9:43 GMT-02:00 Matej Holec :
> >>
> >> Hello!
> >>
> >> How to adjust the memory settings properly for SparkR with 
> >> master="local[*]"
> >> in R?
> >>
> >>
> >> *When running from  R -- SparkR doesn't accept memory settings :(*
> >>
> >> I use the following commands:
> >>
> >> R>  library(SparkR)
> >> R>  sc <- sparkR.init(master = "local[*]", sparkEnvir =
> >> list(spark.driver.memory = "5g"))
> >>
> >> Despite the variable spark.driver.memory is correctly set (checked in 
> >> http://node:4040/environment/), the driver has only the default 
> >> amount of memory allocated (Storage Memory 530.3 MB).
> >>
> >> *But when running from  spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*
> >>
> >> The following command:
> >>
> >> ]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g
> >>
> >> creates a SparkR session with properly adjusted driver memory (Storage 
> >> Memory
> >> 2.6 GB).
> >>
> >>
> >> Any suggestion?
> >>
> >> Thanks
> >> Matej
> >>
> >>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11340) Support setting driver properties when starting Spark from R programmatically or from RStudio

2015-10-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-11340.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9290
[https://github.com/apache/spark/pull/9290]

> Support setting driver properties when starting Spark from R programmatically 
> or from RStudio
> -
>
> Key: SPARK-11340
> URL: https://issues.apache.org/jira/browse/SPARK-11340
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently when sparkR.init() is called in 'client' mode, it launches the JVM 
> backend but driver properties (like driver-memory) are not passed or settable 
> by the user calling sparkR.init().
> [~sunrui][~shivaram] and I discussed this offline and think we should support 
> this.
> This is the original thread:
> >> From: rui@intel.com
> >> To: dirceu.semigh...@gmail.com
> >> CC: u...@spark.apache.org
> >> Subject: RE: How to set memory for SparkR with master="local[*]"
> >> Date: Mon, 26 Oct 2015 02:24:00 +
> >>
> >> As documented in
> >> http://spark.apache.org/docs/latest/configuration.html#available-prop
> >> e
> >> rties,
> >>
> >> Note for “spark.driver.memory”:
> >>
> >> Note: In client mode, this config must not be set through the 
> >> SparkConf directly in your application, because the driver JVM has 
> >> already started at that point. Instead, please set this through the 
> >> --driver-memory command line option or in your default properties file.
> >>
> >>
> >>
> >> If you are to start a SparkR shell using bin/sparkR, then you can use 
> >> bin/sparkR –driver-memory. You have no chance to set the driver 
> >> memory size after the R shell has been launched via bin/sparkR.
> >>
> >>
> >>
> >> But if you are to start a SparkR shell manually without using 
> >> bin/sparkR (for example, in Rstudio), you can:
> >>
> >> library(SparkR)
> >>
> >> Sys.setenv("SPARKR_SUBMIT_ARGS" = "--conf spark.driver.memory=2g
> >> sparkr-shell")
> >>
> >> sc <- sparkR.init()
> >>
> >>
> >>
> >> From: Dirceu Semighini Filho [mailto:dirceu.semigh...@gmail.com]
> >> Sent: Friday, October 23, 2015 7:53 PM
> >> Cc: user
> >> Subject: Re: How to set memory for SparkR with master="local[*]"
> >>
> >>
> >>
> >> Hi Matej,
> >>
> >> I'm also using this and I'm having the same behavior here, my driver 
> >> has only 530mb which is the default value.
> >>
> >>
> >>
> >> Maybe this is a bug.
> >>
> >>
> >>
> >> 2015-10-23 9:43 GMT-02:00 Matej Holec :
> >>
> >> Hello!
> >>
> >> How to adjust the memory settings properly for SparkR with 
> >> master="local[*]"
> >> in R?
> >>
> >>
> >> *When running from  R -- SparkR doesn't accept memory settings :(*
> >>
> >> I use the following commands:
> >>
> >> R>  library(SparkR)
> >> R>  sc <- sparkR.init(master = "local[*]", sparkEnvir =
> >> list(spark.driver.memory = "5g"))
> >>
> >> Despite the variable spark.driver.memory is correctly set (checked in 
> >> http://node:4040/environment/), the driver has only the default 
> >> amount of memory allocated (Storage Memory 530.3 MB).
> >>
> >> *But when running from  spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*
> >>
> >> The following command:
> >>
> >> ]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g
> >>
> >> creates a SparkR session with properly adjusted driver memory (Storage 
> >> Memory
> >> 2.6 GB).
> >>
> >>
> >> Any suggestion?
> >>
> >> Thanks
> >> Matej
> >>
> >>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11340) Support setting driver properties when starting Spark from R programmatically or from RStudio

2015-10-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983294#comment-14983294
 ] 

Shivaram Venkataraman commented on SPARK-11340:
---

[~felixcheung][~sunrui] It'll be great if we can ask the user who reported this 
to test the fix!

> Support setting driver properties when starting Spark from R programmatically 
> or from RStudio
> -
>
> Key: SPARK-11340
> URL: https://issues.apache.org/jira/browse/SPARK-11340
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently when sparkR.init() is called in 'client' mode, it launches the JVM 
> backend but driver properties (like driver-memory) are not passed or settable 
> by the user calling sparkR.init().
> [~sunrui][~shivaram] and I discussed this offline and think we should support 
> this.
> This is the original thread:
> >> From: rui@intel.com
> >> To: dirceu.semigh...@gmail.com
> >> CC: u...@spark.apache.org
> >> Subject: RE: How to set memory for SparkR with master="local[*]"
> >> Date: Mon, 26 Oct 2015 02:24:00 +
> >>
> >> As documented in
> >> http://spark.apache.org/docs/latest/configuration.html#available-prop
> >> e
> >> rties,
> >>
> >> Note for “spark.driver.memory”:
> >>
> >> Note: In client mode, this config must not be set through the 
> >> SparkConf directly in your application, because the driver JVM has 
> >> already started at that point. Instead, please set this through the 
> >> --driver-memory command line option or in your default properties file.
> >>
> >>
> >>
> >> If you are to start a SparkR shell using bin/sparkR, then you can use 
> >> bin/sparkR –driver-memory. You have no chance to set the driver 
> >> memory size after the R shell has been launched via bin/sparkR.
> >>
> >>
> >>
> >> But if you are to start a SparkR shell manually without using 
> >> bin/sparkR (for example, in Rstudio), you can:
> >>
> >> library(SparkR)
> >>
> >> Sys.setenv("SPARKR_SUBMIT_ARGS" = "--conf spark.driver.memory=2g
> >> sparkr-shell")
> >>
> >> sc <- sparkR.init()
> >>
> >>
> >>
> >> From: Dirceu Semighini Filho [mailto:dirceu.semigh...@gmail.com]
> >> Sent: Friday, October 23, 2015 7:53 PM
> >> Cc: user
> >> Subject: Re: How to set memory for SparkR with master="local[*]"
> >>
> >>
> >>
> >> Hi Matej,
> >>
> >> I'm also using this and I'm having the same behavior here, my driver 
> >> has only 530mb which is the default value.
> >>
> >>
> >>
> >> Maybe this is a bug.
> >>
> >>
> >>
> >> 2015-10-23 9:43 GMT-02:00 Matej Holec :
> >>
> >> Hello!
> >>
> >> How to adjust the memory settings properly for SparkR with 
> >> master="local[*]"
> >> in R?
> >>
> >>
> >> *When running from  R -- SparkR doesn't accept memory settings :(*
> >>
> >> I use the following commands:
> >>
> >> R>  library(SparkR)
> >> R>  sc <- sparkR.init(master = "local[*]", sparkEnvir =
> >> list(spark.driver.memory = "5g"))
> >>
> >> Despite the variable spark.driver.memory is correctly set (checked in 
> >> http://node:4040/environment/), the driver has only the default 
> >> amount of memory allocated (Storage Memory 530.3 MB).
> >>
> >> *But when running from  spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*
> >>
> >> The following command:
> >>
> >> ]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g
> >>
> >> creates a SparkR session with properly adjusted driver memory (Storage 
> >> Memory
> >> 2.6 GB).
> >>
> >>
> >> Any suggestion?
> >>
> >> Thanks
> >> Matej
> >>
> >>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11429) Provide Number of Iterations used in ALS

2015-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11429.
---
Resolution: Invalid

As the caller, you'd already know that, right? u...@spark.apache.org is the 
place for questions. 

> Provide Number of Iterations used in ALS
> 
>
> Key: SPARK-11429
> URL: https://issues.apache.org/jira/browse/SPARK-11429
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Richard Garris
>
> There is the maxIter parameter, but if I want to know how many iterations were 
> actually used, how do I get that?
> Is there a getIteration to get the value after the model has been trained?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11236) Upgrade Tachyon dependency to 0.8.0

2015-10-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983630#comment-14983630
 ] 

Yin Huai commented on SPARK-11236:
--

I reverted the PR because it broke lots of Hadoop 1 tests 
(https://github.com/apache/spark/commit/e8ec2a7b01cc86329a6fbafc3d371bdfd79fc1d6).

> Upgrade Tachyon dependency to 0.8.0
> ---
>
> Key: SPARK-11236
> URL: https://issues.apache.org/jira/browse/SPARK-11236
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Calvin Jia
>
> Update the tachyon-client dependency from 0.7.1 to 0.8.0. No new 
> dependencies are added and no Spark-facing APIs are changed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11300) Support for string length when writing to JDBC

2015-10-30 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983631#comment-14983631
 ] 

Suresh Thalamati commented on SPARK-11300:
--

Another related issue is Spark-10849; the fix will allow users to override the 
data type for any field. 

> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType field is written to JDBC as TEXT.
> I'd like to have an option to write it as VARCHAR(size).
> Maybe we could use StringType(size)?
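A possible interim workaround, sketched only: Spark's public JdbcDialect API can override the mapping from Catalyst types to database column types, so a custom dialect could emit a fixed-size VARCHAR for StringType until a per-column size option exists. The MySQL URL prefix and the 255-character limit below are arbitrary illustrations, not a committed approach.

{code}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical dialect: write StringType columns as VARCHAR(255) instead of TEXT.
object FixedVarcharDialect extends JdbcDialect {
  // Apply only to MySQL URLs (illustrative choice).
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case _          => None // fall back to the default mapping
  }
}

// Register the dialect before writing the DataFrame through the JDBC data source.
JdbcDialects.registerDialect(FixedVarcharDialect)
{code}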



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11432) Personalized PageRank shouldn't use uniform initialization

2015-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11432:
--
Assignee: Yves Raimond

> Personalized PageRank shouldn't use uniform initialization
> --
>
> Key: SPARK-11432
> URL: https://issues.apache.org/jira/browse/SPARK-11432
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.1
>Reporter: Yves Raimond
>Assignee: Yves Raimond
>Priority: Minor
>
> The current implementation of personalized PageRank in GraphX uses uniform 
> initialization over the full graph: every vertex is initially 
> activated.
> For example:
> {code}
> import org.apache.spark._
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> val users: RDD[(VertexId, (String, String))] =
>   sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", 
> "postdoc")),
>(5L, ("franklin", "prof")), (2L, ("istoica", "prof"
> val relationships: RDD[Edge[String]] =
>   sc.parallelize(Array(Edge(3L, 7L, "collab"),Edge(5L, 3L, "advisor"),
>Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> val defaultUser = ("John Doe", "Missing")
> val graph = Graph(users, relationships, defaultUser)
> graph.staticPersonalizedPageRank(3L, 0, 
> 0.15).vertices.collect.foreach(println)
> {code}
> This leads to all vertices being set to resetProb (0.15), which differs from 
> the behavior described in SPARK-5854, where only the source node should be 
> activated. 
> The risk is that, after a few iterations, the most activated nodes are the 
> source node and the nodes that were untouched by the propagation. For example, 
> in the graph above, vertex 2L will always have an activation of 0.15:
> {code}
> graph.personalizedPageRank(3L, 0, 0.15).vertices.collect.foreach(println)
> {code}
> This leads to a higher score for 2L than for 7L and 5L, even though 
> there is no path from 3L to 2L.
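For clarity, a minimal sketch (not the actual GraphX patch) of the initialization this issue argues for, reusing {{graph}} from the example above: only the source vertex starts activated, every other vertex starts at zero.

{code}
import org.apache.spark.graphx._

// Hypothetical starting state for personalized PageRank from source 3L:
// only the source vertex is activated; all other vertices start at 0.0.
val sourceId: VertexId = 3L
val resetProb = 0.15
val initialRanks: Graph[Double, String] =
  graph.mapVertices((id, attr) => if (id == sourceId) resetProb else 0.0)
initialRanks.vertices.collect.foreach(println)
{code}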



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11337) Make example code in user guide testable

2015-10-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983561#comment-14983561
 ] 

Xiangrui Meng commented on SPARK-11337:
---

I totally forgot I already made one ...

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
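To make the proposal concrete, a rough sketch of what a marked-up example file could look like; the marker comments ({{// example on}} / {{// example off}}) are placeholders, since choosing the real marker syntax is part of this JIRA.

{code}
// examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala (sketch)
package org.apache.spark.examples.ml

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

object KMeansExample {
  def run(dataset: DataFrame): Unit = {
    // example on  -- everything between the markers is pulled into the guide
    val kmeans = new KMeans()
      .setK(2)
      .setFeaturesCol("features")
      .setPredictionCol("prediction")
    val model = kmeans.fit(dataset)
    model.clusterCenters.foreach(println)
    // example off
  }
}
{code}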



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10158) ALS should print better errors when given Long IDs

2015-10-30 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983644#comment-14983644
 ] 

Bryan Cutler commented on SPARK-10158:
--

The only way I can see to handle this from the PySpark side is to add something 
like the following to {{ALS._prepare}} 
([link|https://github.com/apache/spark/blob/master/python/pyspark/mllib/recommendation.py#L215]), 
which is called before training:

{noformat}
MAX_ID_VALUE = ratings.ctx._gateway.jvm.Integer.MAX_VALUE
if ratings.filter(lambda x: x.user > MAX_ID_VALUE or x.product > MAX_ID_VALUE).count() > 0:
    raise ValueError("Rating IDs must be less than max Java int %s." % str(MAX_ID_VALUE))
{noformat}

But an extra pass over the data is probably not worth the cost for this issue.
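For comparison, a Scala-side sketch of the same fail-fast idea (a hypothetical helper, not existing Spark code): check that a Long ID fits in an Int before handing it to ALS, and fail with a readable message otherwise.

{code}
// Hypothetical guard: MLlib ALS ratings use Int user/product IDs, so surface a
// clear error as soon as a Long ID falls outside the Int range.
def checkedAlsId(id: Long): Int = {
  require(id >= Int.MinValue && id <= Int.MaxValue,
    s"ALS only supports IDs within the Int range; got $id")
  id.toInt
}

// Usage: checkedAlsId(4294967296L) throws IllegalArgumentException with the message above.
{code}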

> ALS should print better errors when given Long IDs
> --
>
> Key: SPARK-10158
> URL: https://issues.apache.org/jira/browse/SPARK-10158
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See [SPARK-10115] for the very confusing messages you get when you try to use 
> ALS with Long IDs.  We should catch and identify these errors and print 
> meaningful error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11428) Schema Merging Broken for Some Queries

2015-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983377#comment-14983377
 ] 

Sean Owen commented on SPARK-11428:
---

Related to https://issues.apache.org/jira/browse/SPARK-11412 ?

> Schema Merging Broken for Some Queries
> --
>
> Key: SPARK-11428
> URL: https://issues.apache.org/jira/browse/SPARK-11428
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.5.1
> Environment: AWS,
>Reporter: Brad Willard
>  Labels: dataframe, parquet, pyspark, schema, sparksql
>
> I have data being written in Parquet format via Spark Streaming. The data 
> can change slightly, so schema merging is required. I load a DataFrame like 
> this:
> {code}
> urls = [
> "/streaming/parquet/events/key=2015-10-30*",
> "/streaming/parquet/events/key=2015-10-29*"
> ]
> sdf = sql_context.read.option("mergeSchema", "true").parquet(*urls)
> sdf.registerTempTable('events')
> {code}
> If I print the schema, you can see the contested column:
> {code}
> sdf.printSchema()
> root
>  |-- _id: string (nullable = true)
> ...
>  |-- d__device_s: string (nullable = true)
>  |-- d__isActualPageLoad_s: string (nullable = true)
>  |-- d__landing_s: string (nullable = true)
>  |-- d__lang_s: string (nullable = true)
>  |-- d__os_s: string (nullable = true)
>  |-- d__performance_i: long (nullable = true)
>  |-- d__product_s: string (nullable = true)
>  |-- d__refer_s: string (nullable = true)
>  |-- d__rk_i: long (nullable = true)
>  |-- d__screen_s: string (nullable = true)
>  |-- d__submenuName_s: string (nullable = true)
> {code}
> The column that's in one file but not the other is d__product_s.
> So I'm able to run this query, and it works fine:
> {code}
> sql_context.sql('''
> select 
> distinct(d__product_s) 
> from 
> events
> where 
> n = 'view'
> ''').collect()
> [Row(d__product_s=u'website'),
>  Row(d__product_s=u'store'),
>  Row(d__product_s=None),
>  Row(d__product_s=u'page')]
> {code}
> However, if I instead use that column in the WHERE clause, things break:
> {code}
> sql_context.sql('''
> select 
> * 
> from 
> events
> where 
> n = 'view' and d__product_s = 'page'
> ''').take(1)
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>   6 where
>   7 n = 'frontsite_view' and d__product_s = 'page'
> > 8 ''').take(1)
> /root/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
> 303 with SCCallSiteSync(self._sc) as css:
> 304 port = 
> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
> --> 305 self._jdf, num)
> 306 return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
> 307 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /root/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  34 def deco(*a, **kw):
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
>  38 s = e.java_exception.toString()
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 15.0 failed 30 times, most recent failure: Lost task 0.29 in stage 
> 15.0 (TID 6536, 10.X.X.X): java.lang.IllegalArgumentException: Column 
> [d__product_s] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
> 
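An observation on the trace, plus a hedged workaround (an assumption, not a confirmed fix): the exception comes from Parquet filter push-down validating the pushed predicate against the schema of an individual file that does not contain d__product_s. Disabling push-down makes Spark evaluate the filter itself, after the merged schema is applied. A sketch in Scala (the reporter's code is PySpark; the setting is the same):

{code}
// Assumed workaround: evaluate the d__product_s filter in Spark instead of
// pushing it down into the Parquet reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

val sdf = sqlContext.read.option("mergeSchema", "true")
  .parquet("/streaming/parquet/events/key=2015-10-30*",
           "/streaming/parquet/events/key=2015-10-29*")
sdf.registerTempTable("events")
{code}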

[jira] [Created] (SPARK-11429) Provide Number of Iterations used in ALS

2015-10-30 Thread Richard Garris (JIRA)
Richard Garris created SPARK-11429:
--

 Summary: Provide Number of Iterations used in ALS
 Key: SPARK-11429
 URL: https://issues.apache.org/jira/browse/SPARK-11429
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Richard Garris


There is the maxIter parameter, but if I want to know how many iterations were 
actually used, how do I get that?

Is there a getIteration method to get the value after the model has been trained?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11430) DataFrame's except method does not work, returns 0

2015-10-30 Thread Ram Kandasamy (JIRA)
Ram Kandasamy created SPARK-11430:
-

 Summary: DataFrame's except method does not work, returns 0
 Key: SPARK-11430
 URL: https://issues.apache.org/jira/browse/SPARK-11430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Ram Kandasamy


This may or may not be related to this bug here: 
https://issues.apache.org/jira/browse/SPARK-11427

Basically, the except method on DataFrames should mirror the functionality 
of the subtract method on RDDs, but it is not doing so.

Here is an example:
scala> val firstFile = 
sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
firstFile: org.apache.spark.sql.DataFrame = [id: string]

scala> val secondFile = 
sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-10-23/*").select("id").distinct
secondFile: org.apache.spark.sql.DataFrame = [id: string]

scala> firstFile.count
res1: Long = 1072046

scala> secondFile.count
res2: Long = 3569941

scala> firstFile.except(secondFile).count
res3: Long = 0

scala> firstFile.rdd.subtract(secondFile.rdd).count
res4: Long = 1072046

Can anyone help out here? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11430) DataFrame's except method does not work, returns 0

2015-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-11430:
---

[~ramk256] the right resolution is Duplicate.

> DataFrame's except method does not work, returns 0
> --
>
> Key: SPARK-11430
> URL: https://issues.apache.org/jira/browse/SPARK-11430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ram Kandasamy
>
> This may or may not be related to this bug here: 
> https://issues.apache.org/jira/browse/SPARK-11427
> Basically, the except method on DataFrames should mirror the 
> functionality of the subtract method on RDDs, but it is not doing so.
> Here is an example:
> scala> val firstFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
> firstFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> val secondFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-10-23/*").select("id").distinct
> secondFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> firstFile.count
> res1: Long = 1072046
> scala> secondFile.count
> res2: Long = 3569941
> scala> firstFile.except(secondFile).count
> res3: Long = 0
> scala> firstFile.rdd.subtract(secondFile.rdd).count
> res4: Long = 1072046
> Can anyone help out here? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11430) DataFrame's except method does not work, returns 0

2015-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11430.
---
   Resolution: Duplicate
Fix Version/s: (was: 1.5.1)

> DataFrame's except method does not work, returns 0
> --
>
> Key: SPARK-11430
> URL: https://issues.apache.org/jira/browse/SPARK-11430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ram Kandasamy
>
> This may or may not be related to this bug here: 
> https://issues.apache.org/jira/browse/SPARK-11427
> Basically, the except method on DataFrames should mirror the 
> functionality of the subtract method on RDDs, but it is not doing so.
> Here is an example:
> scala> val firstFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
> firstFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> val secondFile = 
> sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-10-23/*").select("id").distinct
> secondFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> firstFile.count
> res1: Long = 1072046
> scala> secondFile.count
> res2: Long = 3569941
> scala> firstFile.except(secondFile).count
> res3: Long = 0
> scala> firstFile.rdd.subtract(secondFile.rdd).count
> res4: Long = 1072046
> Can anyone help out here? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10946) JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs

2015-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982224#comment-14982224
 ] 

Sean Owen commented on SPARK-10946:
---

[~somi...@us.ibm.com] you just need to open a pull request: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate 
> for DDLs
> --
>
> Key: SPARK-10946
> URL: https://issues.apache.org/jira/browse/SPARK-10946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.1
>Reporter: Pallavi Priyadarshini
>Priority: Minor
>
> Certain DataFrame APIs invoke DDLs such as CREATE TABLE and DROP TABLE under 
> the covers. The current code in DataFrameWriter and JDBCUtils uses 
> PreparedStatement.executeUpdate to issue the DDLs to the databases. This causes 
> the DDLs to fail against a couple of databases that do not support preparing 
> DDL statements.
> Can we use Statement.executeUpdate instead of 
> PreparedStatement.executeUpdate? DDL is not a repetitive activity, so there 
> shouldn't be a performance impact.
> I can submit a pull request if no one has objections.
> Thanks.
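For reference, the change being proposed boils down to issuing DDL through a plain java.sql.Statement. A minimal standalone sketch (hypothetical helper, not the actual Spark patch; {{url}} and {{ddl}} are placeholders):

{code}
import java.sql.DriverManager

// Hypothetical helper: run a DDL string via Statement.executeUpdate rather than
// preparing it first, since some databases reject prepared DDL.
def executeDdl(url: String, ddl: String): Unit = {
  val conn = DriverManager.getConnection(url)
  try {
    val stmt = conn.createStatement()
    try {
      stmt.executeUpdate(ddl) // e.g. "CREATE TABLE t (id INTEGER)"
    } finally {
      stmt.close()
    }
  } finally {
    conn.close()
  }
}
{code}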



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7673) DataSourceStrategy's buildPartitionedTableScan always list file status for all data files

2015-10-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-7673:
--
Summary: DataSourceStrategy's buildPartitionedTableScan always list file 
status for all data files   (was: DataSourceStrategy's 
buildPartitionedTableScan always list list file status for all data files )

> DataSourceStrategy's buildPartitionedTableScan always list file status for 
> all data files 
> --
>
> Key: SPARK-7673
> URL: https://issues.apache.org/jira/browse/SPARK-7673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.4.0
>
>
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L134-141



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


