[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter

2016-10-13 Thread Chirag Vaya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574339#comment-15574339
 ] 

Chirag Vaya commented on SPARK-10063:
-

[~mkim] Can you please tell us in which environment (standalone Spark on a single 
node, multiple nodes, or AWS EMR) you were using the direct output committer? 
According to [~rxin], any environment that can have a network partition (e.g. AWS EMR) 
would lead to inconsistencies. Please correct me if I am wrong on this.

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call 
> the output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.
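
A workaround sketch, not from the ticket: the race only exists when speculative 
attempts write through the direct committer, so keeping speculation off for S3 
writes narrows the data-loss window. The {{spark.speculation}} flag below is the 
standard one; everything else is illustrative.

{code}
// Minimal sketch (assumes only the standard spark.speculation key).
// With speculation disabled there is no second attempt racing the original
// task's committed S3 file.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3-write-without-speculation")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}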






[jira] [Created] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0

2016-10-13 Thread pin_zhang (JIRA)
pin_zhang created SPARK-17932:
-

 Summary: Failed to run SQL "show table extended  like table_name"  
in Spark2.0.0
 Key: SPARK-17932
 URL: https://issues.apache.org/jira/browse/SPARK-17932
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: pin_zhang


SQL "show table extended  like table_name " doesn't work in spark 2.0.0
that works in spark1.5.2
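
A minimal repro sketch for comparing the two versions; the table name below is a 
hypothetical placeholder, not from the report.

{code}
// Hypothetical table just to have something to match against.
spark.sql("CREATE TABLE IF NOT EXISTS t1 (id INT)")

// Works in Spark 1.5.2 (through sqlContext.sql) but reportedly fails in Spark 2.0.0:
spark.sql("SHOW TABLE EXTENDED LIKE 't1'").show(false)
{code}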









[jira] [Commented] (SPARK-17884) In the cast expression, casting from empty string to interval type throws NullPointerException

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574305#comment-15574305
 ] 

Apache Spark commented on SPARK-17884:
--

User 'priyankagargnitk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15479

> In the cast expression, casting from empty string to interval type throws 
> NullPointerException
> --
>
> Key: SPARK-17884
> URL: https://issues.apache.org/jira/browse/SPARK-17884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Priyanka Garg
>Assignee: Priyanka Garg
> Fix For: 2.0.2, 2.1.0
>
>
> When the cast expression is applied to the empty string "" to cast it to the 
> interval type, it throws a NullPointerException.
> I get the same exception when reproducing it through the test case
> checkEvaluation(Cast(Literal(""), CalendarIntervalType), null)
> The exception I am getting is:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:254)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvalutionWithUnsafeProjection(ExpressionEvalHelper.scala:181)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvalutionWithUnsafeProjection(CastSuite.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvaluation(ExpressionEvalHelper.scala:64)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvaluation(CastSuite.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply$mcV$sp(CastSuite.scala:770)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> 

[jira] [Updated] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data

2016-10-13 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17254:

Attachment: stop-after-physical-plan.pdf

> Filter operator should have “stop if false” semantics for sorted data
> -
>
> Key: SPARK-17254
> URL: https://issues.apache.org/jira/browse/SPARK-17254
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
>Priority: Minor
> Attachments: stop-after-physical-plan.pdf
>
>
> From 
> https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
> Filter on sorted data
> If the data is sorted by a key, filters on the key could stop as soon as the 
> data is out of range. For example, WHERE ticker_id < “F” should stop as soon 
> as the first row starting with “F” is seen. This can be done by adding a Filter 
> operator that has “stop if false” semantics. This is generally useful.
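
For illustration only (not the proposed physical operator): on sorted input, 
“stop if false” behaves like takeWhile rather than filter, as in this sketch.

{code}
// Conceptual sketch: on data sorted by ticker_id, a "stop if false" filter for
// ticker_id < "F" is equivalent to takeWhile, which stops scanning at the first
// out-of-range row, whereas filter keeps scanning the whole input.
val sorted = Seq("AAPL", "AMZN", "EBAY", "FB", "GOOG", "MSFT")

val withFilter    = sorted.filter(_ < "F")    // reads all 6 rows
val withStopAfter = sorted.takeWhile(_ < "F") // stops after reading "FB"

assert(withFilter == withStopAfter) // same result, fewer rows read
{code}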






[jira] [Commented] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables

2016-10-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574251#comment-15574251
 ] 

Dongjoon Hyun commented on SPARK-16632:
---

This was backported in the following commit:

https://github.com/apache/spark/commit/f9367d6

> Vectorized parquet reader fails to read certain fields from Hive tables
> ---
>
> Key: SPARK-16632
> URL: https://issues.apache.org/jira/browse/SPARK-16632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hive 1.1 (CDH)
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.1, 2.1.0
>
>
> The vectorized parquet reader fails to read certain tables created by Hive. 
> When the tables have type "tinyint" or "smallint", Catalyst converts those to 
> "ByteType" and "ShortType" respectively. But when Hive writes those tables in 
> parquet format, the parquet schema in the files contains "int32" fields.
> To reproduce, run these commands in the hive shell (or beeline):
> {code}
> create table abyte (value tinyint) stored as parquet;
> create table ashort (value smallint) stored as parquet;
> insert into abyte values (1);
> insert into ashort values (1);
> {code}
> Then query them with Spark 2.0:
> {code}
> spark.sql("select * from abyte").show();
> spark.sql("select * from ashort").show();
> {code}
> You'll see this exception (for the byte case):
> {noformat}
> 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: 
> Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): 
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {noformat}
> This works when you point Spark directly at the files (instead of using the 
> metastore data), or when you disable the vectorized parquet reader.
> The root cause seems to be that Hive creates these tables with a 
> not-so-complete schema:
> {noformat}
> $ parquet-tools schema /tmp/byte.parquet 
> message hive_schema {
>   optional int32 value;
> }
> {noformat}
> There's no indication that the field is a 32-bit field used to store 8-bit 
> values. When the ParquetReadSupport code tries to consolidate both schemas, 
> it just chooses whatever is in the parquet file for primitive types (see 
> ParquetReadSupport.clipParquetType); the vectorized reader uses the catalyst 
> schema, which comes from the Hive metastore, and says it's a byte field, so 
> when it tries to read the data, the byte data stored in "OnHeapColumnVector" 
> is null.
> I have tested a small change to 
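
Aside, not the change referred to above: the description notes that disabling the 
vectorized Parquet reader avoids the failure. A hedged sketch of that workaround, 
assuming the standard flag name:

{code}
// Workaround sketch: fall back to the non-vectorized Parquet reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from abyte").show()
spark.sql("select * from ashort").show()
{code}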

[jira] [Commented] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables

2016-10-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574253#comment-15574253
 ] 

Dongjoon Hyun commented on SPARK-16632:
---

{code}
spark-2.0:branch-2.0$ git log --oneline | grep SPARK-16632
933d76a [SPARK-16632][SQL] Revert PR #14272: Respect Hive schema when merging 
parquet schema
f9367d6 [SPARK-16632][SQL] Use Spark requested schema to guide vectorized 
Parquet reader initialization
c2b5b3c [SPARK-16632][SQL] Respect Hive schema when merging parquet schema.
{code}

> Vectorized parquet reader fails to read certain fields from Hive tables
> ---
>
> Key: SPARK-16632
> URL: https://issues.apache.org/jira/browse/SPARK-16632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hive 1.1 (CDH)
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.1, 2.1.0
>
>
> The vectorized parquet reader fails to read certain tables created by Hive. 
> When the tables have type "tinyint" or "smallint", Catalyst converts those to 
> "ByteType" and "ShortType" respectively. But when Hive writes those tables in 
> parquet format, the parquet schema in the files contains "int32" fields.
> To reproduce, run these commands in the hive shell (or beeline):
> {code}
> create table abyte (value tinyint) stored as parquet;
> create table ashort (value smallint) stored as parquet;
> insert into abyte values (1);
> insert into ashort values (1);
> {code}
> Then query them with Spark 2.0:
> {code}
> spark.sql("select * from abyte").show();
> spark.sql("select * from ashort").show();
> {code}
> You'll see this exception (for the byte case):
> {noformat}
> 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: 
> Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): 
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {noformat}
> This works when you point Spark directly at the files (instead of using the 
> metastore data), or when you disable the vectorized parquet reader.
> The root cause seems to be that Hive creates these tables with a 
> not-so-complete schema:
> {noformat}
> $ parquet-tools schema /tmp/byte.parquet 
> message hive_schema {
>   optional int32 value;
> }
> {noformat}
> There's no indication that the field is a 32-bit field used to store 8-bit 
> values. When the ParquetReadSupport code tries to consolidate both schemas, 
> it just chooses whatever is in the parquet file for primitive types (see 
> ParquetReadSupport.clipParquetType); 

[jira] [Updated] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables

2016-10-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16632:
--
Fix Version/s: 2.0.1

> Vectorized parquet reader fails to read certain fields from Hive tables
> ---
>
> Key: SPARK-16632
> URL: https://issues.apache.org/jira/browse/SPARK-16632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hive 1.1 (CDH)
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.1, 2.1.0
>
>
> The vectorized parquet reader fails to read certain tables created by Hive. 
> When the tables have type "tinyint" or "smallint", Catalyst converts those to 
> "ByteType" and "ShortType" respectively. But when Hive writes those tables in 
> parquet format, the parquet schema in the files contains "int32" fields.
> To reproduce, run these commands in the hive shell (or beeline):
> {code}
> create table abyte (value tinyint) stored as parquet;
> create table ashort (value smallint) stored as parquet;
> insert into abyte values (1);
> insert into ashort values (1);
> {code}
> Then query them with Spark 2.0:
> {code}
> spark.sql("select * from abyte").show();
> spark.sql("select * from ashort").show();
> {code}
> You'll see this exception (for the byte case):
> {noformat}
> 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: 
> Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): 
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {noformat}
> This works when you point Spark directly at the files (instead of using the 
> metastore data), or when you disable the vectorized parquet reader.
> The root cause seems to be that Hive creates these tables with a 
> not-so-complete schema:
> {noformat}
> $ parquet-tools schema /tmp/byte.parquet 
> message hive_schema {
>   optional int32 value;
> }
> {noformat}
> There's no indication that the field is a 32-bit field used to store 8-bit 
> values. When the ParquetReadSupport code tries to consolidate both schemas, 
> it just chooses whatever is in the parquet file for primitive types (see 
> ParquetReadSupport.clipParquetType); the vectorized reader uses the catalyst 
> schema, which comes from the Hive metastore, and says it's a byte field, so 
> when it tries to read the data, the byte data stored in "OnHeapColumnVector" 
> is null.
> I have tested a small change to {{ParquetReadSupport.clipParquetType}} that 
> fixes this particular issue, but I haven't run any other tests, so I'll 

[jira] [Resolved] (SPARK-17927) Remove dead code in WriterContainer

2016-10-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17927.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15477
[https://github.com/apache/spark/pull/15477]

> Remove dead code in WriterContainer
> ---
>
> Key: SPARK-17927
> URL: https://issues.apache.org/jira/browse/SPARK-17927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.






[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574137#comment-15574137
 ] 

Felix Cheung commented on SPARK-17781:
--

Hmm.. I'm not quite sure what it is just yet - not seeing that in just R.

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for the dapply family of functions, 
> DateTime objects are serialized as double inside the worker.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}






[jira] [Created] (SPARK-17931) taskScheduler has some unneeded serialization

2016-10-13 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-17931:
---

 Summary: taskScheduler has some unneeded serialization
 Key: SPARK-17931
 URL: https://issues.apache.org/jira/browse/SPARK-17931
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Guoqiang Li


When taskScheduler instantiates TaskDescription, it calls 
`Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, 
ser)`, which serializes the task and its dependencies.

But since SPARK-2521 was merged into master, the ResultTask and 
ShuffleMapTask classes no longer contain RDD and closure objects, so the 
TaskDescription class can be changed as below:

{noformat}
class TaskDescription[T](
val taskId: Long,
val attemptNumber: Int,
val executorId: String,
val name: String,
val index: Int, 
val task: Task[T]) extends Serializable
{noformat}






[jira] [Created] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused

2016-10-13 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-17930:
---

 Summary: The SerializerInstance instance used when deserializing a 
TaskResult is not reused 
 Key: SPARK-17930
 URL: https://issues.apache.org/jira/browse/SPARK-17930
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.1, 1.6.1
Reporter: Guoqiang Li


The following code is called when a DirectTaskResult instance is deserialized:

{noformat}
  def value(): T = {
    if (valueObjectDeserialized) {
      valueObject
    } else {
      // Each deserialization creates a new instance of SerializerInstance,
      // which is very time-consuming
      val resultSer = SparkEnv.get.serializer.newInstance()
      valueObject = resultSer.deserialize(valueBytes)
      valueObjectDeserialized = true
      valueObject
    }
  }
{noformat}
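
One possible direction, sketched only and not an actual patch: cache a 
SerializerInstance per thread instead of calling newInstance() on every 
deserialization. SerializerInstance is not required to be thread-safe, hence the 
ThreadLocal.

{noformat}
// Sketch only (assumption: not the merged fix). Reuse a per-thread SerializerInstance.
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.SerializerInstance

object TaskResultSerializers {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance =
      SparkEnv.get.serializer.newInstance()
  }
  def get: SerializerInstance = cached.get()
}

// value() would then deserialize with the cached instance:
//   valueObject = TaskResultSerializers.get.deserialize(valueBytes)
{noformat}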






[jira] [Updated] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset

2016-10-13 Thread Weizhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weizhong updated SPARK-17929:
-
Summary: Deadlock when AM restart and send RemoveExecutor on reset  (was: 
Deadlock when AM restart send RemoveExecutor)

> Deadlock when AM restart and send RemoveExecutor on reset
> -
>
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Weizhong
>Priority: Minor
>
> We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala:
> {code}
>   protected def reset(): Unit = synchronized {
> numPendingExecutors = 0
> executorsPendingToRemove.clear()
> // Remove all the lingering executors that should be removed but not yet.
> // The reason might be because (1) disconnected event is not yet received;
> // (2) executors die silently.
> executorDataMap.toMap.foreach { case (eid, _) =>
>   driverEndpoint.askWithRetry[Boolean](
> RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager 
> re-registered.")))
> }
>   }
> {code}
> but removeExecutor also needs the lock 
> "CoarseGrainedSchedulerBackend.this.synchronized", so this can deadlock; 
> the RPC send will fail, and reset will fail:
> {code}
> private def removeExecutor(executorId: String, reason: 
> ExecutorLossReason): Unit = {
>   logDebug(s"Asked to remove executor $executorId with reason $reason")
>   executorDataMap.get(executorId) match {
> case Some(executorInfo) =>
>   // This must be synchronized because variables mutated
>   // in this block are read when requesting executors
>   val killed = CoarseGrainedSchedulerBackend.this.synchronized {
> addressToExecutorId -= executorInfo.executorAddress
> executorDataMap -= executorId
> executorsPendingLossReason -= executorId
> executorsPendingToRemove.remove(executorId).getOrElse(false)
>   }
>  ...
> {code}
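
One possible direction, sketched only and not the merged patch: snapshot the 
executor ids while holding the lock, then send the RemoveExecutor messages after 
releasing it, so removeExecutor can re-acquire the lock without deadlocking.

{code}
// Sketch of a possible fix (assumption, not the actual patch).
protected def reset(): Unit = {
  val executors = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()
    executorDataMap.keys.toSet // snapshot taken under the lock
  }
  // Sent outside the lock, so removeExecutor can take it without deadlocking.
  executors.foreach { eid =>
    driverEndpoint.askWithRetry[Boolean](
      RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
  }
}
{code}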






[jira] [Created] (SPARK-17929) Deadlock when AM restart send RemoveExecutor

2016-10-13 Thread Weizhong (JIRA)
Weizhong created SPARK-17929:


 Summary: Deadlock when AM restart send RemoveExecutor
 Key: SPARK-17929
 URL: https://issues.apache.org/jira/browse/SPARK-17929
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Weizhong
Priority: Minor


We fixed SPARK-10582 and added reset() in CoarseGrainedSchedulerBackend.scala:
{code}
  protected def reset(): Unit = synchronized {
numPendingExecutors = 0
executorsPendingToRemove.clear()

// Remove all the lingering executors that should be removed but not yet.
// The reason might be because (1) disconnected event is not yet received;
// (2) executors die silently.
executorDataMap.toMap.foreach { case (eid, _) =>
  driverEndpoint.askWithRetry[Boolean](
RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager 
re-registered.")))
}
  }
{code}
but removeExecutor also needs the lock 
"CoarseGrainedSchedulerBackend.this.synchronized", so this can deadlock; 
the RPC send will fail, and reset will fail:
{code}
private def removeExecutor(executorId: String, reason: ExecutorLossReason): 
Unit = {
  logDebug(s"Asked to remove executor $executorId with reason $reason")
  executorDataMap.get(executorId) match {
case Some(executorInfo) =>
  // This must be synchronized because variables mutated
  // in this block are read when requesting executors
  val killed = CoarseGrainedSchedulerBackend.this.synchronized {
addressToExecutorId -= executorInfo.executorAddress
executorDataMap -= executorId
executorsPendingLossReason -= executorId
executorsPendingToRemove.remove(executorId).getOrElse(false)
  }
 ...
{code}






[jira] [Commented] (SPARK-17899) add a debug mode to keep raw table properties in HiveExternalCatalog

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573961#comment-15573961
 ] 

Apache Spark commented on SPARK-17899:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15478

> add a debug mode to keep raw table properties in HiveExternalCatalog
> 
>
> Key: SPARK-17899
> URL: https://issues.apache.org/jira/browse/SPARK-17899
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>







[jira] [Assigned] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-12664:
---

Assignee: Yanbo Liang

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Yanbo Liang
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private, so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 






[jira] [Created] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode

2016-10-13 Thread Drew Robb (JIRA)
Drew Robb created SPARK-17928:
-

 Summary: No driver.memoryOverhead setting for mesos cluster mode
 Key: SPARK-17928
 URL: https://issues.apache.org/jira/browse/SPARK-17928
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.0.1
Reporter: Drew Robb


Mesos cluster mode does not have a configuration setting for the driver's 
memory overhead. This makes scheduling long-running drivers on Mesos via the 
dispatcher very unreliable. There is an equivalent setting for YARN: 
spark.yarn.driver.memoryOverhead.






[jira] [Comment Edited] (SPARK-17898) --repositories needs username and password

2016-10-13 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573929#comment-15573929
 ] 

lichenglin edited comment on SPARK-17898 at 10/14/16 2:41 AM:
--

I know that.

But how do I build these dependencies into my jar?

Could you give me an example with Gradle?

Thanks a lot.

I have tried making a jar with mysql-jdbc.jar,

and I have changed the manifest's classpath.

It failed.


was (Author: licl):
I know it.

But how to build these dependencies into my jar.

Could you give me an example with gradle??

Really thanks a lot.

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> My private repository needs a username and password to access it.
> I can't find a way to declare the username and password when submitting a 
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username 
> and password.






[jira] [Commented] (SPARK-17898) --repositories needs username and password

2016-10-13 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573929#comment-15573929
 ] 

lichenglin commented on SPARK-17898:


I know that.

But how do I build these dependencies into my jar?

Could you give me an example with Gradle?

Thanks a lot.

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> My private repository needs a username and password to access it.
> I can't find a way to declare the username and password when submitting a 
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username 
> and password.






[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573843#comment-15573843
 ] 

Cody Koeninger commented on SPARK-17812:


So I think this is what we're agreed on:

Mutually exclusive subscription options (only assign is new to this ticket)
{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, no inline offsets

Single starting position option with three mutually exclusive types of value
{noformat}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, 
"1": -2}, "topicBar":{"0": -1}}""")
{noformat}

startingOffsets with json fails if any topicpartition in the assignments 
doesn't have an offset.

Sound right?

I'll go ahead and start on it.  I'm assuming I should try to reuse some of the 
existing catalyst Jackson stuff and keep in mind a format that's potentially 
usable by the checkpoints as well?

I don't think earliest / latest is too unclear as long as there's a way to get 
to the other knobs that auto.offset.reset (should have) provided. Punting the 
tunability of new partitions to another ticket sounds good.  
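
For illustration, a sketch of how the options summarized above might be combined 
from user code, assuming the Structured Streaming Kafka source; broker and topic 
names are hypothetical and this is not part of the agreed spec.

{noformat}
// Illustrative sketch only.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  .option("startingOffsets",
    """{"topicfoo": {"0": 1234, "1": -2}, "topicbar": {"0": -1, "1": -1}}""")
  .load()
{noformat}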


> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions






[jira] [Assigned] (SPARK-17927) Remove dead code in WriterContainer

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17927:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove dead code in WriterContainer
> ---
>
> Key: SPARK-17927
> URL: https://issues.apache.org/jira/browse/SPARK-17927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.






[jira] [Commented] (SPARK-17927) Remove dead code in WriterContainer

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573838#comment-15573838
 ] 

Apache Spark commented on SPARK-17927:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15477

> Remove dead code in WriterContainer
> ---
>
> Key: SPARK-17927
> URL: https://issues.apache.org/jira/browse/SPARK-17927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.






[jira] [Assigned] (SPARK-17927) Remove dead code in WriterContainer

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17927:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove dead code in WriterContainer
> ---
>
> Key: SPARK-17927
> URL: https://issues.apache.org/jira/browse/SPARK-17927
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.






[jira] [Created] (SPARK-17927) Remove dead code in WriterContainer

2016-10-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17927:
---

 Summary: Remove dead code in WriterContainer
 Key: SPARK-17927
 URL: https://issues.apache.org/jira/browse/SPARK-17927
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.







[jira] [Commented] (SPARK-17813) Maximum data per trigger

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573806#comment-15573806
 ] 

Cody Koeninger commented on SPARK-17813:


So, the issues to be worked out here (assuming we're still ignoring compacted topics):

maxOffsetsPerTrigger - how are these maximums distributed among partitions?  
What about skewed topics / partitions?

maxOffsetsPerTopicPartitionPerTrigger - (this isn't just hypothetical, e.g. 
SPARK-17510) If we do this, how is this configuration communicated? 

{noformat}
option("maxOffsetsPerTopicPartitionPerTrigger", """{"topicFoo":{"0":600}, 
"topicBar":{"0":300, "1": 600}}""")
{noformat}

{noformat}
option("maxOffsetsPerTopicPerTrigger", """{"topicFoo": 600, "topicBar": 300}""")
{noformat}
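
Illustrative sketch only, not a decided design: one way a single global 
maxOffsetsPerTrigger could be split across partitions is proportionally to each 
partition's backlog, which is one possible answer to the skew question above.

{noformat}
// Hypothetical helper: split a global cap proportionally to per-partition lag.
def distribute(maxOffsets: Long, backlog: Map[String, Long]): Map[String, Long] = {
  val total = backlog.values.sum.max(1L)
  backlog.map { case (tp, lag) =>
    tp -> math.min(lag, (maxOffsets * lag.toDouble / total).toLong)
  }
}
{noformat}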



> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.






[jira] [Assigned] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-17812:


Assignee: Cody Koeninger

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions






[jira] [Commented] (SPARK-17926) Add methods to convert StreamingQueryStatus to json

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573788#comment-15573788
 ] 

Apache Spark commented on SPARK-17926:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15476

> Add methods to convert StreamingQueryStatus to json
> ---
>
> Key: SPARK-17926
> URL: https://issues.apache.org/jira/browse/SPARK-17926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Useful for recording StreamingQueryStatuses when exposed through 
> StreamingQueryListener






[jira] [Assigned] (SPARK-17926) Add methods to convert StreamingQueryStatus to json

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17926:


Assignee: Apache Spark  (was: Tathagata Das)

> Add methods to convert StreamingQueryStatus to json
> ---
>
> Key: SPARK-17926
> URL: https://issues.apache.org/jira/browse/SPARK-17926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Useful for recording StreamingQueryStatuses when exposed through 
> StreamingQueryListener






[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573786#comment-15573786
 ] 

Michael Armbrust commented on SPARK-17812:
--

Please do work on it.  It might be good to update the description with a 
summary of this discussion so we can all be sure we are on the same page.

I actually do think its fair to have one configuration for what to do in the 
case of data loss.  This happens when you fall behind or when you come back and 
new partitions are there that have already aged out.  Lets add this in another 
ticket.

I know you are super deep in Kafka and other should chime in if I'm way 
off-base, but I think that {{startingOffsets=earliest}} and 
{{startingOffsets=latest}} is pretty clear what is happening.  I would not 
change {{earliest}} and {{latest}} just to be different from kafka.  We could 
make it query start if this is still confusing, but lets do that soon if that 
is the case.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions






[jira] [Assigned] (SPARK-17926) Add methods to convert StreamingQueryStatus to json

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17926:


Assignee: Tathagata Das  (was: Apache Spark)

> Add methods to convert StreamingQueryStatus to json
> ---
>
> Key: SPARK-17926
> URL: https://issues.apache.org/jira/browse/SPARK-17926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Useful for recording StreamingQueryStatuses when exposed through 
> StreamingQueryListener






[jira] [Updated] (SPARK-17926) Add methods to convert StreamingQueryStatus to json

2016-10-13 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-17926:
--
Description: Useful for recording StreamingQueryStatuses when exposed 
through StreamingQueryListener

> Add methods to convert StreamingQueryStatus to json
> ---
>
> Key: SPARK-17926
> URL: https://issues.apache.org/jira/browse/SPARK-17926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Useful for recording StreamingQueryStatuses when exposed through 
> StreamingQueryListener






[jira] [Created] (SPARK-17926) Add methods to convert StreamingQueryStatus to json

2016-10-13 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-17926:
-

 Summary: Add methods to convert StreamingQueryStatus to json
 Key: SPARK-17926
 URL: https://issues.apache.org/jira/browse/SPARK-17926
 Project: Spark
  Issue Type: Sub-task
Reporter: Tathagata Das
Assignee: Tathagata Das









[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573766#comment-15573766
 ] 

Cody Koeninger commented on SPARK-17812:


OK, failing on start is clear (it's really annoying in the case of 
subscribePattern), but at least it's clear.  I think that's enough to get 
started on this ticket, is anyone currently working on it or can I do it?  Ryan 
seemed worried that it wouldn't get done in time for the next release.

It sounds like your current plan is to ignore whatever comes out of KAFKA-3370, 
which is fine as long as whatever you do is both clear and equally tunable.  
Clarity of semantics can't be the only criterion of an API, "You can only start 
at latest offset, period" is clear, but a crap api.

{quote}
the only case where we lack sufficient tunability is "Where do I go when the 
current offsets are invalid due to retention?".
{quote}

No, you lack sufficient tunability as to where newly discovered partitions 
start.  Keep in mind that those partitions may have been discovered after a 
significant job downtime.  If the point of an API is to provide clear semantics 
to the user, it is not at all clear to me as a user how I can start those 
partitions at latest, which I know is possible in the underlying data model.

The reason I'm belaboring this point now is that you have chosen names 
(earliest, latest) for the API currently under discussion that are confusingly 
similar to the existing auto offset reset functionality, and you have provided 
knobs for some, but not all, of the things auto offset reset currently affects. 
 This is going to confuse people, it already confuses me.



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions






[jira] [Resolved] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2016-10-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-17368.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15284
[https://github.com/apache/spark/pull/15284]

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
> Fix For: 2.1.0
>
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error below. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}
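
For the affected versions, a minimal workaround sketch (not from this ticket, names are illustrative): use an ordinary case class, without {{extends AnyVal}}, as the Dataset element type.

{code}
// Workaround sketch for 1.6.x / 2.0.x: an ordinary case class encodes fine,
// so dropping "extends AnyVal" sidesteps the value-class encoder failure.
import org.apache.spark.sql.{Dataset, SparkSession}

object PlainCaseClassWorkaround {
  case class FeatureIdPlain(v: Int)   // no "extends AnyVal"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._
    val ds: Dataset[FeatureIdPlain] =
      spark.createDataset(Seq(FeatureIdPlain(1), FeatureIdPlain(2), FeatureIdPlain(3)))
    println(s"count = ${ds.count}")   // no "Couldn't find v on int" error here
  }
}
{code}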



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573692#comment-15573692
 ] 

Hossein Falaki commented on SPARK-17916:


Thanks for linking it. Yes, they are very much the same issue. However, I 
slightly disagree with the proposed solution. I will comment on the PR.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When a user configures {{nullValue}} in the CSV data source, in addition to 
> those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
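
A minimal, self-contained sketch of the reported behavior (the file path is hypothetical; the data and options are the ones shown above):

{code}
// Reads the two-row file above with nullValue set to "-".
// Expected: only "-" becomes null; reported: the empty string in row 2 becomes null as well.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")
  .load("/tmp/nullvalue-repro.csv")   // hypothetical path holding the data shown above
df.show()
{code}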



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573677#comment-15573677
 ] 

Michael Armbrust commented on SPARK-17812:
--

bq. with your proposed interface, what, as a user, do you expect to happen when 
you specify startingOffsets for some but not all partitions?

I would probably opt to fail to start the query with advice on how to fix it 
(i.e. "specify {{-1}} for these partitions if you don't care").  We could also 
have a default, but I tend to err on the side of explicit behavior.

bq. Yes, auto.offset.reset is a mess. Have you read 
https://issues.apache.org/jira/browse/KAFKA-3370 What are you going to do when 
that ticket is resolved? It should allow users to answer the questions you 
raised in very specific ways, that your interface does not.

There is clearly a lot of confusing baggage with this configuration option, 
specifically because it is conflating too many unrelated concerns. Furthermore, 
IMHO {{auto.offset.reset}} is a pretty confusing name that does not imply 
anything about where in the stream this query should start. "reset" implies you 
were set somewhere to begin with.

In contrast, {{startingOffsets}} handles one case clearly: it picks the offsets 
that are used as a starting point for the append only table abstraction that 
Spark is providing.

As far as I understand the discussion on the ticket you referenced, the only 
case where we lack sufficient tunability is "Where do I go when the current 
offsets are invalid due to retention?".

In this case, where data has been lost and {{failOnDataLoss=false}}, we 
currently try to minimize the amount of data we lose by starting at the 
earliest offsets available.  We should certainly consider making this behavior 
configurable as well, but that seems like a different concern than what is 
being discussed in this JIRA.

Personally, it seems like if you are falling so far behind that you have to 
skip all the way ahead, something is going very wrong.  However, if users 
request this feature, we should certainly add it. I would not, however, 
shoe-horn it into anything having to do with query start behavior. It seems 
like they have reached a similar conclusion, as they are considering adding a 
new configuration, {{auto.reset.offset.existing}}.

bq. Is the purpose of your interface to do what you think users should be able 
to do, or what they need to be able to do?

The purpose of an interface is to provide clear semantics to the user.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573668#comment-15573668
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

Hi [~falaki], this JIRA rings a bell for me. Would you mind taking a look at 
https://github.com/apache/spark/pull/12904 and SPARK-15125? Could you confirm 
whether this is a duplicate?

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When a user configures {{nullValue}} in the CSV data source, in addition to 
> those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17925:


Assignee: Apache Spark  (was: Reynold Xin)

> Break fileSourceInterfaces.scala into multiple pieces
> -
>
> Key: SPARK-17925
> URL: https://issues.apache.org/jira/browse/SPARK-17925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> fileSourceInterfaces.scala has grown fairly large, making it difficult to 
> understand. To facilitate code review, I'd submit a separate patch to break 
> this into multiple pieces first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573640#comment-15573640
 ] 

Apache Spark commented on SPARK-17925:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15473

> Break fileSourceInterfaces.scala into multiple pieces
> -
>
> Key: SPARK-17925
> URL: https://issues.apache.org/jira/browse/SPARK-17925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> fileSourceInterfaces.scala has grown fairly large, making it difficult to 
> understand. To facilitate code review, I'd submit a separate patch to break 
> this into multiple pieces first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17925:


Assignee: Reynold Xin  (was: Apache Spark)

> Break fileSourceInterfaces.scala into multiple pieces
> -
>
> Key: SPARK-17925
> URL: https://issues.apache.org/jira/browse/SPARK-17925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> fileSourceInterfaces.scala has grown fairly large, making it difficult to 
> understand. To facilitate code review, I'd submit a separate patch to break 
> this into multiple pieces first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces

2016-10-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17925:
---

 Summary: Break fileSourceInterfaces.scala into multiple pieces
 Key: SPARK-17925
 URL: https://issues.apache.org/jira/browse/SPARK-17925
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


fileSourceInterfaces.scala has grown fairly large, making it difficult to 
understand. To facilitate code review, I'd submit a separate patch to break 
this into multiple pieces first.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17924) Consolidate streaming and batch write path

2016-10-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17924:
---

 Summary: Consolidate streaming and batch write path
 Key: SPARK-17924
 URL: https://issues.apache.org/jira/browse/SPARK-17924
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Structured streaming and normal SQL operations currently have two separate write 
paths, leading to a lot of duplicated, similar-looking functions and if 
branches. The purpose of this ticket is to consolidate the two as much as 
possible to make the write path clearer.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17678) Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port"

2016-10-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17678:
-
Affects Version/s: (was: 1.6.3)
   1.6.2

> Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port"
> 
>
> Key: SPARK-17678
> URL: https://issues.apache.org/jira/browse/SPARK-17678
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.2
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 1.6.3
>
>
> Spark 1.6 Scala-2.11 repl doesn't honor the "spark.replClassServer.port" 
> configuration, so users cannot set a fixed port number through 
> "spark.replClassServer.port".
> There's no issue in Spark 2.0+, since this class is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17678) Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port"

2016-10-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17678.
--
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 1.6.3

> Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port"
> 
>
> Key: SPARK-17678
> URL: https://issues.apache.org/jira/browse/SPARK-17678
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.3
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 1.6.3
>
>
> Spark 1.6 Scala-2.11 repl doesn't honor the "spark.replClassServer.port" 
> configuration, so users cannot set a fixed port number through 
> "spark.replClassServer.port".
> There's no issue in Spark 2.0+, since this class is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573563#comment-15573563
 ] 

Cody Koeninger commented on SPARK-17812:


So a short term question
- with your proposed interface, what, as a user, do you expect to happen when 
you specify startingOffsets for some but not all partitions?

A couple of medium term questions:
- Yes, auto.offset.reset is a mess.  Have you read 
https://issues.apache.org/jira/browse/KAFKA-3370
- What are you going to do when that ticket is resolved?  It should allow users 
to answer the questions you raised in very specific ways, that your interface 
does not.

And a really leading long term question:
- Is the purpose of your interface to do what you think users should be able to 
do, or what they need to be able to do?

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573526#comment-15573526
 ] 

Michael Armbrust commented on SPARK-17812:
--

As far as I understand it, {{auto.offset.reset}} is conflating a few things 
that make it hard for me to reason about exactly-once semantics in my query.  
It is answering all of the following:
 - Where do I start when I'm creating this {{group.id}} for the first time?
 - What do I do when a new partition is added to a topic I'm watching?
 - What do I do when the current offset is invalid because of retention?

The model of structured streaming is an append only table, where we are 
computing the same answer incrementally as if you were running a batch query 
over all of the data in the table.  The whole goal is to make it easy to reason 
about correctness and push the hard work of incremental processing and late 
data management into the optimizer / query planner.  As a result, I think we 
are trying to answer a different set of questions than a distributed set of 
consumers that share a {{group.id}}:
 - Should this append only table contain all of the historical data available, 
or do I begin at this moment and start appending?  This is what 
{{startingOffsets}} answers.  I think we should handle {{"earliest"}} (all 
data), {{"latest"}} (only data that arrives after now), and a very specific 
point in time across partitions (probably when some other query stopped 
running).
 - When I get into a situation where data has been deleted by the retention 
mechanism without me seeing it, what should I do?  Fail the query?  Or issue a 
warning and compute best effort on the data available.   This is what 
{{failOnDataLoss}} answers.

In particular, I think the kafka method of configuration makes it confusing to 
do something like, "starting now, compute some aggregation exactly once".  The 
documentation even points out some of the pitfalls:
bq. ... If this is set to largest, the consumer may lose some messages when the 
number of partitions, for the topics it subscribes to, changes on the broker. 
To prevent data loss during partition addition, set auto.offset.reset to 
smallest.

Really what I want here is, "begin the query at largest", but "start new 
partitions at smallest (and in fact, tell me if I'm so late joining a new 
partition that I have already lost some data)".



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-13 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573512#comment-15573512
 ] 

Ashish Shrowty commented on SPARK-17709:


[~smilegator] I compiled with the added debug information and here is the 
output -

{code:java}
scala> val d1 = spark.sql("select * from testext2")
d1: org.apache.spark.sql.DataFrame = [productid: int, price: float ... 2 more 
fields]

scala> val df1 = 
d1.groupBy("companyid","productid").agg(sum("price").as("price"))
df1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 
more field]

scala> val df2 = 
d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df2: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 
more field]

scala> df1.join(df2, Seq("companyid", "productid")).show
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] 
can not be resolved given input columns: [companyid, productid, price, count] ;;
'Join UsingJoin(Inner,List('companyid, 'productid))
:- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, 
sum(cast(price#123 as double)) AS price#166]
:  +- Project [productid#122, price#123, count#124, companyid#121]
: +- SubqueryAlias testext2
:+- Relation[productid#122,price#123,count#124,companyid#121] parquet
+- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, 
sum(cast(count#124 as bigint)) AS count#177L]
   +- Project [productid#122, price#123, count#124, companyid#121]
  +- SubqueryAlias testext2
 +- Relation[productid#122,price#123,count#124,companyid#121] parquet

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 48 elided

{code}

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using a spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573479#comment-15573479
 ] 

Cody Koeninger commented on SPARK-17812:


While some decision is better than none, can you help me understand why you 
don't believe me that auto.offset.reset is orthogonal to specifying specific 
starting positions?  Or do you just not believe it's important?

The reason you guys used a different name from auto.offset.reset is that the 
Kafka project's semantics for it are inadequate. But they will fix it, and when 
they do, the fact that you have conflated two unrelated things into one 
configuration in your API is going to cause problems.



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573469#comment-15573469
 ] 

Felix Cheung commented on SPARK-17919:
--

Earlier bug:
https://issues.apache.org/jira/browse/SPARK-12609


> Make timeout to RBackend configurable in SparkR
> ---
>
> Key: SPARK-17919
> URL: https://issues.apache.org/jira/browse/SPARK-17919
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> I am working on a project where {{gapply()}} is being used with a large 
> dataset that happens to be extremely skewed. On that skewed partition, the 
> user function takes more than 2 hours to return, and that turns out to be 
> longer than the timeout that we hardcode in SparkR for the backend connection.
> {code}
> connectBackend <- function(hostname, port, timeout = 6000) 
> {code}
> Ideally, users should be able to reconfigure Spark and increase the timeout. It 
> should be a small fix.
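
A rough sketch of the shape such a fix could take on the JVM side, assuming a new configuration key is introduced; the key name below is hypothetical, not something decided by this ticket:

{code}
// Read a user-configurable timeout (in seconds), keeping the current hardcoded value as the
// default, then thread it through to the R-side connectBackend(hostname, port, timeout) call.
import org.apache.spark.SparkConf

val conf = new SparkConf()
val backendTimeoutSecs = conf.getInt("spark.r.backendConnectionTimeout", 6000)  // hypothetical key
{code}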



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573459#comment-15573459
 ] 

Michael Armbrust edited comment on SPARK-17812 at 10/13/16 10:53 PM:
-

+1 to the suggested ways of subscribing, and for using "assign" as a familiar 
name.

I would probably leave it with a single option like this:
{code}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, 
"1", 4567}}""")
{code}

Where you can give -1 or -2 (again following Kafka) for specific partitions.  
{{startingTime}} could be added when we support time indexes.
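
For illustration, a sketch of how this could look on a structured streaming Kafka source if the proposal lands as written; the bootstrap servers, topic, and offsets are placeholders, and {{startingOffsets}} here is the proposed option, not a shipped API at the time of this discussion:

{code}
// Start partition 0 of topicFoo at offset 1234 and partition 1 at the earliest available offset.
// Per the Kafka convention referenced above: -2 = earliest, -1 = latest.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicFoo")
  .option("startingOffsets", """{"topicFoo": {"0": 1234, "1": -2}}""")
  .load()
{code}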


was (Author: marmbrus):
+1 to the suggested was of subscribing, and for using "assign" as a familiar 
name.

I would probably leave it with a single option like this:
{{.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
1234, "1", 4567\}\}"""}}

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573459#comment-15573459
 ] 

Michael Armbrust edited comment on SPARK-17812 at 10/13/16 10:53 PM:
-

+1 to the suggested ways of subscribing, and for using "assign" as a familiar 
name.

I would probably leave it with a single option like this:
{code}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, 
"1", 4567}}""")
{code}

Where you can give -1 or -2 (again following Kafka) for specific partitions.  
{{startingTime}} could be added when we support time indexes.


was (Author: marmbrus):
+1 to the suggested was of subscribing, and for using "assign" as a familiar 
name.

I would probably leave it with a single option like this:
{code}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, 
"1", 4567}}""")
{code}

Were you can give -1 or -2 (again following kafka) for specific partitions.  
{{startingTime}} could be added when we support time indexes.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573457#comment-15573457
 ] 

Ofir Manor commented on SPARK-17812:


I'm with you - I warned you it is bikeshedding...
I don't have a strong opinion, just a preference, and what you suggested is way 
better than the committed solution, so I'll get out of the loop.
Whatever [~marmbrus] and you are OK with - either way it would be a big step 
forward.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573459#comment-15573459
 ] 

Michael Armbrust commented on SPARK-17812:
--

+1 to the suggested ways of subscribing, and for using "assign" as a familiar 
name.

I would probably leave it with a single option like this:
{{.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
1234, "1", 4567\}\}"""}}

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573432#comment-15573432
 ] 

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 10:44 PM:
---

If you're seriously worried that people are going to get confused,

{noformat}
.option("defaultOffsets", "earliest" | "latest")
.option("specificOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""")
{noformat}

let those two at least not be mutually exclusive, and punt on the question of 
precedence until there's an actual startingTime or startingX ticket.




was (Author: c...@koeninger.org):
If you're seriously worried that people are going to get confused,

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""")
.option("defaultOffsets", "earliest" | "latest")
{noformat}

let those two at least not be mutually exclusive, and punt on the question of 
precedence until there's an actual startingTime or startingX ticket.



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573432#comment-15573432
 ] 

Cody Koeninger commented on SPARK-17812:


If you're seriously worried that people are going to get confused,

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""")
.option("defaultOffsets", "earliest" | "latest")
{noformat}

let those two at least not be mutually exclusive, and punt on the question of 
precedence until there's an actual startingTime or startingX ticket.
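
A sketch of how the two proposed options could combine on the same source; both option names are the proposals in this comment, not an existing API, and the servers, topic, and offsets are placeholders. Specific offsets would win for the partitions they name, and the default would cover everything else, including partitions discovered later.

{code}
// Hypothetical combination: partitions 0 and 1 of topicFoo start at the given offsets,
// while all other (or newly discovered) partitions fall back to "latest".
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicFoo")
  .option("specificOffsets", """{"topicFoo": {"0": 1234, "1": 4567}}""")
  .option("defaultOffsets", "latest")
  .load()
{code}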



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573395#comment-15573395
 ] 

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 10:25 PM:
---

1. We don't have lists, we have strings. Regexes and valid topic names have 
overlaps (dot is the obvious one).

2. Mapping directly to kafka method names means we don't have to come up with 
some other (weird and possibly overlapping) name when they add more ways to 
subscribe, we just use theirs.

3. I think this "starting X mssages" is a mess with kafka semantics for the 
reasons both you and I have already expressed.  At any rate, I think Michael 
already clearly punted the "starting X messages" case to a different ticket.

4. I think it's more than sufficiently clear as suggested; no one is going to 
expect that a specific offset they provided will be overruled by a single 
general default. The implementation is also crystal clear: seek to the position 
identified by startingTime, then seek to any specific offsets for specific 
partitions.

Yes, this is all bikeshedding, but it's bikeshedding that directly affects what 
people are actually able to do with the api.  Needlessly restricting it for 
reasons that have nothing to do with safety is just going to piss users off for 
no reason. Just because you don't have a use case that needs it, doesn't mean 
you should arbitrarily prevent users from doing it.

Please, just choose something and let me build it so that people can actually 
use the thing by the next release


was (Author: c...@koeninger.org):
1. we dont have lists, we have strings.  regexes and valid topic names have 
overlaps (dot is the obvious one).

2. Mapping directly to kafka method names means we don't have to come up with 
some other (weird and possibly overlapping) name when they add more ways to 
subscribe, we just use theirs.

3. I think this is a mess with kafka semantics for the reasons both you and I 
have already expressed.  At any rate, I think Michael already clearly punted 
the "starting X" case to a different topic.

4. I  think it's more than sufficiently clear as suggested, no one is going to 
expect that a specific offset they provided is going to be overruled by a 
general single default.   The implementation is also crystal clear - seek to 
the position identified by startingTime, then seek to any specific offsets for 
specific partitions

Yes, this is all bikeshedding, but it's bikeshedding that directly affects what 
people are actually able to do with the api.  Needlessly restricting it for 
reasons that have nothing to do with safety is just going to piss users off for 
no reason. Just because you don't have a use case that needs it, doesn't mean 
you should arbitrarily prevent users from doing it.

Please, just choose something and let me build it so that people can actually 
use the thing by the next release

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573395#comment-15573395
 ] 

Cody Koeninger commented on SPARK-17812:


1. We don't have lists, we have strings. Regexes and valid topic names have 
overlaps (dot is the obvious one).

2. Mapping directly to kafka method names means we don't have to come up with 
some other (weird and possibly overlapping) name when they add more ways to 
subscribe, we just use theirs.

3. I think this is a mess with kafka semantics for the reasons both you and I 
have already expressed.  At any rate, I think Michael already clearly punted 
the "starting X" case to a different topic.

4. I think it's more than sufficiently clear as suggested; no one is going to 
expect that a specific offset they provided will be overruled by a single 
general default. The implementation is also crystal clear: seek to the position 
identified by startingTime, then seek to any specific offsets for specific 
partitions.

Yes, this is all bikeshedding, but it's bikeshedding that directly affects what 
people are actually able to do with the api.  Needlessly restricting it for 
reasons that have nothing to do with safety is just going to piss users off for 
no reason. Just because you don't have a use case that needs it, doesn't mean 
you should arbitrarily prevent users from doing it.

Please, just choose something and let me build it so that people can actually 
use the thing by the next release

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon

2016-10-13 Thread Eugene Zhulenev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573363#comment-15573363
 ] 

Eugene Zhulenev edited comment on SPARK-17555 at 10/13/16 10:21 PM:


[~brdwrd]

I had the same issue, and I figured out why. Basically, your Spark executor is 
running in a Docker container and writes shuffle blocks to the /tmp directory, 
then passes this information to the External Shuffle Service. But the problem is 
that the shuffle service is running in its own Docker container. You have to 
share the same volume between your executors and the shuffle service.

I added this config to the shuffle service's Marathon configuration:

{code}
"container": {
"type": "DOCKER",
"docker": {
  "image": "mesosphere/spark:1.0.2-2.0.0",
  "network": "HOST"
},
"volumes": [
  {
"containerPath": "/var/data/spark",
"hostPath": "/var/data/spark",
"mode": "RW"
  }
]
  }
{code}

After that, you need to specify the *spark.mesos.executor.docker.volumes* and 
*spark.local.dir* parameters so that executors write data to the shared volume.
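
For completeness, a sketch of the matching executor-side settings; the paths mirror the Marathon volume above and should be adjusted to your environment:

{code}
// Point the executors' scratch space at the shared host volume so the external
// shuffle service container can see the shuffle files.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/var/data/spark")
  .set("spark.mesos.executor.docker.volumes", "/var/data/spark:/var/data/spark:rw")
{code}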


was (Author: ezhulenev):
[~brdwrd]

I had the same issue, and I figured out why. Basically you Spark executor is 
running in Docker container and writes shuffle blocks to /tmp directory, and 
passes this information to External Shuffle Service. But the problem is that 
shuffle service running in it's own Docker container. You have to share the 
save volume between your executors and shuffle service.

I added this config to shuffle service marathon configuration:

{code}
"container": {
"type": "DOCKER",
"docker": {
  "image": "mesosphere/spark:1.0.2-2.0.0",
  "network": "HOST"
},
"volumes": [
  {
"containerPath": "/var/data/spark",
"hostPath": "/var/data/spark",
"mode": "RW"
  }
]
  }
{code}

After that you need to specify spark.mesos.executor.docker.volumes and 
spark.local.dir parameters, so executors would write data to shared volume

> ExternalShuffleBlockResolver fails randomly with External Shuffle Service  
> and Dynamic Resource Allocation on Mesos running under Marathon
> --
>
> Key: SPARK-17555
> URL: https://issues.apache.org/jira/browse/SPARK-17555
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1, 1.6.2, 2.0.0
> Environment: Mesos using docker with external shuffle service running 
> on Marathon. Running code from pyspark shell in client mode.
>Reporter: Brad Willard
>  Labels: docker, dynamic_allocation, mesos, shuffle
>
> External Shuffle Service throws these errors about 90% of the time. It seems 
> to die between stages and work inconsistently, with this style of error 
> about missing files. I've tested this same behavior with all the Spark 
> versions listed on the JIRA using the pre-built Hadoop 2.6 distributions from 
> the Apache Spark download page. I also want to mention that everything works 
> successfully with dynamic resource allocation turned off.
> I have read other related bugs and have tried some of the 
> workarounds/suggestions. It seems like some people have blamed the switch from 
> Akka to Netty, which got me testing this in the 1.5* range with no luck. I'm 
> currently running these config options (informed by reading other bugs on JIRA 
> that seemed related to my problem). These settings have helped it work 
> sometimes instead of never.
> {code}
> spark.shuffle.service.port7338
> spark.shuffle.io.numConnectionsPerPeer  4
> spark.shuffle.io.connectionTimeout  18000s
> spark.shuffle.service.enabled   true
> spark.dynamicAllocation.enabled true
> {code}
> on the driver for pyspark submit I'm sending along this config
> {code}
> --conf 
> spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1
>  \
> --conf spark.shuffle.service.enabled=true \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.mesos.coarse=true \
> --conf spark.cores.max=100 \
> --conf spark.executor.uri=$SPARK_EXECUTOR_URI \
> --conf spark.shuffle.service.port=7338 \
> --executor-memory 15g
> {code}
> Under Marathon I'm pinning each external shuffle service to an agent and 
> starting the service like this.
> {code}
> $SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f 
> $SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService*
> {code}
> On startup it seems like all is well
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Python 

[jira] [Commented] (SPARK-14212) Add configuration element for --packages option

2016-10-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573381#comment-15573381
 ] 

holdenk commented on SPARK-14212:
-

Please do! I think I've outlined the basic steps in my comment above, but let 
me know if you have any questions. Thanks for helping out :)

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573380#comment-15573380
 ] 

holdenk commented on SPARK-9487:


+1 to [~srowen]'s comment. I would not be surprised to see some test failures 
because of the implicit change in the default partitioning as a result - but 
for most of those just updating the results will be the right course of action. 
Let me know if you have any questions [~kanjilal] and welcome to PySpark :)

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
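
A minimal sketch of pinning the same worker-thread count on the Scala/Java side (the app name is illustrative):

{code}
// Use local[4] in Scala/Java tests to match the PySpark test harness, so results
// that depend on partition IDs (e.g. random number generation) line up across languages.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[4]").setAppName("unit-test")
val sc = new SparkContext(conf)
{code}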



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17923) dateFormat unexpected kwarg to df.write.csv

2016-10-13 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-17923:
--

 Summary: dateFormat unexpected kwarg to df.write.csv
 Key: SPARK-17923
 URL: https://issues.apache.org/jira/browse/SPARK-17923
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Evan Zamir
Priority: Minor


Calling like this:
{code}writer.csv(path, header=header, sep=sep, compression=compression, 
dateFormat=date_format){code}

Getting the following error:
{code}TypeError: csv() got an unexpected keyword argument 'dateFormat'{code}

This error comes after being called with {code}date_format='-MM-dd'{code} 
as an argument.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12916) Support Row.fromSeq and Row.toSeq methods in pyspark

2016-10-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573371#comment-15573371
 ] 

holdenk commented on SPARK-12916:
-

+1 with Hyukjin, I'll go ahead and close this as a "Won't Fix"

> Support Row.fromSeq and Row.toSeq methods in pyspark
> 
>
> Key: SPARK-12916
> URL: https://issues.apache.org/jira/browse/SPARK-12916
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: dataframe, pyspark, row, sql
>
> Pyspark should also have access to the Row functions like fromSeq and toSeq 
> which are exposed in the scala api. 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
> This will be useful when constructing custom columns from functions called on 
> dataframes. A good example is present in the following SO thread: 
> http://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe
> {code:python}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> def foobarFunc(x: Long, y: Double, z: String): Seq[Any] = 
>   Seq(x * y, z.head.toInt * y)
> val schema = StructType(df.schema.fields ++
>   Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
> val rows = df.rdd.map(r => Row.fromSeq(
>   r.toSeq ++
>   foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"
> val df2 = sqlContext.createDataFrame(rows, schema)
> df2.show
> // +---++---++-+
> // |  x|   y|  z| foo|  bar|
> // +---++---++-+
> // |  1| 3.0|  a| 3.0|291.0|
> // |  2|-1.0|  b|-2.0|-98.0|
> // |  3| 0.0|  c| 0.0|  0.0|
> // +---++---++-+
> {code}
> I am ready to work on this feature. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12916) Support Row.fromSeq and Row.toSeq methods in pyspark

2016-10-13 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-12916.
---
Resolution: Won't Fix

Since Row is now a subclass of Tuple we don't really need this anymore.

> Support Row.fromSeq and Row.toSeq methods in pyspark
> 
>
> Key: SPARK-12916
> URL: https://issues.apache.org/jira/browse/SPARK-12916
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: dataframe, pyspark, row, sql
>
> Pyspark should also have access to the Row functions like fromSeq and toSeq 
> which are exposed in the scala api. 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
> This will be useful when constructing custom columns from functions called on 
> dataframes. A good example is present in the following SO thread: 
> http://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe
> {code:python}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> def foobarFunc(x: Long, y: Double, z: String): Seq[Any] = 
>   Seq(x * y, z.head.toInt * y)
> val schema = StructType(df.schema.fields ++
>   Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))
> val rows = df.rdd.map(r => Row.fromSeq(
>   r.toSeq ++
>   foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"
> val df2 = sqlContext.createDataFrame(rows, schema)
> df2.show
> // +---++---++-+
> // |  x|   y|  z| foo|  bar|
> // +---++---++-+
> // |  1| 3.0|  a| 3.0|291.0|
> // |  2|-1.0|  b|-2.0|-98.0|
> // |  3| 0.0|  c| 0.0|  0.0|
> // +---++---++-+
> {code}
> I am ready to work on this feature. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16720) Loading CSV file with 2k+ columns fails during attribute resolution on action

2016-10-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573367#comment-15573367
 ] 

holdenk commented on SPARK-16720:
-

Sounds good - go ahead and close this :)

> Loading CSV file with 2k+ columns fails during attribute resolution on action
> -
>
> Key: SPARK-16720
> URL: https://issues.apache.org/jira/browse/SPARK-16720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: holdenk
>
> Example shell for repro:
> {quote}
> scala> val df =spark.read.format("csv").option("header", 
> "true").option("inferSchema", "true").load("/home/holden/Downloads/ex*.csv")
> df: org.apache.spark.sql.DataFrame = [Date: string, Lifetime Total Likes: int 
> ... 2125 more fields]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(Date,StringType,true), StructField(Lifetime Total 
> Likes,IntegerType,true), StructField(Daily New Likes,IntegerType,true), 
> StructField(Daily Unlikes,IntegerType,true), StructField(Daily Page Engaged 
> Users,IntegerType,true), StructField(Weekly Page Engaged 
> Users,IntegerType,true), StructField(28 Days Page Engaged 
> Users,IntegerType,true), StructField(Daily Like Sources - On Your 
> Page,IntegerType,true), StructField(Daily Total Reach,IntegerType,true), 
> StructField(Weekly Total Reach,IntegerType,true), StructField(28 Days Total 
> Reach,IntegerType,true), StructField(Daily Organic Reach,IntegerType,true), 
> StructField(Weekly Organic Reach,IntegerType,true), StructField(28 Days 
> Organic Reach,IntegerType,true), StructField(Daily T...
> scala> df.take(1)
> [GIANT LIST OF COLUMNS]
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> 

[jira] [Commented] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon

2016-10-13 Thread Eugene Zhulenev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573363#comment-15573363
 ] 

Eugene Zhulenev commented on SPARK-17555:
-

[~brdwrd]

I had the same issue, and I figured out why. Basically your Spark executor is 
running in a Docker container and writes shuffle blocks to the /tmp directory, 
then passes this information to the External Shuffle Service. The problem is 
that the shuffle service is running in its own Docker container, so you have to 
share the same volume between your executors and the shuffle service.

I added this config to the shuffle service's Marathon configuration:

{code}
"container": {
"type": "DOCKER",
"docker": {
  "image": "mesosphere/spark:1.0.2-2.0.0",
  "network": "HOST"
},
"volumes": [
  {
"containerPath": "/var/data/spark",
"hostPath": "/var/data/spark",
"mode": "RW"
  }
]
  }
{code}

After that you need to set the spark.mesos.executor.docker.volumes and 
spark.local.dir parameters so that executors write their data to the shared 
volume.
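
For example (a minimal sketch, assuming the /var/data/spark volume shown above; 
the equivalent --conf flags on spark-submit work just as well):

{code}
// Set before the SparkContext / SparkSession is created, or pass as --conf.
// Paths are illustrative and must match the volume mounted for the shuffle service.
val conf = new org.apache.spark.SparkConf()
  .set("spark.mesos.executor.docker.volumes", "/var/data/spark:/var/data/spark:rw")
  .set("spark.local.dir", "/var/data/spark")
{code}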

> ExternalShuffleBlockResolver fails randomly with External Shuffle Service  
> and Dynamic Resource Allocation on Mesos running under Marathon
> --
>
> Key: SPARK-17555
> URL: https://issues.apache.org/jira/browse/SPARK-17555
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1, 1.6.2, 2.0.0
> Environment: Mesos using docker with external shuffle service running 
> on Marathon. Running code from pyspark shell in client mode.
>Reporter: Brad Willard
>  Labels: docker, dynamic_allocation, mesos, shuffle
>
> External Shuffle Service throws these errors about 90% of the time. It seems 
> to die between stages and work inconsistently, with this style of error about 
> missing files. I've tested this same behavior with all the Spark versions 
> listed on the JIRA, using the pre-built Hadoop 2.6 distributions from the 
> Apache Spark download page. I also want to mention that everything works 
> successfully with dynamic resource allocation turned off.
> I have read other related bugs and have tried some of the 
> workarounds/suggestions. It seems like some people have blamed the switch from 
> Akka to Netty, which got me testing this in the 1.5.x range with no luck. I'm 
> currently running these config options (informed by reading other bugs on JIRA 
> that seemed related to my problem). These settings have helped it work 
> sometimes instead of never.
> {code}
> spark.shuffle.service.port7338
> spark.shuffle.io.numConnectionsPerPeer  4
> spark.shuffle.io.connectionTimeout  18000s
> spark.shuffle.service.enabled   true
> spark.dynamicAllocation.enabled true
> {code}
> on the driver for pyspark submit I'm sending along this config
> {code}
> --conf 
> spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1
>  \
> --conf spark.shuffle.service.enabled=true \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.mesos.coarse=true \
> --conf spark.cores.max=100 \
> --conf spark.executor.uri=$SPARK_EXECUTOR_URI \
> --conf spark.shuffle.service.port=7338 \
> --executor-memory 15g
> {code}
> Under Marathon I'm pinning each external shuffle service to an agent and 
> starting the service like this.
> {code}
> $SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f 
> $SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService*
> {code}
> On startup it seems like all is well
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Python version 3.5.2 (default, Jul  2 2016 17:52:12)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> 16/09/15 11:35:53 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now 
> >>> TASK_RUNNING
> 16/09/15 11:35:53 INFO MesosExternalShuffleClient: Successfully registered 
> app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service.
> 16/09/15 11:35:55 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now 
> TASK_RUNNING
> 16/09/15 11:35:55 INFO MesosExternalShuffleClient: Successfully registered 
> app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service.
> 16/09/15 11:35:56 INFO CoarseMesosSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (mesos-agent002.[redacted]:61281) with ID 
> 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1
> 16/09/15 11:35:56 INFO ExecutorAllocationManager: New executor 
> 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 has registered (new total is 1)
> 16/09/15 11:35:56 

[jira] [Commented] (SPARK-10972) UDFs in SQL joins

2016-10-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573364#comment-15573364
 ] 

holdenk commented on SPARK-10972:
-

I don't think that actually solves the problem the user is looking for. You 
could do a full cross product and filter afterwards, but that's pretty expensive.
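
For reference, a minimal sketch of that cross-product-and-filter workaround, 
assuming a SparkSession named {{spark}}; the UDF, column names, and toy data 
are illustrative placeholders:

{code}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df1 = Seq(0.0, 1.0, 2.5).toDF("col1")
val df2 = Seq(0.05, 3.0).toDF("col2")

// Stand-in for whatever distance function the user actually wants to join on.
val euclideanDistance = udf((a: Double, b: Double) => math.abs(a - b))

// Cartesian product first, then filter on the UDF: every pair of rows is
// produced before the predicate runs, which is what makes this so expensive.
// (Spark 2.x may also require spark.sql.crossJoin.enabled=true for this join.)
val close = df1.join(df2).filter(euclideanDistance(df1("col1"), df2("col2")) < 0.1)
{code}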

> UDFs in SQL joins
> -
>
> Key: SPARK-10972
> URL: https://issues.apache.org/jira/browse/SPARK-10972
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Malak
>
> Currently expressions used to .join() in DataFrames are limited to column 
> names plus the operators exposed in org.apache.spark.sql.Column.
> It would be nice to be able to .join() based on a UDF, such as, say, 
> euclideanDistance(col1, col2) < 0.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573360#comment-15573360
 ] 

holdenk commented on SPARK-650:
---

Would people feel OK if we marked this as a duplicate of SPARK-636, since this 
does seem like a subset of 636?

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly

2016-10-13 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17922:
--
Description: 
I am using Spark 2.0.
I am seeing a class loading issue because whole-stage code generation produces 
multiple classes with the same name, 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass".
I am using a DataFrame transform, and within the transform I use OSGi.
OSGi replaces the thread context class loader with ContextFinder, which looks 
at all the class loaders on the stack to find the newly generated class. It 
finds the byte class loader of the GeneratedClass whose inner class is 
GeneratedIterator (instead of falling back to the byte class loader created by 
the Janino compiler). Since the class name is the same, that byte class loader 
loads the class and returns GeneratedClass$GeneratedIterator instead of the 
expected GeneratedClass$UnsafeProjection.

Can we generate different classes with different names, or is it expected to 
generate only one class?
This is roughly what I am trying to do:
{noformat} 
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.databricks.spark.avro._

  def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
//Initialize osgi
 (rows:Iterator[Row]) => {
 var outi = Iterator[Row]() 
 while(rows.hasNext) {
 val r = rows.next 
 outi = outi.++(Iterator(Row(r.get(0))))
 } 
 //val ors = Row("abc")   
 //outi =outi.++( Iterator(ors))  
 outi
 }
  }

def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
 (d:DataFrame) => {
  val inType = d.schema
  val rdd = d.rdd.mapPartitions(exePart(outType))
  d.sqlContext.createDataFrame(rdd, outType)
}
   
  }

val df = spark.read.avro("file:///data/builds/a1.avro")
val df1 = df.select($"id2").filter(false)
val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
true)::Nil))).createOrReplaceTempView("tbl0")

spark.sql("insert overwrite table testtable select p1 from tbl0")
{noformat} 

  was:
I am using spark 2.0
Seeing class loading issue because the whole stage code gen is generating 
multiple classes with same name as 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass"
I am using dataframe transform. and within transform i use Osgi.
Osgi replaces the thread context class loader to ContextFinder which looks at 
all the class loaders in the stack to find out the new generated class and 
finds the GeneratedClass with inner class GeneratedIterator byteclass 
loader(instead of falling back to the byte class loader created by janino 
compiler), since the class name is same that byte class loader loads the class 
and returns GeneratedClass$GeneratedIterator instead of expected 
GeneratedClass$UnsafeProjection.

Can we generate different classes with different names or is it expected to 
generate one class only? 
This is the somewhat I am trying to do 

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.databricks.spark.avro._

  def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
//Initialize osgi
 (rows:Iterator[Row]) => {
 var outi = Iterator[Row]() 
 while(rows.hasNext) {
 val r = rows.next 
 outi = outi.++(Iterator(Row(r.get(0))))
 } 
 //val ors = Row("abc")   
 //outi =outi.++( Iterator(ors))  
 outi
 }
  }

def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
 (d:DataFrame) => {
  val inType = d.schema
  val rdd = d.rdd.mapPartitions(exePart(outType))
  d.sqlContext.createDataFrame(rdd, outType)
}
   
  }

val df = spark.read.avro("file:///data/builds/a1.avro")
val df1 = df.select($"id2").filter(false)
val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
true)::Nil))).createOrReplaceTempView("tbl0")

spark.sql("insert overwrite table testtable select p1 from tbl0")



> ClassCastException java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator 
> cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection 
> -
>
> Key: SPARK-17922
> URL: https://issues.apache.org/jira/browse/SPARK-17922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
>
> I am using spark 2.0
> Seeing class loading issue because the whole stage code gen is generating 
> multiple classes with same name as 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass"
> I am 

[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly

2016-10-13 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17922:
--
Description: 
I am using spark 2.0
Seeing class loading issue because the whole stage code gen is generating 
multiple classes with same name as 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass"
I am using dataframe transform. and within transform i use Osgi.
Osgi replaces the thread context class loader to ContextFinder which looks at 
all the class loaders in the stack to find out the new generated class and 
finds the GeneratedClass with inner class GeneratedIterator byteclass 
loader(instead of falling back to the byte class loader created by janino 
compiler), since the class name is same that byte class loader loads the class 
and returns GeneratedClass$GeneratedIterator instead of expected 
GeneratedClass$UnsafeProjection.

Can we generate different classes with different names or is it expected to 
generate one class only? 
This is the somewhat I am trying to do 

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.databricks.spark.avro._

  def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
//Initialize osgi
 (rows:Iterator[Row]) => {
 var outi = Iterator[Row]() 
 while(rows.hasNext) {
 val r = rows.next 
 outi = outi.++(Iterator(Row(r.get(0))))
 } 
 //val ors = Row("abc")   
 //outi =outi.++( Iterator(ors))  
 outi
 }
  }

def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
 (d:DataFrame) => {
  val inType = d.schema
  val rdd = d.rdd.mapPartitions(exePart(outType))
  d.sqlContext.createDataFrame(rdd, outType)
}
   
  }

val df = spark.read.avro("file:///data/builds/a1.avro")
val df1 = df.select($"id2").filter(false)
val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
true)::Nil))).createOrReplaceTempView("tbl0")

spark.sql("insert overwrite table testtable select p1 from tbl0")


  was:
I am using spark 2.0
Seeing class loading issue because the whole stage code gen is generating 
multiple classes with same name as 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass"
I am using dataframe transform. and within transform i use Osgi.
Osgi replaces the thread context class loader to ContextFinder which looks at 
all the class loaders in the stack to find out the new generated class and 
finds the GeneratedClass with inner class GeneratedIterator byteclass 
loader(instead of falling back to the byte class loader created by janino 
compiler), since the class name is same that byte class loader loads the class 
and returns GeneratedClass$GeneratedIterator instead of expected 
GeneratedClass$UnsafeProjection.

Can we generate different classes with different names or is it expected to 
generate one class only.
This is the rough repro

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.databricks.spark.avro._

  def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
//Initialize osgi
 (rows:Iterator[Row]) => {
 var outi = Iterator[Row]() 
 while(rows.hasNext) {
 val r = rows.next 
 outi = outi.++(Iterator(Row(r.get(0))))
 } 
 //val ors = Row("abc")   
 //outi =outi.++( Iterator(ors))  
 outi
 }
  }

def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
 (d:DataFrame) => {
  val inType = d.schema
  val rdd = d.rdd.mapPartitions(exePart(outType))
  d.sqlContext.createDataFrame(rdd, outType)
}
   
  }

val df = spark.read.avro("file:///data/builds/a1.avro")
val df1 = df.select($"id2").filter(false)
val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
true)::Nil))).createOrReplaceTempView("tbl0")

spark.sql("insert overwrite table testtable select p1 from tbl0")



> ClassCastException java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator 
> cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection 
> -
>
> Key: SPARK-17922
> URL: https://issues.apache.org/jira/browse/SPARK-17922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
>
> I am using spark 2.0
> Seeing class loading issue because the whole stage code gen is generating 
> multiple classes with same name as 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass"
> I am using dataframe transform. and within 

[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573341#comment-15573341
 ] 

Ofir Manor commented on SPARK-17812:


Thanks Cody! Great to have a concrete example.
I have some comments, but it's mostly bikeshedding.
1. subscribe vs. subscribePattern --> personally, I would combine them both into 
"subscribe" - no need to burden the user with the different Kafka API nuances. 
It can take either a list of discrete topics or a pattern.
2. It would be much clearer if "assign" were called subscribeSomething, so the 
user would choose one "subscribe..." and one (or more) "starting...".
Not sure I have a good name though - subscribeCustom?
You could even use the regular subscribe for that (and be smarter with the 
pattern matching) - I think it would just work, and if someone tries to be 
funny (combining an asterisk and partitions) we could just error out.
3. I like startingTime... pretty neat.
We could hypothetically add {{.option("startingMessages", long)}} to support 
Michael's "just start with the 1000 most recent messages"...
4. As I said before, I'd rather have all starting* options be mutually 
exclusive. Yes, it blocks some edge cases, on purpose, but it makes the API and 
code way clearer (think about startingMessages interacting with startingOffsets, 
etc.).
I think it would be easier to later regret this and allow multiple starting* 
options (opening all sorts of esoteric combinations) than to clean it up later 
if users find it confusing and unneeded.
Anyway, as long as it is functional I'm good with it, even if it is less 
aesthetic.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly

2016-10-13 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17922:
--
Description: 
I am using spark 2.0
Seeing class loading issue because the whole stage code gen is generating 
multiple classes with same name as 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass"
I am using dataframe transform. and within transform i use Osgi.
Osgi replaces the thread context class loader to ContextFinder which looks at 
all the class loaders in the stack to find out the new generated class and 
finds the GeneratedClass with inner class GeneratedIterator byteclass 
loader(instead of falling back to the byte class loader created by janino 
compiler), since the class name is same that byte class loader loads the class 
and returns GeneratedClass$GeneratedIterator instead of expected 
GeneratedClass$UnsafeProjection.

Can we generate different classes with different names or is it expected to 
generate one class only.
This is the rough repro

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.databricks.spark.avro._

  def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
//Initialize osgi
 (rows:Iterator[Row]) => {
 var outi = Iterator[Row]() 
 while(rows.hasNext) {
 val r = rows.next 
 outi = outi.++(Iterator(Row(r.get(0))))
 } 
 //val ors = Row("abc")   
 //outi =outi.++( Iterator(ors))  
 outi
 }
  }

def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
 (d:DataFrame) => {
  val inType = d.schema
  val rdd = d.rdd.mapPartitions(exePart(outType))
  d.sqlContext.createDataFrame(rdd, outType)
}
   
  }

val df = spark.read.avro("file:///data/builds/a1.avro")
val df1 = df.select($"id2").filter(false)
val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
true)::Nil))).createOrReplaceTempView("tbl0")

spark.sql("insert overwrite table testtable select p1 from tbl0")


  was:
I am using spark 2.0
Seeing class loading issue because the whole stage code gen is generating 
multiple classes with same name as 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass"
I am using dataframe transform. and within transform i use Osgi.
Osgi replaces the thread context class loader to ContextFinder which looks at 
all the class loaders in the stack to find out the new generated class and 
finds the GeneratedClass with inner class GeneratedIterator byteclass 
loader(instead of falling back to the byte class loader created by janino 
compiler), since the class name is same that byte class loader loads the class 
and returns GeneratedClass$GeneratedIterator instead of expected 
GeneratedClass$UnsafeProjection.



> ClassCastException java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator 
> cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection 
> -
>
> Key: SPARK-17922
> URL: https://issues.apache.org/jira/browse/SPARK-17922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
>
> I am using spark 2.0
> Seeing class loading issue because the whole stage code gen is generating 
> multiple classes with same name as 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass"
> I am using dataframe transform. and within transform i use Osgi.
> Osgi replaces the thread context class loader to ContextFinder which looks at 
> all the class loaders in the stack to find out the new generated class and 
> finds the GeneratedClass with inner class GeneratedIterator byteclass 
> loader(instead of falling back to the byte class loader created by janino 
> compiler), since the class name is same that byte class loader loads the 
> class and returns GeneratedClass$GeneratedIterator instead of expected 
> GeneratedClass$UnsafeProjection.
> Can we generate different classes with different names or is it expected to 
> generate one class only.
> This is the rough repro
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import com.databricks.spark.avro._
>   def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
> //Initialize osgi
>  (rows:Iterator[Row]) => {
>  var outi = Iterator[Row]() 
>  while(rows.hasNext) {
>  val r = rows.next 
>  outi = outi.++(Iterator(Row(r.get(0  
>  } 
>  //val ors = Row("abc")   
>  //outi =outi.++( Iterator(ors))  
>  outi
>  }
>   

[jira] [Created] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly

2016-10-13 Thread kanika dhuria (JIRA)
kanika dhuria created SPARK-17922:
-

 Summary: ClassCastException java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator 
cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection 
 Key: SPARK-17922
 URL: https://issues.apache.org/jira/browse/SPARK-17922
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: kanika dhuria


I am using Spark 2.0.
I am seeing a class loading issue because whole-stage code generation produces 
multiple classes with the same name, 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass".
I am using a DataFrame transform, and within the transform I use OSGi.
OSGi replaces the thread context class loader with ContextFinder, which looks 
at all the class loaders on the stack to find the newly generated class. It 
finds the byte class loader of the GeneratedClass whose inner class is 
GeneratedIterator (instead of falling back to the byte class loader created by 
the Janino compiler). Since the class name is the same, that byte class loader 
loads the class and returns GeneratedClass$GeneratedIterator instead of the 
expected GeneratedClass$UnsafeProjection.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17460) Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception

2016-10-13 Thread Chris Perluss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Perluss updated SPARK-17460:
--
Component/s: SQL

> Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
> --
>
> Key: SPARK-17460
> URL: https://issues.apache.org/jira/browse/SPARK-17460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark 2.0 in local mode as well as on GoogleDataproc
>Reporter: Chris Perluss
>
> Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes 
> in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0.
> The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
> datatype Int.  In my dataset, there is an Array column whose data size 
> exceeds the limits of an Int and so the data size becomes negative.
> The issue can be repeated by running this code in REPL:
> val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that 
> long of a string")))))).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this 
> to work
> val stats = ds.logicalPlan.statistics
> yields
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
> Statistics(-1890686892,false)
> This causes joinWith to perform a broadcast join even though my data is 
> gigabytes in size, which of course causes the executors to run out of 
> memory.
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the 
> logicalPlan.statistics.sizeInBytes is a large negative number and thus it is 
> less than the join threshold of -1.
> I've been able to work around this issue by setting 
> autoBroadcastJoinThreshold to a very large negative number.
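
For reference, a minimal sketch of that workaround, assuming a SparkSession 
named {{spark}}; the exact value is illustrative, it just has to be below any 
(possibly overflowed, negative) size estimate the planner can produce:

{code}
// Push the auto-broadcast threshold below every possible size estimate so the
// planner stops choosing a broadcast join for the affected Dataset.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", Long.MinValue.toString)
{code}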



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-10-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573287#comment-15573287
 ] 

holdenk commented on SPARK-15369:
-

I can understand the hesitancy to adopt this long term - I wish we could 
explore exposing this as a developer API instead, but if that isn't of 
interest I'll take a look at adding this as a Spark package (or adding it to 
one of the meta Spark packages like Bahir). In the meantime I'll do some more 
poking at Arrow and see if it's getting to a point where we could selectively 
start using it in PySpark.

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17812:
---
Comment: was deleted

(was: One other slightly ugly thing...

{noformat}
// starting topicpartitions, no explicit offset
.option("assign", """{"topicfoo": [0, 1],"topicbar": [0, 1]}"""

// do you allow specifying with explicit offsets in the same config option? 
// or force it all into startingOffsetForRealzYo?
.option("assign", """{ "topicfoo" :{ "0": 1234, "1": 4567 }, "topicbar" : { 
"0": 1234, "1": 4567 }}""")
{noformat})

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17731) Metrics for Structured Streaming

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573216#comment-15573216
 ] 

Apache Spark commented on SPARK-17731:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15472

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573166#comment-15573166
 ] 

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 9:17 PM:
--

Here's my concrete suggestion:

3 mutually exclusive ways of subscribing:

{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, no inline offsets

2 non-mutually exclusive ways of specifying starting position, explicit 
startingOffsets obviously take priority:

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""")
.option("startingTime", "earliest" | "latest" | long)
{noformat}
where long is a timestamp; work to be done on that later.
Note that even Kafka 0.8 has an API for time (a really crappy one, based on log 
file modification time), so defining startingTime now doesn't necessarily 
exclude pursuing timestamps later.
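
Putting those options together, a hypothetical end-to-end read could look like 
the sketch below, assuming a SparkSession named {{spark}} (option names follow 
the proposal above and are not a settled API; broker address and offsets are 
illustrative):

{code}
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")   // assumed broker address
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  // partitions without an explicit offset would fall back to startingTime / the default
  .option("startingOffsets", """{"topicfoo": {"0": 1234, "1": 4567}}""")
  .load()
{code}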




was (Author: c...@koeninger.org):
Here's my concrete suggestion:

3 mutually exclusive ways of subscribing:

{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, no inline offsets

2 non-mutually exclusive ways of specifying starting position, explicit 
startingOffsets obviously take priority:

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}""")
.option("startingTime", "earliest" | "latest" | long)
{noformat}
where long is a timestamp, work to be done on that later.
Note that even kafka 0.8 has a (really crappy based on log file modification 
time) api for time so later pursuing timestamps startingTime doesn't 
necessarily exclude it



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17919:


Assignee: (was: Apache Spark)

> Make timeout to RBackend configurable in SparkR
> ---
>
> Key: SPARK-17919
> URL: https://issues.apache.org/jira/browse/SPARK-17919
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> I am working on a project where {{gapply()}} is being used with a large 
> dataset that happens to be extremely skewed. On that skewed partition, the 
> user function takes more than 2 hours to return, which turns out to be 
> longer than the timeout that we hardcode in SparkR for the backend connection.
> {code}
> connectBackend <- function(hostname, port, timeout = 6000) 
> {code}
> Ideally the user should be able to reconfigure Spark and increase the timeout. It 
> should be a small fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573210#comment-15573210
 ] 

Apache Spark commented on SPARK-17919:
--

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/15471

> Make timeout to RBackend configurable in SparkR
> ---
>
> Key: SPARK-17919
> URL: https://issues.apache.org/jira/browse/SPARK-17919
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> I am working on a project where {{gapply()}} is being used with a large 
> dataset that happens to be extremely skewed. On that skewed partition, the 
> user function takes more than 2 hours to return and that turns out to be 
> larger than the timeout that we hardcode in SparkR for backend connection.
> {code}
> connectBackend <- function(hostname, port, timeout = 6000) 
> {code}
> Ideally user should be able to reconfigure Spark and increase the timeout. It 
> should be a small fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17919:


Assignee: Apache Spark

> Make timeout to RBackend configurable in SparkR
> ---
>
> Key: SPARK-17919
> URL: https://issues.apache.org/jira/browse/SPARK-17919
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Apache Spark
>
> I am working on a project where {{gapply()}} is being used with a large 
> dataset that happens to be extremely skewed. On that skewed partition, the 
> user function takes more than 2 hours to return and that turns out to be 
> larger than the timeout that we hardcode in SparkR for backend connection.
> {code}
> connectBackend <- function(hostname, port, timeout = 6000) 
> {code}
> Ideally user should be able to reconfigure Spark and increase the timeout. It 
> should be a small fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17661) Consolidate various listLeafFiles implementations

2016-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17661.
-
   Resolution: Fixed
 Assignee: Peter Lee
Fix Version/s: 2.1.0

> Consolidate various listLeafFiles implementations
> -
>
> Key: SPARK-17661
> URL: https://issues.apache.org/jira/browse/SPARK-17661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.1.0
>
>
> There are 4 listLeafFiles-related functions in Spark:
> - ListingFileCatalog.listLeafFiles (which calls 
> HadoopFsRelation.listLeafFilesInParallel if the number of paths passed in is 
> greater than a threshold; if it is lower, then it has its own serial version 
> implemented)
> - HadoopFsRelation.listLeafFiles (called only by 
> HadoopFsRelation.listLeafFilesInParallel)
> - HadoopFsRelation.listLeafFilesInParallel (called only by 
> ListingFileCatalog.listLeafFiles)
> It is actually very confusing and error prone because there are effectively 
> two distinct implementations for the serial version of listing leaf files. 
> This code can be improved by:
> - Move all file listing code into ListingFileCatalog, since it is the only 
> class that needs this.
> - Keep only one function for listing files in serial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573166#comment-15573166
 ] 

Cody Koeninger commented on SPARK-17812:


Here's my concrete suggestion:

3 mutually exclusive ways of subscribing:

{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, no inline offsets

2 non-mutually exclusive ways of specifying starting position, explicit 
startingOffsets obviously take priority:

{noformat}
.option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}""")
.option("startingTime", "earliest" | "latest" | long)
{noformat}
where long is a timestamp, work to be done on that later.
Note that even kafka 0.8 has a (really crappy based on log file modification 
time) api for time so later pursuing timestamps startingTime doesn't 
necessarily exclude it



> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17921:


Assignee: (was: Apache Spark)

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The checkpointLocation option in memory streams in StructuredStreaming is not 
> used during recovery. However, it can use this location if it is being set. 
> However, during recovery, if this location was set, we get an exception 
> saying that we will not use this location for recovery, please delete it. 
> It's better to just fail before you start the stream in the first place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17921:


Assignee: Apache Spark

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> The checkpointLocation option in memory streams in StructuredStreaming is not 
> used during recovery. However, it can use this location if it is being set. 
> However, during recovery, if this location was set, we get an exception 
> saying that we will not use this location for recovery, please delete it. 
> It's better to just fail before you start the stream in the first place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573163#comment-15573163
 ] 

Apache Spark commented on SPARK-17921:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15470

> checkpointLocation being set in memory streams fail after restart. Should 
> fail fast
> ---
>
> Key: SPARK-17921
> URL: https://issues.apache.org/jira/browse/SPARK-17921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The checkpointLocation option in memory streams in StructuredStreaming is not 
> used during recovery. However, it can use this location if it is being set. 
> However, during recovery, if this location was set, we get an exception 
> saying that we will not use this location for recovery, please delete it. 
> It's better to just fail before you start the stream in the first place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast

2016-10-13 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17921:
---

 Summary: checkpointLocation being set in memory streams fail after 
restart. Should fail fast
 Key: SPARK-17921
 URL: https://issues.apache.org/jira/browse/SPARK-17921
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 2.0.1, 2.0.0
Reporter: Burak Yavuz


The checkpointLocation option in memory streams in Structured Streaming is not 
used during recovery; however, the stream can still use this location if it is 
set. During recovery, if this location was set, we get an exception saying that 
we will not use this location for recovery and asking you to delete it.

It's better to just fail before you start the stream in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17731) Metrics for Structured Streaming

2016-10-13 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-17731.
---
  Resolution: Fixed
   Fix Version/s: 2.1.0
Target Version/s: 2.0.2, 2.1.0  (was: 2.1.0)

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573109#comment-15573109
 ] 

Ofir Manor commented on SPARK-17812:


Regarding (1) - of course it is *all* data in the source, as of query start, 
just the same as a file system directory or a database table - I'm not sure a 
disclaimer that the directory or table could have had different data in the 
past adds anything but confusion...
Anyway, the startingOffset is confusing because it seems you want a different 
parameter for "assign" --> to explicitly specify starting offsets.
For your use case, I would add:
5. Give me nnn messages (not the last ones). We still do one of the above 
options (trying to go back nnn messages, somehow split between the 
topic-partitions involved), but do not provide a more explicit guarantee like 
"last nnn". Generally, the distribution of messages to partitions doesn't have 
to be round-robin or uniform; it is strongly based on the key (for example, it 
could be a state, could be a URL, etc.).
Anyway, I haven't seen a concrete suggestion on how to specify offsets or a 
timestamp, so I think that would be the next step in this ticket (I suggested 
you could condense it all into one option to avoid dependencies between options, 
but I don't have an elegant "stringly" suggestion).

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572922#comment-15572922
 ] 

Cody Koeninger edited comment on SPARK-17812 at 10/13/16 8:33 PM:
--

Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're 
dealing with a stream that has multiple topics in it.  Agree we should break 
that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

{noformat}
.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")
{noformat}

where startingOffsetForRealzYo stands in for a more reasonable name that conveys 
it is specifying starting offsets, yet is not confusingly similar to startingOffset

Trying to hack it all into one json as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection.  It also makes it harder to make the format 
consistent with SPARK-17829 (which seems like a good thing to keep consistent, 
I agree)

Obviously I think you should just change the name, but it's your show.






was (Author: c...@koeninger.org):
Sorry, I didn't see this comment until just now.

X offsets back per partition is not a reasonable proxy for time when you're 
dealing with a stream that has multiple topics in it.  Agree we should break 
that out, focus on defining starting offsets in this ticket.

The concern with startingOffsets naming is that, because auto.offset.reset is 
orthogonal to specifying some offsets, you have a situation like this:

.format("kafka")
.option("subscribePattern", "topic.*")
.option("startingOffset", "latest")
.option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 
}, "topicbar" : { "0": 1234, "1": 4567 }}""")

where startingOffsetForRealzYo has a more reasonable name that conveys it is 
specifying starting offsets, yet is not confusingly similar to startingOffset

Trying to hack it all into one json as an alternative, with a "default" topic, 
means you're going to have to pick a key that isn't a valid topic, or add yet 
another layer of indirection.  It also makes it harder to make the format 
consistent with SPARK-17829 (which seems like a good thing to keep consistent, 
I agree)

Obviously I think you should just change the name, but it's your show.





> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer

2016-10-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17834.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Fetch the earliest offsets manually in KafkaSource instead of counting on 
> KafkaConsumer
> ---
>
> Key: SPARK-17834
> URL: https://issues.apache.org/jira/browse/SPARK-17834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17900:


Assignee: Apache Spark  (was: Reynold Xin)

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573090#comment-15573090
 ] 

Apache Spark commented on SPARK-17900:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15469

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable

2016-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17900:


Assignee: Reynold Xin  (was: Apache Spark)

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573089#comment-15573089
 ] 

Cody Koeninger commented on SPARK-17812:


One other slightly ugly thing...

{noformat}
// starting topicpartitions, no explicit offset
.option("assign", """{"topicfoo": [0, 1],"topicbar": [0, 1]}"""

// do you allow specifying with explicit offsets in the same config option? 
// or force it all into startingOffsetForRealzYo?
.option("assign", """{ "topicfoo" :{ "0": 1234, "1": 4567 }, "topicbar" : { 
"0": 1234, "1": 4567 }}""")
{noformat}

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions
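
A minimal sketch of how the two shapes discussed above could look from the reader side, assuming a SparkSession named {{spark}}, the spark-sql-kafka-0-10 source, and the option names floated in this thread ("assign" plus a separate starting-offsets option); the final option names may differ:

{code}
// Hedged sketch: "assign" and "startingOffsets" are the option names under discussion,
// not a confirmed API; topics, partitions, and offsets are illustrative.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  // assignment only: topicpartitions, no explicit offsets
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  // per-partition starting offsets kept in a separate option
  .option("startingOffsets", """{"topicfoo": {"0": 1234, "1": 4567}, "topicbar": {"0": 1234, "1": 4567}}""")
  .load()
{code}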



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-13 Thread Dmytro Bielievtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573056#comment-15573056
 ] 

Dmytro Bielievtsov commented on SPARK-10872:


[~sowen] Can you please give me some pointers in the source code so I can start 
working towards a pull request?

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart spark context 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at 
> 

[jira] [Closed] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-10-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-15369.
---
Resolution: Won't Fix

In the spirit of having more explicit accepts/rejects, and given the 
discussions so far on both this ticket and the GitHub pull request, I'm 
going to close this as Won't Fix for now. We can still continue to discuss the 
merits here, but the rejection is based on the following:

1. Maintenance cost of supporting another runtime.

2. Jython is years behind CPython (or even PyPy) in terms of features.

3. Jython cannot leverage any of the numeric tools available.

(In hindsight maybe PyPy support was also added prematurely.)



> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572976#comment-15572976
 ] 

Alessio edited comment on SPARK-15565 at 10/13/16 7:49 PM:
---

Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I also noticed that SPARK-17918 is a duplicate of 
SPARK-17810, and I'm glad this will be fixed.


was (Author: purple):
Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I also noticed that my issue is a duplicate of SPARK-17810, 
and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
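
For reference, a minimal sketch of pinning the warehouse to the local filesystem from user code while the default lacks an explicit scheme; the application name and path are illustrative placeholders:

{code}
// Hedged sketch: explicitly set spark.sql.warehouse.dir with a file: scheme so the
// warehouse is never resolved against a non-local default filesystem.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-example")     // placeholder name
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")  // placeholder path
  .getOrCreate()
{code}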



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572976#comment-15572976
 ] 

Alessio commented on SPARK-15565:
-

Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". I also noticed that my issue is a duplicate of SPARK-17810, 
and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is  
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to local filesystem.
> This should be a one line change  (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event

2016-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17917:
--
Priority: Minor  (was: Major)

Maybe. I suppose it will be a little tricky to define what the event is here, 
since the event is that something didn't happen. Still, whatever is triggering 
the log might reasonably trigger an event. I don't have a strong feeling on 
this, partly because I'm not sure what the resulting action would be -- kill the job?

> Convert 'Initial job has not accepted any resources..' logWarning to a 
> SparkListener event
> --
>
> Key: SPARK-17917
> URL: https://issues.apache.org/jira/browse/SPARK-17917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Mario Briggs
>Priority: Minor
>
> When supporting Spark on a multi-tenant shared large cluster with quotas per 
> tenant, a submitted taskSet often might not get executors because quotas have 
> been exhausted or resources are unavailable. In these situations, firing a 
> SparkListener event instead of just logging the issue (as done currently at 
> https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192),
>  would give applications/listeners an opportunity to handle this more 
> appropriately as needed.
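
A minimal sketch of what the listener side could look like, assuming a dedicated SparkListenerEvent is introduced for this warning; the event class and its handling below are hypothetical:

{code}
// Hedged sketch: TaskSetStarvedEvent is a hypothetical event type for illustration only.
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

case class TaskSetStarvedEvent(message: String) extends SparkListenerEvent

class StarvationListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case TaskSetStarvedEvent(msg) =>
      // application-specific handling: alert, request more quota, or cancel the job
      println(s"Task set starved: $msg")
    case _ => // ignore other events
  }
}

// registered with the running context, e.g. sc.addSparkListener(new StarvationListener)
{code}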



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the first 
INFO - but that folder is also resolved against HDFS - see the error.

This was fixed in 2.0.0, as previous issues have reported, but appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, especially 
since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the first 
INFO - but that folder is also resolved against HDFS - see the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, especially 
since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO - but that folder is also resolved against HDFS - see the error.
> This was fixed in 2.0.0, as previous issues have reported, but appears again 
> in 2.0.1. Indeed, some scripts I was able to run in 2.0.0 now throw such 
> errors: Spark 2.0.0 used to create the spark-warehouse folder within the 
> current directory (which was good) and didn't complain about such weird 
> paths, especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.
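
A minimal illustration of the mechanism this report points at (an assumption on my part, not actual Spark code): a warehouse path with no scheme is qualified against fs.defaultFS, which is how /user/hive/warehouse becomes hdfs://localhost:9000/user/hive/warehouse when a local Hadoop configuration points defaultFS at HDFS. Setting {{spark.sql.warehouse.dir}} to an explicit file: URI avoids that qualification.

{code}
// Hedged sketch of the path-qualification behaviour; values mirror the error message above.
import java.net.URI
import org.apache.hadoop.fs.Path

val warehouse = new Path("/user/hive/warehouse")  // no scheme
val qualified = warehouse.makeQualified(new URI("hdfs://localhost:9000"), new Path("/"))
println(qualified)  // hdfs://localhost:9000/user/hive/warehouse
{code}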



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2016-10-13 Thread James Norvell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Norvell updated SPARK-17920:
--
Attachment: (was: avro.avsc)

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde, it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> 
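
A minimal sketch of the kind of change the description above implies, assuming the serializer is built roughly as at the linked line in hiveWriterContainers.scala; the method shape and the jobConf parameter are assumptions, not the actual Spark code:

{code}
// Hedged sketch: pass the job's Hadoop configuration instead of null so that
// AvroSerdeUtils can resolve avro.schema.url against a FileSystem.
import org.apache.hadoop.hive.ql.plan.TableDesc
import org.apache.hadoop.hive.serde2.Serializer
import org.apache.hadoop.mapred.JobConf

def newSerializer(jobConf: JobConf, tableDesc: TableDesc): Serializer = {
  val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer]
  serializer.initialize(jobConf, tableDesc.getProperties)  // was: initialize(null, ...)
  serializer
}
{code}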
