[jira] [Created] (SPARK-12121) Remote Spark init.sh called from spark_ec2.py points to incorrect prebuilt image URL

2015-12-03 Thread Andre Schumacher (JIRA)
Andre Schumacher created SPARK-12121:


 Summary: Remote Spark init.sh called from spark_ec2.py points to 
incorrect prebuilt image URL
 Key: SPARK-12121
 URL: https://issues.apache.org/jira/browse/SPARK-12121
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.5.2
Reporter: Andre Schumacher



The tarball for 1.5.2 with Hadoop 1 (the default) has the Scala version
appended to the file name, which leads to spark_ec2.py failing to start up
the cluster. Here is the record from the S3 bucket listing:


Key: spark-1.5.2-bin-hadoop1-scala2.11.tgz
LastModified: 2015-11-10T06:45:17.000Z
ETag: "056fc68e549db27d986da707f19e39c8-4"
Size: 234574403
StorageClass: STANDARD


Maybe one could provide a build without the Scala suffix (the default)?

A workaround is to set the Hadoop version to something other than 1 when 
calling spark-ec2.
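
For example, a sketch of such an invocation (assuming spark-ec2's 
--hadoop-major-version flag; the key pair, identity file, and cluster name 
are placeholders):

{code}
./ec2/spark-ec2 -k <key-pair> -i <key-file.pem> --hadoop-major-version=2 launch my-cluster
{code}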






[jira] [Created] (SPARK-6462) UpdateStateByKey should allow inner join of new with old keys

2015-03-23 Thread Andre Schumacher (JIRA)
Andre Schumacher created SPARK-6462:
---

 Summary: UpdateStateByKey should allow inner join of new with old 
keys
 Key: SPARK-6462
 URL: https://issues.apache.org/jira/browse/SPARK-6462
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Andre Schumacher



In a nutshell: provide an inner join instead of a cogroup for updateStateByKey 
in StateDStream.

Details:

It is common to read data (say weblog data) from a streaming source (say Kafka) 
and on each batch update the state of a relatively small number of keys.

If only the state changes need to be propagated to a downstream sink, then one 
could avoid filtering out unchanged state in the user program and instead 
provide this functionality in the API (say by adding an updateStateChangesByKey 
method).
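
To make the use case concrete, here is a minimal sketch of what a user program 
has to do today with the existing updateStateByKey API: carry a "changed" flag 
next to the state and filter on it downstream (the key and state types are 
only examples):

{code}
import org.apache.spark.streaming.dstream.DStream

// Sketch only: forward a key's state downstream only when this batch changed it.
// `events` is assumed to be a DStream[(String, Long)] of (key, increment) pairs.
def changedCounts(events: DStream[(String, Long)]): DStream[(String, Long)] = {
  events
    .updateStateByKey[(Long, Boolean)] { (increments: Seq[Long], old: Option[(Long, Boolean)]) =>
      val previous = old.map(_._1).getOrElse(0L)
      if (increments.isEmpty) Some((previous, false))   // state retained but unchanged
      else Some((previous + increments.sum, true))      // state updated in this batch
    }
    .filter { case (_, (_, changed)) => changed }       // what an updateStateChangesByKey could do internally
    .mapValues { case (state, _) => state }
}
{code}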

Note that this is related but not identical to:
https://issues.apache.org/jira/browse/SPARK-2629






[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-11-03 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194576#comment-14194576
 ] 

Andre Schumacher commented on SPARK-2620:
-

I also bumped into this issue (on Spark 1.1.0), and it is extremely annoying 
even though it only affects the REPL. Is anybody actively working on resolving 
this? Given that it's already a few months old: are there any blockers to 
making this work? Matei mentioned the way code is wrapped inside the REPL.

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0.
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, which should be equivalent to:
 sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.






[jira] [Commented] (SPARK-2195) Parquet extraMetadata can contain key information

2014-06-20 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038661#comment-14038661
 ] 

Andre Schumacher commented on SPARK-2195:
-

Since commit
https://github.com/apache/spark/commit/f479cf3743e416ee08e62806e1b34aff5998ac22
the path is no longer stored in the extraMetadata. So I guess this issue can be 
closed?

 Parquet extraMetadata can contain key information
 -

 Key: SPARK-2195
 URL: https://issues.apache.org/jira/browse/SPARK-2195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Priority: Blocker

 {code}
 14/06/19 01:52:05 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: 
 file:/Users/pat/Projects/spark-summit-training-2014/usb/data/wiki-parquet/part-r-1.parquet
  start: 0 length: 24971040 hosts: [localhost] blocks: 1 requestedSchema: same 
 as file fileSchema: message root {
   optional int32 id;
   optional binary title;
   optional int64 modified;
   optional binary text;
   optional binary username;
 }
  extraMetadata: 
 {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
  StructField(title,StringType,true), StructField(modified,LongType,true), 
 StructField(text,StringType,true), StructField(username,StringType,true))), 
 path= MY AWS KEYS!!! } 
 readSupportMetadata: 
 {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true),
  StructField(title,StringType,true), StructField(modified,LongType,true), 
 StructField(text,StringType,true), StructField(username,StringType,true))), 
 path= MY AWS KEYS 
 ***}}
 {code}





[jira] [Commented] (SPARK-2112) ParquetTypesConverter should not create its own conf

2014-06-20 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038666#comment-14038666
 ] 

Andre Schumacher commented on SPARK-2112:
-

Since commit
https://github.com/apache/spark/commit/f479cf3743e416ee08e62806e1b34aff5998ac22
the SparkContext's Hadoop configuration should be used when reading metadata 
from the file source. I wasn't yet able to test this with, say, S3 bucket names.

Are the S3 credentials copied from the SparkConf to its Hadoop configuration? 
If someone could confirm that this works, we could close this issue.
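
As a rough way to check (a sketch, assuming the spark.hadoop.* property prefix 
is copied into the Hadoop configuration; the key values are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: set S3 credentials on the SparkConf and verify they reach the
// SparkContext's Hadoop configuration, which the Parquet metadata reader should now use.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("s3-credential-check")
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", "<access-key>")
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "<secret-key>")
val sc = new SparkContext(conf)
println(sc.hadoopConfiguration.get("fs.s3n.awsAccessKeyId"))  // should print the value set above
{code}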

 ParquetTypesConverter should not create its own conf
 

 Key: SPARK-2112
 URL: https://issues.apache.org/jira/browse/SPARK-2112
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust

 [~adav]: this actually makes it so that we can't use S3 credentials set in 
 the SparkContext, or add new FileSystems at runtime, for instance.





[jira] [Commented] (SPARK-1982) saveToParquetFile doesn't support ByteType

2014-06-01 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014945#comment-14014945
 ] 

Andre Schumacher commented on SPARK-1982:
-

It turns out that ByteType primitive types weren't treated correctly earlier. 
Since Parquet doesn't have these, one fix is to use fixed-length byte arrays 
(which are also treated as primitives). This is fine until there is support 
for nested types; even then I think one may want to treat these as actual 
arrays and not primitives.

Anyway, a PR is available here: https://github.com/apache/spark/pull/934
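
For reference, a minimal reproduction sketch (Spark SQL 1.0-era API; the case 
class and output path are made up):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion from an RDD of case classes to a SchemaRDD

case class Flags(id: Int, flag: Byte)  // the Byte field maps to ByteType
val flags = sc.parallelize(Seq(Flags(1, 1.toByte), Flags(2, 0.toByte)))
flags.saveAsParquetFile("/tmp/flags.parquet")  // previously failed with "Unsupported datatype ByteType"
{code}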

 saveToParquetFile doesn't support ByteType
 --

 Key: SPARK-1982
 URL: https://issues.apache.org/jira/browse/SPARK-1982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Andre Schumacher

 {code}
 java.lang.RuntimeException: Unsupported datatype ByteType
   at scala.sys.package$.error(package.scala:27)
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
 ...
 {code}





[jira] [Commented] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-24 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008193#comment-14008193
 ] 

Andre Schumacher commented on SPARK-1913:
-

Sigh.. can't believe I missed this when adding the original tests. Good that 
there is a fix now.

 Parquet table column pruning error caused by filter pushdown
 

 Key: SPARK-1913
 URL: https://issues.apache.org/jira/browse/SPARK-1913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
 Environment: mac os 10.9.2
Reporter: Chen Chao

 When scanning Parquet tables, attributes referenced only in predicates that 
 are pushed down are not passed to the `ParquetTableScan` operator, which 
 causes an exception. Verified in the {{sbt hive/console}}:
 {code}
 loadTestTable("src")
 table("src").saveAsParquetFile("src.parquet")
 parquetFile("src.parquet").registerAsTable("src_parquet")
 hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
 {code}
 Exception
 {code}
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.IllegalArgumentException: Column key does not exist.
   at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51)
   at 
 org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306)
   at parquet.io.FilteredRecordReader.init(FilteredRecordReader.java:46)
   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
   ... 28 more
 {code}





[jira] [Updated] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-24 Thread Andre Schumacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Schumacher updated SPARK-1913:


Affects Version/s: (was: 1.0.0)
   1.1.0

 Parquet table column pruning error caused by filter pushdown
 

 Key: SPARK-1913
 URL: https://issues.apache.org/jira/browse/SPARK-1913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: mac os 10.9.2
Reporter: Chen Chao

 When scanning Parquet tables, attributes referenced only in predicates that 
 are pushed down are not passed to the `ParquetTableScan` operator, which 
 causes an exception. Verified in the {{sbt hive/console}}:
 {code}
 loadTestTable("src")
 table("src").saveAsParquetFile("src.parquet")
 parquetFile("src.parquet").registerAsTable("src_parquet")
 hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
 {code}
 Exception
 {code}
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.IllegalArgumentException: Column key does not exist.
   at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51)
   at 
 org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306)
   at parquet.io.FilteredRecordReader.init(FilteredRecordReader.java:46)
   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
   ... 28 more
 {code}





[jira] [Updated] (SPARK-1487) Support record filtering via predicate pushdown in Parquet

2014-05-24 Thread Andre Schumacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Schumacher updated SPARK-1487:


Affects Version/s: (was: 1.0.0)
   1.1.0

 Support record filtering via predicate pushdown in Parquet
 --

 Key: SPARK-1487
 URL: https://issues.apache.org/jira/browse/SPARK-1487
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Andre Schumacher
Assignee: Andre Schumacher
 Fix For: 1.1.0


 Parquet has support for column filters, which can be used to avoid reading 
 and de-serializing records that fail the column filter condition. This can 
 lead to potentially large savings, depending on the number of columns 
 filtered by and how many records actually pass the filter.
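
For illustration, a query of the kind that benefits (a sketch using the 1.1-era 
SQLContext API; the file path, table, and column names are made up):

{code}
// With predicate pushdown, records failing "age > 30" can be skipped inside the
// Parquet reader instead of being de-serialized and then filtered in Spark.
val people = sqlContext.parquetFile("people.parquet")
people.registerAsTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").collect()
{code}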





[jira] [Closed] (SPARK-1487) Support record filtering via predicate pushdown in Parquet

2014-05-24 Thread Andre Schumacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Schumacher closed SPARK-1487.
---

Resolution: Fixed

 Support record filtering via predicate pushdown in Parquet
 --

 Key: SPARK-1487
 URL: https://issues.apache.org/jira/browse/SPARK-1487
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Andre Schumacher
Assignee: Andre Schumacher
 Fix For: 1.1.0


 Parquet has support for column filters, which can be used to avoid reading 
 and de-serializing records that fail the column filter condition. This can 
 lead to potentially large savings, depending on the number of columns 
 filtered by and how many records actually pass the filter.





[jira] [Commented] (SPARK-1649) DataType should contain nullable bit

2014-04-28 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983963#comment-13983963
 ] 

Andre Schumacher commented on SPARK-1649:
-

OK, I now understand that this would be a bigger change.

It's not just struct fields of nested types, but also array element types, map 
value types, etc. IMHO it would be cleaner to have it inside the DataType. But 
since this seems to be relevant mostly for nested types, could one have a 
special DataType for them, something like NestedDataType(val nullable: Boolean) 
extends DataType?
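
A toy sketch of that shape, just to make the idea concrete (these are not 
Spark's actual classes):

{code}
// Illustration only: nullability attached to nested types, leaf types left untouched.
abstract class DataType
abstract class NestedDataType(val nullable: Boolean = true) extends DataType

case object IntType extends DataType                        // leaf type: no nullability bit
case class ArrayType(elementType: DataType, containsNull: Boolean = true)
  extends NestedDataType(containsNull)                      // nested type: carries nullability
case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean = true)
  extends NestedDataType(valueContainsNull)
{code}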

 DataType should contain nullable bit
 

 Key: SPARK-1649
 URL: https://issues.apache.org/jira/browse/SPARK-1649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Andre Schumacher
Priority: Critical

 For the underlying storage layer it would simplify things such as schema 
 conversions and predicate filter determination to record in the data 
 type itself whether a column is nullable. So the DataType type could look 
 like this:
 abstract class DataType(nullable: Boolean = true)
 Concrete subclasses could then override the nullable val. Mostly this could 
 be left as the default, but when types are contained in nested types one 
 could optimize for, e.g., arrays with elements that are nullable and those 
 that are not.





[jira] [Commented] (SPARK-1649) DataType should contain nullable bit

2014-04-28 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983985#comment-13983985
 ] 

Andre Schumacher commented on SPARK-1649:
-

Thinking about it a bit longer.. could Nullable maybe be a mixin? But what 
should the default be, nullable or not nullable?

 DataType should contain nullable bit
 

 Key: SPARK-1649
 URL: https://issues.apache.org/jira/browse/SPARK-1649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Andre Schumacher
Priority: Critical

 For the underlying storage layer it would simplify things such as schema 
 conversions and predicate filter determination to record in the data 
 type itself whether a column is nullable. So the DataType type could look 
 like this:
 abstract class DataType(nullable: Boolean = true)
 Concrete subclasses could then override the nullable val. Mostly this could 
 be left as the default, but when types are contained in nested types one 
 could optimize for, e.g., arrays with elements that are nullable and those 
 that are not.





[jira] [Created] (SPARK-1649) DataType should contain nullable bit

2014-04-27 Thread Andre Schumacher (JIRA)
Andre Schumacher created SPARK-1649:
---

 Summary: DataType should contain nullable bit
 Key: SPARK-1649
 URL: https://issues.apache.org/jira/browse/SPARK-1649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Andre Schumacher
Priority: Critical


For the underlying storage layer it would simplify things such as schema 
conversions and predicate filter determination to record in the data type 
itself whether a column is nullable. So the DataType type could look like this:

abstract class DataType(nullable: Boolean = true)

Concrete subclasses could then override the nullable val. Mostly this could be 
left as the default, but when types are contained in nested types one could 
optimize for, e.g., arrays with elements that are nullable and those that are 
not.





[jira] [Commented] (SPARK-1293) Support for reading/writing complex types in Parquet

2014-04-14 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968332#comment-13968332
 ] 

Andre Schumacher commented on SPARK-1293:
-

There is a preliminary pull request now: 
https://github.com/apache/spark/pull/360

 Support for reading/writing complex types in Parquet
 

 Key: SPARK-1293
 URL: https://issues.apache.org/jira/browse/SPARK-1293
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Andre Schumacher
 Fix For: 1.1.0


 Complex types include: Arrays, Maps, and Nested rows (structs).
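
As an illustration, a schema that exercises all three (a sketch; the case 
classes are made up):

{code}
// Nested row (struct)
case class Address(street: String, city: String)

// A record combining the three complex types
case class Person(
  name: String,
  phones: Seq[String],                 // array
  attributes: Map[String, String],     // map
  address: Address)                    // nested row
{code}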





[jira] [Resolved] (SPARK-1383) Spark-SQL: ParquetRelation improvements

2014-04-04 Thread Andre Schumacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Schumacher resolved SPARK-1383.
-

Resolution: Fixed

Fixed by 
https://github.com/apache/spark/commit/fbebaedf26286ee8a75065822a3af1148351f828

 Spark-SQL: ParquetRelation improvements
 ---

 Key: SPARK-1383
 URL: https://issues.apache.org/jira/browse/SPARK-1383
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andre Schumacher
Assignee: Andre Schumacher

 Improve Spark-SQL's ParquetRelation as follows:
 - Instead of individual files, a ParquetRelation should be backed by a 
 directory, which simplifies importing data from other sources
 - The InsertIntoParquetTable operation should support switching between 
 overwriting and appending (at least in HiveQL; see the sketch after this list)
 - Tests should use the new API
 - Parquet logging should be forwarded to Log4J
 - It should be possible to enable compression (default compression for 
 Parquet files: GZIP, as in parquet-mr)
 - OverwriteCatalog should support dropping of tables
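
A sketch of the two HiveQL insert modes mentioned above (table names are 
illustrative; assumes a HiveContext with Parquet-backed tables):

{code}
// INSERT OVERWRITE replaces the destination's contents; INSERT INTO appends to them.
hql("INSERT OVERWRITE TABLE parquet_dest SELECT * FROM staging")
hql("INSERT INTO TABLE parquet_dest SELECT * FROM staging")
{code}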





[jira] [Assigned] (SPARK-1367) NPE when joining Parquet Relations

2014-04-02 Thread Andre Schumacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Schumacher reassigned SPARK-1367:
---

Assignee: Andre Schumacher  (was: Michael Armbrust)

 NPE when joining Parquet Relations
 --

 Key: SPARK-1367
 URL: https://issues.apache.org/jira/browse/SPARK-1367
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Andre Schumacher
Priority: Blocker
 Fix For: 1.0.0


 {code}
 test("self-join parquet files") {
   val x = ParquetTestData.testData.subquery('x)
   val y = ParquetTestData.testData.newInstance.subquery('y)
   val query = x.join(y).where("x.myint".attr === "y.myint".attr)
   query.collect()
 }
 {code}


