[GitHub] spark issue #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-16 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/spark/pull/20511
  
I'm frustrated with the direction this has gone.

The new reader is much better than the old reader, which uses Hive 1.2. ORC 
1.4.3 had a pair of important, but neither large nor complex, fixes. Yet, because 
of those fixes, the entire new reader is now being disabled by default in the 
upcoming Spark 2.3.

In particular, the Hive 1.2 ORC code has the following known problems:
* HIVE-11312 - Char predicate pushdown can filter out all rows.
* HIVE-13083 - Decimal columns can incorrectly suppress the isNonNull stream.
* ORC-101 - Predicate pushdown on bloom filters uses the default charset rather 
than UTF-8.
* ORC-135 - Predicate pushdown on timestamps doesn't correct for time zones.
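Of these, ORC-101 is the easiest to reproduce outside ORC: `String.getBytes()` 
with no argument uses the platform default charset, so hash bytes computed that 
way need not match UTF-8 bytes. A minimal JVM-only sketch (not ORC code; the 
bloom-filter plumbing is elided):

```scala
import java.nio.charset.StandardCharsets

// Encode the same string two ways, as writer and reader could before ORC-101.
val s = "café"
val utf8 = s.getBytes(StandardCharsets.UTF_8)        // the intended encoding
val latin1 = s.getBytes(StandardCharsets.ISO_8859_1) // a "default" charset

// The byte sequences differ, so a bloom filter built from one encoding
// can report "definitely absent" for keys hashed from the other.
println(utf8.length)   // 5: 'é' is two bytes in UTF-8
println(latin1.length) // 4: 'é' is one byte in ISO-8859-1
println(java.util.Arrays.equals(utf8, latin1))
```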


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20511: [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3

2018-02-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20511#discussion_r167950837
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -160,6 +160,16 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
   }
 }
   }
+
+  // This is a test case for ORC-285
--- End diff --

After a bit of digging, the problem that ORC-285 fixes was introduced in 
HIVE-9711. HIVE-9711 is included in Hive 1.2.0, but doesn't create a problem 
unless the vectorized ORC reader is used. That didn't become the default 
behavior until HIVE-11417, which was released in Hive 2.1.0. So Spark's other 
readers are fine as long as they don't use the vectorized reader.
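For context, which ORC reader Spark uses is configurable; a hedged 
spark-defaults.conf fragment (option names as introduced around Spark 2.3 -- 
verify against your release):

```
# "native" selects the new ORC 1.4 reader, "hive" the old Hive 1.2.1 path
spark.sql.orc.impl                    native
# the vectorized code path discussed above only applies to the native reader
spark.sql.orc.enableVectorizedReader  true
```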


---




[GitHub] spark issue #20511: [SPARK-23340][BUILD] Update ORC to 1.4.2

2018-02-10 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/spark/pull/20511
  
Sorry, I forgot to transition the jira issues for the ORC 1.4.3 release, so 
they didn't show up in the search for the release notes.

The list of jiras closed by the 1.4.3 release is: https://s.apache.org/Fll8

There was an issue with the reader if you had an empty column of 
floats/doubles (ORC-285), and a compression issue that only seemed to hit LLAP 
(ORC-296).

We are about to start the ORC 1.5 release, but the ORC 1.4 release has been 
very stable.


---




[GitHub] spark pull request #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0

2017-08-15 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18640#discussion_r133248648
  
--- Diff: sql/core/pom.xml ---
@@ -87,6 +87,16 @@
+    <dependency>
+      <groupId>org.apache.orc</groupId>
+      <artifactId>orc-core</artifactId>
+      <classifier>${orc.classifier}</classifier>
--- End diff --

@rxin Storage-API is a separately released artifact from the Hive project. 
Basically, Storage-API is the in-memory format for Hive's vectorization. You 
could draw the analogy that Storage-API is to Hive what Arrow is to Drill. It 
allows formats to read and write directly in the format that is needed by the 
execution engine.

With the nohive classifier, ORC shades the storage-api jar into the ORC 
namespace so that it is compatible with any version of Hive.
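For illustration, depending on the shaded artifact looks roughly like this in 
a consumer's pom.xml (coordinates follow the ORC 1.4.x release conventions; 
treat the exact version as a placeholder):

```xml
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>1.4.0</version>
  <!-- "nohive" selects the build in which storage-api is shaded
       into the ORC namespace, avoiding conflicts with any Hive on
       the classpath -->
  <classifier>nohive</classifier>
</dependency>
```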


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0

2017-08-08 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/spark/pull/18640
  
I would also comment that, in the long term, Spark should move to using the 
vectorized reader in ORC's core. That would remove the dependence on ORC's 
mapreduce module, which provides row-by-row shims on top of the vectorized 
reader.
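The row-by-row shims mentioned above can be pictured as an iterator adapter 
over column batches. A simplified sketch with hypothetical `Batch`/`Row` types 
(not the actual orc-mapreduce API):

```scala
// Hypothetical stand-ins for ORC's vectorized batches: column-major data.
case class Batch(cols: Vector[Vector[Int]]) {
  def numRows: Int = if (cols.isEmpty) 0 else cols.head.length
}
case class Row(values: List[Int])

// The shim: flatten a stream of column-major batches into row objects,
// which is roughly what orc-mapreduce layers on the vectorized core reader.
def rowIterator(batches: Iterator[Batch]): Iterator[Row] =
  batches.flatMap { b =>
    (0 until b.numRows).iterator.map(i => Row(b.cols.map(_(i)).toList))
  }

// Two columns, two rows in a single batch.
val rows = rowIterator(Iterator(Batch(Vector(Vector(1, 2), Vector(10, 20)))))
rows.foreach(println)   // Row(List(1, 10)) then Row(List(2, 20))
```

Consuming the vectorized reader directly would skip this per-row 
materialization entirely.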


---



[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0

2017-08-07 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/spark/pull/18640
  
@rxin The ORC core library's dependency tree is aggressively kept as small 
as possible. I've gone through and excluded unnecessary jars from our 
dependencies. I also kick back pull requests that add unnecessary new 
dependencies.


---



[GitHub] spark issue #13257: [SPARK-15474][SQL]ORC data source fails to write and rea...

2017-03-01 Thread omalley
Github user omalley commented on the issue:

https://github.com/apache/spark/pull/13257
  
Ok, I see the problem. Hive's OrcInputFormat has that property, because it 
was getting the schema from the ObjectInspector, which only came with the 
values. When I get a chance, let me look at what would be required to have you 
guys use the ORC project APIs directly.


---