[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-03-31 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-606869471
 
 
   @dbtsai That PR still relies on a dot in the column name, as I called out 
above. Not sure why you don't just parse the schema, like I was already doing 
in this PR?
   
   I may rebase, but as we'renon a 2.x fork any dependency on the v2 filters 
shipping in 3.x isn't really useful. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-02-14 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-586352041
 
 
   >so I'm thinking to submit a PR based on our internal version (a modified 
version of #22535) which is proven to be stable and in production for awhile.
   
   My main reservation is that #22535 relies on a dot in the name of the field, 
and so cannot support ORC. The key difference in this PR is that it actually 
inspects the type of the field and extends the same functionality to Parquet 
and ORC. It's also rebased for the current master already and so merges 
cleanly. I also added more tests.
   
   No, I intended to cherry pick this back after it merges, but if it doesn't 
get merge we'll probably end up using it on our fork much like you've done. 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-02-09 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-583896631
 
 
   @dongjoon-hyun @dbtsai pinging again for review; it doesn't seem there is 
any progress on another PR and as @dbtsai pointed out these performance 
improvements can be very helpful for production workloads in their current 
state.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-01-28 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-579320490
 
 
   @dbtsai checking in again -- is there an edge case that you think doesn't 
work here? It would be nice to have updated filters, but seeing as you yourself 
are running code very much like this in a fork, wouldn't the right thing to do 
be to merge it upstream?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-01-15 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-574828934
 
 
   > Hello @emaynardigs ,
   > 
   > Thank you for your contribution, and I do value your work a lot. In fact, 
at Apple, we are still using an updated version of #22535 which is critical to 
our production job. As far as I know, Databirkcs's runtime also has an 
implementation with similar approach to tackle this issue.
   > 
   > The reason why I am inactive on my previous PR is that I feel adding 
nested support to the current filter api is a short term solution since the 
design doesn't consider this complex use-cases. For a better long term 
solution, I would like to create a new set of FilterV2 apis in DSv2 framework 
that makes nested columns as first class support. + @cloud-fan @rdblue @viirya 
for feedback on this.
   > 
   > I already started to work on FilterV2 api, and here is WIP code 
https://github.com/dbtsai/spark/pull/10/files . The change is bigger than I 
thought, and now, I am debating do we actually need a new FilterV2 framework?
   > 
   > Feedback and idea are welcome.
   > 
   > Thanks.
   
   Hey @dbtsai no worries, actually I suspected the silence was because you had 
moved this into a fork and were running with it :)
   
   Actually I think the core approach you took here is sufficient for most 
cases, right? The crux of my change was only porting it to the new APIs and 
looking at the schema itself to unpack nested columns instead of looking at the 
column name (needed this for ORC anyway). Then it was pretty easy to add ORC 
support as we use a fork of ORC internally while you guys use Parquet.
   
   What complex cases do you think break under this PR?
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-01-15 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-574800519
 
 
   @dongjoon-hyun I see this has changes requested, do I need to make any 
changes here?
   
   Or is this just pending review?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-01-11 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-573387721
 
 
   Yes, surely we should close one PR. The other one is inactive, fails, tests, 
and doesn’t merge cleanly. This one has none of those issues and has more 
functionality. 
   
   I don’t mind closing this one out if the other PR can get us to the same 
place just as quickly, but that seems like it would take more work at this 
point? Either way, let’s make sure there is an active PR for this issue which 
can be merged in. As I have no control over any other PR this is my submission 
towards that end.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-01-11 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-573350984
 
 
   > First of all, you must add @dbtsai 's authorship by add a commit with his 
authorship.
   > 
   > The following is not a standard way to keep the authorship.
   > 
   > > Firstly, much of this PR is a rebase of #22535, much thanks to @dbtsai 
for his work.
   > 
   > Second, you need to address all the existing comment in the original PR. 
In the PR description, could you explain what is the improvement here from the 
original PR? If there is nothing new here, we had better close this one and 
asking @dbtsai to update his original PR.
   
   Hi #
   
   > First of all, you must add @dbtsai 's authorship by add a commit with his 
authorship.
   > 
   > The following is not a standard way to keep the authorship.
   > 
   > > Firstly, much of this PR is a rebase of #22535, much thanks to @dbtsai 
for his work.
   > 
   > Second, you need to address all the existing comment in the original PR. 
In the PR description, could you explain what is the improvement here from the 
original PR? If there is nothing new here, we had better close this one and 
asking @dbtsai to update his original PR.
   
   Hi @dongjoon-hyun, thanks for the feedback! Actually, when viewing the 
original PR I was unaware that @dbtsai is a member and, in fact, your 
colleague. Your concern definitely makes sense. 
   
   Firstly, I should say that there are actually no commits here under the 
original author's ownership; the code has diverged to the point now where, 
while I took some ideas & code from the original PR, it was easier to do all of 
this manually. I called this a "rebase", but it is a rebase only in the 
abstract sense that a lot of code was copied and updated for the latest master 
and not in the sense of any source control.
   
   Secondly, I'll try to elaborate on why I opened a new PR...
   
   1. The original PR was abandoned with no activity or reply from the author 
in a year. The original author _was_ asked to update his PR and did not respond.
   2. The original PR was flawed, failed several tests, and included some 
strange choices (like always splitting on a field that contained '.') that I 
did not agree with.
   3. The original PR did not extend to ORC and only worked for Parquet. My 
company uses yet another binary format that the original PR would have been 
incompatible with, but the new one can work with.
   
   As you pointed out, I have addressed the comments in the original PR. I've 
extended the functionality to ORC as well as Parquet, tested the functionality 
myself, and have written more unit tests (largely copied from the original PR) 
and am currently writing more pending the approval of the basic code here. I 
would not say there is nothing new here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields

2020-01-09 Thread GitBox
emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-572729309
 
 
   Yep, failure related, I was mistakenly only testing locally against the v2 
ORC code path before but Jenkins failed on the v1 `OrcFilters`. Updating the v1 
code path and re-testing.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org