[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2018-02-14 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364802#comment-16364802
 ] 

James Taylor commented on PHOENIX-4344:
---

Yes, you’re right - that’s one of the limitations for indexes on views - the 
DML must be done on the leaf views. If you do that, everything will just work 
(famous last words :-) ).

> MapReduce Delete Support
> 
>
> Key: PHOENIX-4344
> URL: https://issues.apache.org/jira/browse/PHOENIX-4344
> Project: Phoenix
> Issue Type: New Feature
> Affects Versions: 4.12.0
> Reporter: Geoffrey Jacoby
> Assignee: Geoffrey Jacoby
> Priority: Major
>
> Phoenix already has the ability to use MapReduce for asynchronous handling of 
> long-running SELECTs. It would be really useful to have this capability for 
> long-running DELETEs, particularly of tables with indexes where using HBase's 
> own MapReduce integration would be prohibitively complicated. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2018-02-14 Thread Geoffrey Jacoby (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364797#comment-16364797
 ] 

Geoffrey Jacoby commented on PHOENIX-4344:
--

[~jamestaylor] - if I remember right, normal Phoenix deletes already have an 
issue where deleting from a base table won't delete from the views – you have 
to delete from the view to get it to "do the right thing". Given that, would it 
be OK to require the user to use the view name in a DELETE MapReduce query if 
they want the view and its indexes to be updated? 

This could be changed in the future if Phoenix deletes get smarter about 
finding and deleting from child views/indexes. 

For the particular use case that [~akshita.malhotra] and I have in mind for 
this feature, the users will definitely know the views they want to delete 
from. 



[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2018-02-13 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363598#comment-16363598
 ] 

James Taylor commented on PHOENIX-4344:
---

Phoenix will do a point delete (i.e. the Phoenix client will issue an HBase 
Delete with the full row key) because it thinks it has values for all the 
columns that make up the primary key of the base table. In this case, it 
doesn't need to issue a scan at all. The problem is, Phoenix doesn't know that 
there are derived views that have extended the PK.

One solution would be to have a declaration on the base table that it would 
never be used to upsert data directly. Something like declaring it ABSTRACT. In 
that case, if you deleted from it, Phoenix could know to issue a scan instead 
of trying to optimize it as a point delete.

Another solution would be to issue the delete statement against the view in the 
MR job. Since the view has extended the PK, Phoenix wouldn't issue a point 
delete, but would issue a scan. That might not be feasible, though, as it'd be 
tricky to know all the views.
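
The point-delete decision described above can be illustrated with a toy model.
This is only a sketch of the reasoning, not Phoenix's actual optimizer code:
the class name, the method name, and the column names K1 and K2 are all
hypothetical. The idea is that a delete looks like a point delete when every
PK column known to the compiler is bound by equality, so the same WHERE clause
can look "complete" against the base table but not against a view that has
extended the PK.

```java
import java.util.Set;

// Toy model (hypothetical, not Phoenix code): a DELETE is treated as a point
// delete when every PK column the compiler knows about is bound by equality.
final class PointDeleteCheckSketch {
    static boolean looksLikePointDelete(Set<String> pkColumns,
                                        Set<String> equalityBoundColumns) {
        // True only if no PK column is left unbound.
        return equalityBoundColumns.containsAll(pkColumns);
    }
}
```

With a base-table PK of (K1) and a view PK of (K1, K2), a predicate binding
only K1 passes the check for the base table and fails it for the view, which
is why issuing the statement against the view would lead to a scan instead.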



[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2018-02-13 Thread Akshita Malhotra (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363286#comment-16363286
 ] 

Akshita Malhotra commented on PHOENIX-4344:
---

[~jamestaylor] Can you explain why it would do a point scan? Maybe I am 
thinking in the wrong direction, but as [~gjacoby] explained, even if the 
initial delete filters on a non-PK column, when a point Phoenix delete query 
is issued, I can provide the PK information (obtained from the MapReduce 
scan) along with the extra predicate that includes the non-PK column. 



[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2017-11-02 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236914#comment-16236914
 ] 

James Taylor commented on PHOENIX-4344:
---

I see - yes, you're right - that would work. It'd do a point scan for each row 
if there was a non-PK column, as it'd need to look up that value to maintain 
the index. It'd work, it'd just be slow.



[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2017-11-02 Thread Geoffrey Jacoby (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236907#comment-16236907
 ] 

Geoffrey Jacoby commented on PHOENIX-4344:
--

I don't see how Option 1 is problematic for indexes on non-PK columns, because 
it's internally using the Phoenix JDBC API and so going through all the same 
index-handling logic that a point-delete query issued from outside MapReduce 
would be doing. 

Let's say that I have a table ENTITY_HISTORY with a compound primary key (Key1, 
Key2). 

I create my MapReduce job with a query like "DELETE FROM ENTITY_HISTORY WHERE 
Key1 > 'aaa'"

That delete would be converted to a select, and the MapReduce job would iterate 
row by row over the result set. For each row, a new Delete query would be built 
using that row's PK, e.g. "DELETE FROM ENTITY_HISTORY WHERE Key1 = 'foo' AND 
Key2 = 'bar'", and executed using a PhoenixConnection (probably with some kind 
of commit batching).
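
A minimal sketch of that per-row loop, assuming plain JDBC rather than any
Phoenix-specific API: the class name, method names, and the batchSize
parameter are hypothetical, and the rows are assumed to arrive as lists of PK
values in PK column order. It uses bind parameters instead of inlined
literals.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Phoenix: one point DELETE per row,
// with a commit every batchSize rows.
final class PointDeleteSketch {
    /** Builds "DELETE FROM t WHERE k1 = ? AND k2 = ?" for the given PK columns. */
    static String buildPointDelete(String table, List<String> pkColumns) {
        if (pkColumns.isEmpty()) {
            throw new IllegalArgumentException("at least one PK column is required");
        }
        String where = pkColumns.stream()
                .map(c -> c + " = ?")
                .collect(Collectors.joining(" AND "));
        return "DELETE FROM " + table + " WHERE " + where;
    }

    /** Executes one point delete per row of PK values, committing in batches. */
    static void deleteRows(Connection conn, String table, List<String> pkColumns,
                           Iterable<List<Object>> rows, int batchSize) throws SQLException {
        conn.setAutoCommit(false);
        String sql = buildPointDelete(table, pkColumns);
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            int pending = 0;
            for (List<Object> pkValues : rows) {
                // Bind each PK value to its positional parameter (1-based).
                for (int i = 0; i < pkValues.size(); i++) {
                    stmt.setObject(i + 1, pkValues.get(i));
                }
                stmt.executeUpdate();
                if (++pending >= batchSize) {
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {
                conn.commit(); // flush the final partial batch
            }
        }
    }
}
```

For the ENTITY_HISTORY example, buildPointDelete("ENTITY_HISTORY", [Key1, Key2])
yields "DELETE FROM ENTITY_HISTORY WHERE Key1 = ? AND Key2 = ?". Because the
statement goes through the normal JDBC path, index maintenance is left to the
driver, which is the correctness argument made above.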

I'm somewhat concerned about the perf, but the correctness seems sound to me -- 
am I missing an issue? 



[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2017-11-02 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236856#comment-16236856
 ] 

James Taylor commented on PHOENIX-4344:
---

I'd go with Option #2. Option #1 will be problematic for tables with indexes on 
non-PK columns. If you can tack on the correct RVC (or perhaps dip below the 
Phoenix API and set the start/stop row of the Scan) based on the info in the 
QueryPlan, then the delete logic will all be handled completely by 
DeleteCompiler. You just need to grab the mutations using 
PhoenixRuntime.getUncommittedDataIterator(). You might just use 
FormatToBytesWritableMapper for inspiration/code borrowing.
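
As a rough illustration of "tacking on the RVC", the sketch below appends a
row value constructor range to a DELETE so that it only covers one scan's key
range. The class and method names are hypothetical, and it assumes the scan's
start/stop rows can be expressed as values for the full PK, which may not hold
for every scan boundary. The lower bound is inclusive and the upper bound
exclusive, mirroring HBase start/stop row semantics. Grabbing the resulting
mutations is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Phoenix: restricts a DELETE to a PK range
// using row value constructors with bind parameters.
final class RvcBoundSketch {
    static String boundedDelete(String table, String predicate, List<String> pkColumns,
                                boolean hasStartKey, boolean hasStopKey) {
        String columns = "(" + String.join(", ", pkColumns) + ")";
        String binds = "(" + String.join(", ",
                Collections.nCopies(pkColumns.size(), "?")) + ")";
        List<String> conditions = new ArrayList<>();
        if (predicate != null && !predicate.isEmpty()) {
            // Parenthesize the user's predicate so an OR inside it cannot
            // escape the range restriction.
            conditions.add("(" + predicate + ")");
        }
        if (hasStartKey) {
            conditions.add(columns + " >= " + binds); // inclusive start
        }
        if (hasStopKey) {
            conditions.add(columns + " < " + binds);  // exclusive stop
        }
        String sql = "DELETE FROM " + table;
        return conditions.isEmpty() ? sql : sql + " WHERE " + String.join(" AND ", conditions);
    }
}
```

The first and last scans of a table typically have an open start or stop key,
which is why each bound is optional in this sketch.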





[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2017-11-02 Thread Geoffrey Jacoby (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236759#comment-16236759
 ] 

Geoffrey Jacoby commented on PHOENIX-4344:
--

Some thoughts, [~jamestaylor]

I want this to be usable for generic DELETE queries without the need for 
hand-written DBWritable subclasses.

MapReduce goes line by line, rather than by Mapper Task/Scan, so while the 
client would be issuing a broad DELETE query, the mapper itself would either be:

1. Issuing point DELETE Phoenix queries by the complete primary key derived 
from a SELECT the MapReduce is iterating over 
(Mapper)
OR
2. Issuing DELETE mutations down to several HTables via MultiHfileOutputFormat 
from a DELETE the MapReduce is iterating over
(Mapper)

FormatToBytesWritableMapper relies heavily on a LineParser interface, and the 
only choices appear to be CsvLineParser, JsonLineParser, and RegexLineParser. 
That means that in either case the complete row key would have to be built by a 
new ResultSetLineParser that can take in a ResultSet and parse it into an 
intermediate form suitable for making either DELETE DML (Option 1) or Delete 
Mutations (Option 2). The former would just need to grab the row key 
components, while the latter would potentially need everything, because an 
index can be on any column. 
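
The difference in what each option has to read can be sketched as the SELECT
the job would iterate over. This is a hypothetical illustration only: the
class, enum, and method names are made up, and the list of indexed columns is
assumed to be supplied by the caller rather than looked up from Phoenix
metadata.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Phoenix: picks the columns the MapReduce
// job would need to read for each option.
final class DeleteToSelectSketch {
    enum Strategy { POINT_DELETE_DML, DELETE_MUTATIONS }

    static String toSelect(String table, String predicate, List<String> pkColumns,
                           List<String> indexedColumns, Strategy strategy) {
        // Option 1 only needs the row key components.
        Set<String> columns = new LinkedHashSet<>(pkColumns);
        if (strategy == Strategy.DELETE_MUTATIONS) {
            // Option 2 also needs every column that an index might cover.
            columns.addAll(indexedColumns);
        }
        String sql = "SELECT " + String.join(", ", columns) + " FROM " + table;
        return predicate == null || predicate.isEmpty() ? sql : sql + " WHERE " + predicate;
    }
}
```

For a table with PK (Key1, Key2) and an index on a made-up column Val1,
Option 1 would read only Key1 and Key2, while Option 2 would read Val1 as
well.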

Also either way, we need a concrete generalized subclass of the abstract 
DBWritable. 

Option 1 seems considerably simpler/higher level, while Option 2 seems more 
efficient.



[jira] [Commented] (PHOENIX-4344) MapReduce Delete Support

2017-11-01 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235025#comment-16235025
 ] 

James Taylor commented on PHOENIX-4344:
---

Here's a possible way to proceed with this:
- In PhoenixInputFormat, we drive things based on a QueryPlan. I think the 
first thing we'll need is PHOENIX-4342 - providing a way of getting the 
underlying QueryPlan from a MutationPlan (which is what you get when you 
compile a DELETE statement).
- Create a different implementation of PhoenixInputFormat.getQueryPlan() that 
compiles the DELETE statement and gets the QueryPlan from the MutationPlan.
- Keep the same logic that ends up setting up one mapper per scan in the 
QueryPlan
- Instead of executing each individual scan, you'd want to execute a DELETE 
statement bounded by the start/stop key of each scan
- Execute code just like FormatToBytesWritableMapper to put together the list 
of Delete mutations
- Make sure we've got the write-to-multiple HTables working correctly (I 
believe MultiHfileOutputFormat does that)


