[jira] [Work logged] (GOBBLIN-865) Add feature that enables PK-chunking in partition
[ https://issues.apache.org/jira/browse/GOBBLIN-865?focusedWorklogId=305999&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305999 ]

ASF GitHub Bot logged work on GOBBLIN-865:
--
Author: ASF GitHub Bot
Created on: 04/Sep/19 01:04
Start Date: 04/Sep/19 01:04
Worklog Time Spent: 10m
Work Description: arekusuri commented on pull request #2722: GOBBLIN-865: Add feature that enables PK-chunking in partition
URL: https://github.com/apache/incubator-gobblin/pull/2722#discussion_r319713269

##
File path: gobblin-salesforce/src/main/java/org/apache/gobblin/salesforce/SalesforceSource.java
##
@@ -146,6 +156,95 @@ protected void addLineageSourceInfo(SourceState sourceState, SourceEntity entity
   @Override
   protected List generateWorkUnits(SourceEntity sourceEntity, SourceState state, long previousWatermark) {
+    String partitionType = state.getProp(PARTITION_TYPE, "PK_CHUNKING");
+    if (partitionType.equals("PK_CHUNKING")) {
+      return generateWorkUnitsPkChunking(sourceEntity, state, previousWatermark);
+    } else {
+      return generateWorkUnitsStrategy(sourceEntity, state, previousWatermark);
+    }
+  }
+
+  /**
+   * generate workUnit with noQuery=true
+   */
+  private List generateWorkUnitsPkChunking(SourceEntity sourceEntity, SourceState state, long previousWatermark) {
+    List batchIdAndResultIds = executeQueryWithPkChunking(state, previousWatermark);
+    List ret = createWorkUnits(sourceEntity, state, batchIdAndResultIds);
+    return ret;
+  }
+
+  private List executeQueryWithPkChunking(
+      SourceState sourceState,
+      long previousWatermark
+  ) throws RuntimeException {
+    Properties commonProperties = sourceState.getCommonProperties();
+    Properties specProperties = sourceState.getSpecProperties();
+    State state = new State();
+    state.setProps(commonProperties, specProperties);
+    WorkUnit workUnit = WorkUnit.createEmpty();
+    try {
+      WorkUnitState workUnitState = new WorkUnitState(workUnit, state);
+      workUnitState.setId("test" + new Random().nextInt());
+      workUnitState.setProp(ENABLE_PK_CHUNKING_KEY, true); // set extractor enable pk chunking
+      int chunkSize = workUnitState.getPropAsInt(PARTITION_PK_CHUNKING_SIZE, DEFAULT_PK_CHUNKING_SIZE);

Review comment:
I was thinking we might want to keep 2nd-level PK-chunking and have a separate property for it.
As we discussed, 2nd-level PK-chunking doesn't make sense, so I will remove this property.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 305999)
Time Spent: 1h 50m (was: 1h 40m)

> Add feature that enables PK-chunking in partition
> --
>
> Key: GOBBLIN-865
> URL: https://issues.apache.org/jira/browse/GOBBLIN-865
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Alex Li
> Priority: Major
> Labels: salesforce
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> In the SFDC (Salesforce) connector, we have partitioning mechanisms that split a giant query into multiple sub-queries. There are 3 mechanisms:
> * simple partition (equally split by time)
> * dynamic pre-partition (generate histogram and split by row numbers)
> * user specified partition (set up time range in job file)
> However, tables like Task and Contract fail from time to time to fetch the full data.
> We may want to utilize PK-chunking to partition the query.
>
> The PK-chunking doc from SFDC -
> [https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/async_api_headers_enable_pk_chunking.htm]

--
This message was sent by Atlassian Jira (v8.3.2#803003)
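For illustration only, the "simple partition (equally split by time)" mechanism listed in the issue description could be sketched as follows. This is a minimal sketch and not Gobblin's actual partitioner: the class name SimpleTimePartitioner, the split method, and the use of epoch milliseconds as bounds are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Gobblin): splits [start, end) into equal time slices. */
final class SimpleTimePartitioner {

  /** Returns {low, high} pairs that cover [startMillis, endMillis) without gaps or overlaps. */
  static List<long[]> split(long startMillis, long endMillis, int partitions) {
    if (partitions <= 0 || endMillis <= startMillis) {
      throw new IllegalArgumentException("need partitions > 0 and end > start");
    }
    long span = endMillis - startMillis;
    List<long[]> ranges = new ArrayList<>();
    for (int i = 0; i < partitions; i++) {
      // Each boundary is computed from the full span, so consecutive slices
      // share a boundary and the last slice ends exactly at endMillis.
      long low = startMillis + span * i / partitions;
      long high = startMillis + span * (i + 1) / partitions;
      if (high > low) { // skip empty slices when span < partitions
        ranges.add(new long[] {low, high});
      }
    }
    return ranges;
  }

  public static void main(String[] args) {
    for (long[] r : split(0L, 100L, 4)) {
      System.out.println(r[0] + " -> " + r[1]);
    }
  }
}
```

Each slice would then back one sub-query over its time range. Splitting by time this way says nothing about how many rows fall into each slice, which is one plausible reason a key-based scheme such as PK-chunking is attractive for very large tables.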
[GitHub] [incubator-gobblin] arekusuri commented on a change in pull request #2722: GOBBLIN-865: Add feature that enables PK-chunking in partition
arekusuri commented on a change in pull request #2722: GOBBLIN-865: Add feature that enables PK-chunking in partition
URL: https://github.com/apache/incubator-gobblin/pull/2722#discussion_r319713269

##
File path: gobblin-salesforce/src/main/java/org/apache/gobblin/salesforce/SalesforceSource.java
##
@@ -146,6 +156,95 @@ protected void addLineageSourceInfo(SourceState sourceState, SourceEntity entity
   @Override
   protected List generateWorkUnits(SourceEntity sourceEntity, SourceState state, long previousWatermark) {
+    String partitionType = state.getProp(PARTITION_TYPE, "PK_CHUNKING");
+    if (partitionType.equals("PK_CHUNKING")) {
+      return generateWorkUnitsPkChunking(sourceEntity, state, previousWatermark);
+    } else {
+      return generateWorkUnitsStrategy(sourceEntity, state, previousWatermark);
+    }
+  }
+
+  /**
+   * generate workUnit with noQuery=true
+   */
+  private List generateWorkUnitsPkChunking(SourceEntity sourceEntity, SourceState state, long previousWatermark) {
+    List batchIdAndResultIds = executeQueryWithPkChunking(state, previousWatermark);
+    List ret = createWorkUnits(sourceEntity, state, batchIdAndResultIds);
+    return ret;
+  }
+
+  private List executeQueryWithPkChunking(
+      SourceState sourceState,
+      long previousWatermark
+  ) throws RuntimeException {
+    Properties commonProperties = sourceState.getCommonProperties();
+    Properties specProperties = sourceState.getSpecProperties();
+    State state = new State();
+    state.setProps(commonProperties, specProperties);
+    WorkUnit workUnit = WorkUnit.createEmpty();
+    try {
+      WorkUnitState workUnitState = new WorkUnitState(workUnit, state);
+      workUnitState.setId("test" + new Random().nextInt());
+      workUnitState.setProp(ENABLE_PK_CHUNKING_KEY, true); // set extractor enable pk chunking
+      int chunkSize = workUnitState.getPropAsInt(PARTITION_PK_CHUNKING_SIZE, DEFAULT_PK_CHUNKING_SIZE);

Review comment:
I was thinking we might want to keep 2nd-level PK-chunking and have a separate property for it.
As we discussed, 2nd-level PK-chunking doesn't make sense, so I will remove this property.
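The chunk size read in the diff above presumably ends up in Salesforce's PK-chunking request header, which is what the SFDC doc linked from the issue describes. The following is a minimal sketch under that assumption: it relies on the Bulk API header name Sforce-Enable-PKChunking, its chunkSize option, and an upper bound of 250,000 records per chunk as described in that doc. The class PkChunkingHeaderSketch and its method are hypothetical and do not come from the pull request.

```java
/** Hypothetical helper (not from the pull request): builds the PK-chunking header value. */
final class PkChunkingHeaderSketch {
  // Header name as described in the Salesforce Bulk API documentation.
  static final String HEADER_NAME = "Sforce-Enable-PKChunking";
  // Assumed upper bound on records per chunk; verify against the linked doc.
  static final int MAX_CHUNK_SIZE = 250_000;

  /** Returns the header value for the requested chunk size, rejecting out-of-range sizes. */
  static String headerValue(int chunkSize) {
    if (chunkSize <= 0 || chunkSize > MAX_CHUNK_SIZE) {
      throw new IllegalArgumentException(
          "chunkSize must be in 1.." + MAX_CHUNK_SIZE + ", got " + chunkSize);
    }
    return "chunkSize=" + chunkSize;
  }

  public static void main(String[] args) {
    System.out.println(HEADER_NAME + ": " + headerValue(50_000));
  }
}
```

Validating the size before the request is sent keeps a misconfigured job property from surfacing only as a server-side rejection.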
[jira] [Updated] (GOBBLIN-707) combine & standardize all gobblin scripts into one master script & restructure configs accordingly
[ https://issues.apache.org/jira/browse/GOBBLIN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Sen updated GOBBLIN-707:

Description:
Gobblin supports multiple execution modes (CLI, Standalone, cluster-master, cluster-worker, AWS, YARN, MR) and various command-line utilities to run CLI and admin commands. The problem is that each CLI and execution mode has its own script to manage the service, which causes the following problems.

Having individual scripts introduces a lot of issues:
# All scripts handle Gobblin variables and user parameters differently, which is highly inconsistent across the various Gobblin scripts, not to mention the different features supported by different scripts.
# Functionality around start, stop, status checking, and PID handling, among many other things, varies widely depending on each script's implementation.
# Features like GC & JVM params, log4j file selection, classpath calculation, etc. exist in some Gobblin scripts but not all, adding to an inconsistent user experience.
# Code duplication: all the Gobblin scripts share a lot of common code to handle params, start and stop services, status checks, PID handling, etc. Combining all the scripts into one not only makes maintenance easier but also brings clarity and consistency.
# Basically, the current 13 different scripts confuse new users about how to use Gobblin.

Solution:
1. There can be one gobblin.sh script to handle all Gobblin commands and deployment options, as per the following signature.
NOTE: This
{{gobblin.sh }}
{{gobblin.sh }}
{{commands values: admin, cli, statestore-check, statestore-clean, historystore-manager, classpath}}
{{service values: standalone, cluster-master, cluster-worker, aws, yarn, mr, service}}

With the above change, the following become valid commands.
{code:java}
# all under GobblinCli class
gobblin run listQuickApps -> gobblin cli run listQuickApps
gobblin run -> gobblin cli run

# class: JobStateToJsonConverter
statestore-checker.sh -> gobblin cli job-state-to-json

# class: StateStoreCleaner
statestore-clean.sh -> the class is deprecated so no need to migrate this over.

# class: DatabaseJobHistoryStoreSchemaManager
historystore-manager.sh -> gobblin cli job-store-schema-manager

# class: Cli
gobblin-admin.sh -> gobblin cli admin

# all gobblin deployment modes
gobblin-cluster-master.sh -> gobblin service cluster-master start|stop|status
gobblin-cluster-worker.sh -> gobblin service cluster-worker start|stop|status
gobblin-compaction.sh -> gobblin-compaction.sh ( kept as it is for now, can be migrated to new script framework)
gobblin-mapreduce.sh -> gobblin service mapreduce start|stop|status
gobblin-service.sh -> gobblin service service-manager start|stop|status
gobblin-standalone.sh -> gobblin service standalone start|stop|status
gobblin-yarn.sh -> gobblin service yarn start|stop|status
{code}
2. Also, all configurations for each mode need to be structured and de-duplicated accordingly to make it clear which config will be picked up for which execution mode. This would be well defined in the command help instructions.

{color:#ff} NOTE: this refactoring adds all cli and service commands to gobblin.sh and hence changes the syntax for all commands and services.{color}

was:
Gobblin supports multiple execution modes (CLI, Standalone, cluster-master, cluster-worker, AWS, YARN, MR) and various command-line utilities to run CLI and admin commands. The problem is that each CLI and execution mode has its own script to manage the service, which causes the following problems.

Having individual scripts introduces a lot of issues:
# All scripts handle Gobblin variables and user parameters differently, which is highly inconsistent across the various Gobblin scripts, not to mention the different features supported by different scripts.
# Functionality around start, stop, status checking, and PID handling, among many other things, varies widely depending on each script's implementation.
# Features like GC & JVM params, log4j file selection, classpath calculation, etc. exist in some Gobblin scripts but not all, adding to an inconsistent user experience.
# Code duplication: all the Gobblin scripts share a lot of common code to handle params, start and stop services, status checks, PID handling, etc. Combining all the scripts into one not only makes maintenance easier but also brings clarity and consistency.
# Basically, the current 13 different scripts confuse new users about how to use Gobblin.

Solution:
1. There can be one gobblin.sh script to handle all Gobblin commands and deployment options, as per the following signature.
NOTE: This
{{gobblin.sh }}
{{gobblin.sh }}
{{commands values: admin, cli, statestore-check, statestore-clean, historystore-manager, classpath}}
{{service values: standalone, c
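The deployment-mode part of the mapping proposed in the updated description can be read as a small lookup table. The sketch below is only an illustration of that table in Java: the class LegacyScriptMapping and the method toServiceCommand are hypothetical and are not part of Gobblin or of the proposed gobblin.sh. The script names and service names are copied from the {code} block in the description.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup table (not part of Gobblin) for the proposed script migration. */
final class LegacyScriptMapping {
  private static final List<String> ACTIONS = Arrays.asList("start", "stop", "status");
  private static final Map<String, String> SERVICES = new LinkedHashMap<>();

  static {
    // Legacy script name -> service name under the proposed "gobblin service" command.
    SERVICES.put("gobblin-cluster-master.sh", "cluster-master");
    SERVICES.put("gobblin-cluster-worker.sh", "cluster-worker");
    SERVICES.put("gobblin-mapreduce.sh", "mapreduce");
    SERVICES.put("gobblin-service.sh", "service-manager");
    SERVICES.put("gobblin-standalone.sh", "standalone");
    SERVICES.put("gobblin-yarn.sh", "yarn");
  }

  /** Translates a legacy script plus action into the proposed single-script syntax. */
  static String toServiceCommand(String legacyScript, String action) {
    String service = SERVICES.get(legacyScript);
    if (service == null) {
      throw new IllegalArgumentException("no service mapping for " + legacyScript);
    }
    if (!ACTIONS.contains(action)) {
      throw new IllegalArgumentException("unsupported action: " + action);
    }
    return "gobblin service " + service + " " + action;
  }

  public static void main(String[] args) {
    System.out.println(toServiceCommand("gobblin-yarn.sh", "status"));
  }
}
```

gobblin-compaction.sh is deliberately absent from the table because the description keeps that script as it is for now.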
[jira] [Updated] (GOBBLIN-707) combine & standardize all gobblin scripts into one master script & restructure configs accordingly
[ https://issues.apache.org/jira/browse/GOBBLIN-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Sen updated GOBBLIN-707:

Description:
Gobblin supports multiple execution modes (CLI, Standalone, cluster-master, cluster-worker, AWS, YARN, MR) and various command-line utilities to run CLI and admin commands. The problem is that each CLI and execution mode has its own script to manage the service, which causes the following problems.

Having individual scripts introduces a lot of issues:
# All scripts handle Gobblin variables and user parameters differently, which is highly inconsistent across the various Gobblin scripts, not to mention the different features supported by different scripts.
# Functionality around start, stop, status checking, and PID handling, among many other things, varies widely depending on each script's implementation.
# Features like GC & JVM params, log4j file selection, classpath calculation, etc. exist in some Gobblin scripts but not all, adding to an inconsistent user experience.
# Code duplication: all the Gobblin scripts share a lot of common code to handle params, start and stop services, status checks, PID handling, etc. Combining all the scripts into one not only makes maintenance easier but also brings clarity and consistency.
# Basically, the current 13 different scripts confuse new users about how to use Gobblin.

Solution:
1. There can be one gobblin.sh script to handle all Gobblin commands and deployment options, as per the following signature.
NOTE: This
{{gobblin.sh }}
{{gobblin.sh }}
{{commands values: admin, cli, statestore-check, statestore-clean, historystore-manager, classpath}}
{{service values: standalone, cluster-master, cluster-worker, aws, yarn, mr, service}}

With the above change, the following become valid commands.
{code:java}
# all under GobblinCli class
gobblin run listQuickApps -> gobblin cli run listQuickApps
gobblin run -> gobblin cli run

# class: JobStateToJsonConverter
statestore-checker.sh -> gobblin statestore-checker

# class: StateStoreCleaner
statestore-clean.sh -> gobblin statestore-clean

# class: DatabaseJobHistoryStoreSchemaManager
historystore-manager.sh -> gobblin historystore-manager

# class: Cli
gobblin-admin.sh -> gobblin admin

# all gobblin deployment modes
gobblin-cluster-master.sh -> gobblin cluster-master start|stop|status
gobblin-cluster-worker.sh -> gobblin cluster-master start|stop|status
gobblin-compaction.sh -> gobblin cluster-master start|stop|status
gobblin-env.sh -> gobblin cluster-master start|stop|status
gobblin-mapreduce.sh -> gobblin cluster-master start|stop|status
gobblin-service.sh -> gobblin cluster-master start|stop|status
gobblin-standalone.sh -> gobblin cluster-master start|stop|status
gobblin-yarn.sh -> gobblin cluster-master start|stop|status
{code}
2. Also, configs need to be structured and de-duplicated accordingly to make it clear which config will be picked up for which execution mode.

{color:#ff} NOTE: this refactoring adds all cli and service commands to gobblin.sh and hence changes the syntax for all commands and services.{color}

was:
Gobblin supports multiple execution modes (CLI, Standalone, cluster-master, cluster-worker, AWS, YARN, MR) and various command-line utilities to run CLI and admin commands. There is an individual script for each of them.

Having individual scripts introduces a lot of issues:
# All scripts handle Gobblin variables and user parameters differently, which is highly inconsistent across the various Gobblin scripts.
# Functionality around start, stop, status checking, and PID handling, among many other things, varies widely depending on each script's implementation.
# Features like GC & JVM params, log4j file selection, classpath calculation, etc. exist in some Gobblin scripts but not all, adding to an inconsistent user experience.
# Maintaining a total of 13 scripts would be too much effort. Also, all the Gobblin scripts share a lot of common code to handle params, start and stop services, status checks, PID handling, etc. Combining all the scripts into one not only makes maintenance easier but also brings clarity and consistency.

Solution:
1. There can be one gobblin.sh script to handle all Gobblin commands and deployment options, as per the following signature.
NOTE: This
{{gobblin.sh }}
{{gobblin.sh }}
{{commands values: admin, cli, statestore-check, statestore-clean, historystore-manager, classpath}}
{{service values: standalone, cluster-master, cluster-worker, aws, yarn, mr, service}}

With the above change, the following become valid commands.
{code:java}
# all under GobblinCli class
gobblin run listQuickApps -> gobblin cli run listQuickApps
gobblin run -> gobblin cli run
# cl
[GitHub] [incubator-gobblin] asfgit closed pull request #2719: [GOBBLIN-863]Handle race condition issue for hive registration
asfgit closed pull request #2719: [GOBBLIN-863]Handle race condition issue for hive registration URL: https://github.com/apache/incubator-gobblin/pull/2719
[jira] [Work logged] (GOBBLIN-863) Handle race condition between concurrent Gobblin tasks performing Hive registration
[ https://issues.apache.org/jira/browse/GOBBLIN-863?focusedWorklogId=305649&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-305649 ]

ASF GitHub Bot logged work on GOBBLIN-863:
--
Author: ASF GitHub Bot
Created on: 03/Sep/19 15:30
Start Date: 03/Sep/19 15:30
Worklog Time Spent: 10m
Work Description: asfgit commented on pull request #2719: [GOBBLIN-863]Handle race condition issue for hive registration
URL: https://github.com/apache/incubator-gobblin/pull/2719

Issue Time Tracking
---
Worklog Id: (was: 305649)
Time Spent: 4h (was: 3h 50m)

> Handle race condition between concurrent Gobblin tasks performing Hive registration
> ---
>
> Key: GOBBLIN-863
> URL: https://issues.apache.org/jira/browse/GOBBLIN-863
> Project: Apache Gobblin
> Issue Type: Task
> Components: hive-registration
> Reporter: Zihan Li
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 4h
> Remaining Estimate: 0h
>