[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller
[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007908#comment-17007908 ]

lamber-ken commented on HUDI-486:
---------------------------------

Let me summarize the steps to reproduce.

1. Check out commit e1e5fe33249bf511486073dd9cf48e5b7ea14816.

2. Build the source:
{code:java}
mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
{code}

3. Set up docker:
{code:java}
cd docker && ./setup_demo.sh
{code}

4. Generate data:
{code:java}
cat demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
{code}

5. Go into the container:
{code:java}
docker exec -it adhoc-2 /bin/bash
{code}

6. Consume the Kafka data and sync to Hive:
{code:java}
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --storage-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

/var/hoodie/ws/hudi-hive/run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:1 \
  --user hive \
  --pass hive \
  --partitioned-by dt \
  --base-path /user/hive/warehouse/stock_ticks_cow \
  --database default \
  --table stock_ticks_cow
{code}

7. Create incr_pull.txt:
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}

8. Execute org.apache.hudi.utilities.HiveIncrementalPuller:
{code:java}
java -cp ./spark/jars/commons-cli-1.2.jar:./spark/jars/htrace-core-3.1.0-incubating.jar:./spark/jars/hadoop-hdfs-2.7.3.jar:./hive/lib/hive-exec-2.3.3.jar:./hive/lib/hive-common-2.3.3.jar:./hive/lib/hive-jdbc-2.3.3.jar:./hive/lib/hive-service-2.3.3.jar:./hive/lib/hive-service-rpc-2.3.3.jar:./spark/jars/httpcore-4.4.10.jar:./spark/jars/slf4j-api-1.7.16.jar:./spark/jars/hadoop-auth-2.7.3.jar:./hive/lib/commons-lang-2.6.jar:./spark/jars/commons-configuration-1.6.jar:./spark/jars/commons-collections-3.2.2.jar:./spark/jars/hadoop-common-2.7.3.jar:./hive/lib/antlr-runtime-3.5.2.jar:./spark/jars/log4j-1.2.17.jar:./hive/lib/commons-logging-1.2.jar:./hive/lib/commons-io-2.4.jar:$HUDI_UTILITIES_BUNDLE \
  org.apache.hudi.utilities.HiveIncrementalPuller \
  --hiveUrl jdbc:hive2://hiveserver:1 \
  --hiveUser hive \
  --hivePass hive \
  --extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt \
  --sourceDb default \
  --sourceTable stock_ticks_cow \
  --targetDb default \
  --tmpdb default \
  --targetTable tempTable \
  --fromCommitTime 0 \
  --maxCommits 1
{code}

> Improve documentation for using HiveIncrementalPuller
> -----------------------------------------------------
>
> Key: HUDI-486
> URL: https://issues.apache.org/jira/browse/HUDI-486
> Project: Apache Hudi (incubating)
> Issue Type: Sub-task
> Components: Incremental Pull
> Reporter: Pratyaksh Sharma
> Assignee: Pratyaksh Sharma
> Priority: Major
>
> For using HiveIncrementalPuller, one needs to have a lot of jars in
> classPath. These jars are not listed anywhere. As a result, one has to keep
> on adding the jars incrementally to the classPath with every
> NoClassDefFoundError coming up when executing.
> We should list down the jars needed so that it becomes easy for a first-time
> user to use the mentioned tool.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
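The long -cp argument in step 8 is easy to get wrong by hand. As a rough sketch only (build_classpath is a hypothetical helper, not something shipped with Hudi), the classpath could be assembled from a list of jar paths, stopping at the first jar that does not exist instead of waiting for a NoClassDefFoundError at runtime:

```shell
#!/bin/sh
# Hypothetical helper (not part of Hudi): join the jar paths given as
# arguments into a ':'-separated classpath string.
# Prints the classpath on stdout; returns 1 if any jar is missing.
build_classpath() {
  classpath=""
  for jar in "$@"; do
    if [ ! -f "$jar" ]; then
      echo "missing jar: $jar" >&2
      return 1
    fi
    # Prepend a ':' separator only when the string is already non-empty.
    classpath="${classpath:+$classpath:}$jar"
  done
  printf '%s\n' "$classpath"
}
```

It might then be used as CP=$(build_classpath ./spark/jars/commons-cli-1.2.jar ./hive/lib/hive-jdbc-2.3.3.jar "$HUDI_UTILITIES_BUNDLE") followed by java -cp "$CP" org.apache.hudi.utilities.HiveIncrementalPuller with the same options as in step 8. The jar list itself still has to come from the command above; the helper only validates and joins it.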
[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller
[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006710#comment-17006710 ]

Pratyaksh Sharma commented on HUDI-486:
---------------------------------------

Sure [~lamber-ken]. I am going through the diff of this commit that you mentioned above. I will sync with you in some time.
[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller
[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006653#comment-17006653 ]

lamber-ken commented on HUDI-486:
---------------------------------

Hi [~Pratyaksh], I figured out that commit [1] is what affects the hoodie-utilities.jar file. Before that commit, hoodie-utilities.jar was a fat jar. We can work together to figure out what changed. :D

[1] [https://github.com/apache/incubator-hudi/commit/7a973a694488930627028b3175a731a23e3d3f32]
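One way to see whether a given bundle still behaves like a fat jar is to check whether it contains a dependency class directly. The following is only a sketch: class_to_entry and jar_has_class are hypothetical helper names, and jar_has_class assumes the unzip command is available in the container.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Hudi).

# Convert a fully qualified class name into its entry path inside a jar,
# e.g. a.b.C becomes a/b/C.class.
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Return 0 if the jar ($1) lists the class ($2), non-zero otherwise.
# Assumes `unzip` is installed; `unzip -l` only lists entries.
jar_has_class() {
  unzip -l "$1" 2>/dev/null | grep -q -F "$(class_to_entry "$2")"
}
```

For example, jar_has_class "$HUDI_UTILITIES_BUNDLE" org.apache.hive.jdbc.HiveDriver would indicate whether the Hive JDBC driver is bundled or has to be added to the classpath separately.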
[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller
[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006641#comment-17006641 ]

Pratyaksh Sharma commented on HUDI-486:
---------------------------------------

[~vinoth] Let me try doing that and get back to you.
[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller
[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006278#comment-17006278 ]

Vinoth Chandar commented on HUDI-486:
-------------------------------------

Hmm, could we replicate `hudi-hive-bundle`? It seems this needs to be different from utilities-bundle (for obvious reasons). How about moving the tool itself to hudi-hive and reusing hudi-hive-bundle?
[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller
[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006259#comment-17006259 ]

lamber-ken commented on HUDI-486:
---------------------------------

(y) +1
[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller
[ https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005392#comment-17005392 ]

Pratyaksh Sharma commented on HUDI-486:
---------------------------------------

Here is the working command:

java -cp /tmp/jars/commons-io-2.6.jar:/tmp/jars/protobuf-java-2.5.0.jar:/tmp/jars/commons-cli-1.2.jar:/tmp/jars/htrace-core-3.1.0-incubating.jar:/tmp/jars/hadoop-hdfs-2.7.3.jar:/tmp/jars/hive-serde-2.3.1.jar:/tmp/jars/hive-common-2.3.1.jar:/tmp/jars/hive-service-2.3.1.jar:/tmp/jars/hive-service-rpc-2.3.1.jar:/tmp/jars/httpcore-4.3.2.jar:/tmp/jars/libthrift-0.12.0.jar:/tmp/jars/slf4j-api-1.7.16.jar:/tmp/jars/hadoop-auth-2.7.3.jar:/var/hoodie/ws/docker/demo/config/commons-lang-2.6.jar:/var/hoodie/ws/docker/demo/config/commons-configuration-1.6.jar:/var/hoodie/ws/docker/demo/config/commons-collections-3.2.2.jar:/var/hoodie/ws/docker/demo/config/guava-15.0.jar:/var/hoodie/ws/docker/demo/config/commons-logging-1.1.3.jar:/var/hoodie/ws/docker/demo/config/hadoop-common-2.7.3.jar:/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:/var/hoodie/ws/docker/demo/config/antlr-runtime-3.5.2.jar:$HUDI_UTILITIES_BUNDLE \
  org.apache.hudi.utilities.HiveIncrementalPuller \
  --hiveUrl jdbc:hive2://hiveserver:1 \
  --hiveUser hive \
  --hivePass hive \
  --extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt \
  --sourceDb default \
  --sourceTable stock_ticks_cow \
  --targetDb tmp \
  --targetTable tempTable \
  --fromCommitTime 0 \
  --maxCommits 1

Options:
# Mention the above command somewhere.
# List the jars needed.
# Create a script similar to the one we have for HiveSyncTool.

We can choose any one of the above. [~vinoth] WDYT?
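If the third option is chosen, the script could start from something like the sketch below. It assumes the required jars have first been copied into a single directory, which is not what the command above does (it draws from two directories), and classpath_from_dir is a hypothetical name, not an existing Hudi script.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hudi): build a ':'-separated classpath
# from every *.jar file directly inside the directory given as $1.
# Prints an empty line when the directory holds no jars.
classpath_from_dir() {
  classpath=""
  for jar in "$1"/*.jar; do
    # When nothing matches, the pattern stays unexpanded; skip it.
    [ -f "$jar" ] || continue
    classpath="${classpath:+$classpath:}$jar"
  done
  printf '%s\n' "$classpath"
}
```

A wrapper could then run java -cp "$(classpath_from_dir /tmp/jars):$HUDI_UTILITIES_BUNDLE" org.apache.hudi.utilities.HiveIncrementalPuller followed by the tool options. Java also accepts a directory wildcard such as -cp "/tmp/jars/*", which would avoid the helper entirely; the explicit loop just makes the resolved jar list easy to print and review.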