[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007908#comment-17007908
 ] 

lamber-ken commented on HUDI-486:
-

Let me summarize the steps to reproduce.

1, check out commit e1e5fe33249bf511486073dd9cf48e5b7ea14816

2, build source
{code:java}
mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true 
{code}
3, setup docker 
{code:java}
cd docker && ./setup_demo.sh{code}
4, generate data
{code:java}
cat demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
{code}
5, go into the container
{code:java}
docker exec -it adhoc-2 /bin/bash 
{code}
6, consume Kafka data and sync to Hive
{code:java}
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
$HUDI_UTILITIES_BUNDLE \
--storage-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field ts \
--target-base-path /user/hive/warehouse/stock_ticks_cow \
--target-table stock_ticks_cow \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider

/var/hoodie/ws/hudi-hive/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:1 \
--user hive \
--pass hive \
--partitioned-by dt \
--base-path /user/hive/warehouse/stock_ticks_cow \
--database default \
--table stock_ticks_cow
{code}
7, create incr_pull.txt
{code:java}
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
{code}
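
If incr_pull.txt is created from a shell, the backticks around _hoodie_commit_time have to be protected from command substitution. A minimal sketch, assuming a POSIX shell; the function name write_incr_pull_sql is hypothetical, and the target path is whatever --extractSQLFile will point at later:

```shell
# Hypothetical helper: write the extract SQL from step 7 to the given file.
# The quoted delimiter ('EOF') stops the shell from treating the backticks
# as command substitution, so the query is written out verbatim.
write_incr_pull_sql() {
  cat > "$1" <<'EOF'
select `_hoodie_commit_time`, symbol, ts from default.stock_ticks_cow where symbol = 'GOOG' and `_hoodie_commit_time` > '20180924064621'
EOF
}
```

For example, write_incr_pull_sql /var/hoodie/ws/docker/demo/config/incr_pull.txt would produce the file at the path that step 8 passes to --extractSQLFile.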
8, execute org.apache.hudi.utilities.HiveIncrementalPuller
{code:java}
java -cp ./spark/jars/commons-cli-1.2.jar:./spark/jars/htrace-core-3.1.0-incubating.jar:./spark/jars/hadoop-hdfs-2.7.3.jar:./hive/lib/hive-exec-2.3.3.jar:./hive/lib/hive-common-2.3.3.jar:./hive/lib/hive-jdbc-2.3.3.jar:./hive/lib/hive-service-2.3.3.jar:./hive/lib/hive-service-rpc-2.3.3.jar:./spark/jars/httpcore-4.4.10.jar:./spark/jars/slf4j-api-1.7.16.jar:./spark/jars/hadoop-auth-2.7.3.jar:./hive/lib/commons-lang-2.6.jar:./spark/jars/commons-configuration-1.6.jar:./spark/jars/commons-collections-3.2.2.jar:./spark/jars/hadoop-common-2.7.3.jar:./hive/lib/antlr-runtime-3.5.2.jar:./spark/jars/log4j-1.2.17.jar:./hive/lib/commons-logging-1.2.jar:./hive/lib/commons-io-2.4.jar:$HUDI_UTILITIES_BUNDLE \
org.apache.hudi.utilities.HiveIncrementalPuller \
--hiveUrl jdbc:hive2://hiveserver:1 \
--hiveUser hive \
--hivePass hive \
--extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt \
--sourceDb default \
--sourceTable stock_ticks_cow \
--targetDb default \
--tmpdb default \
--targetTable tempTable \
--fromCommitTime 0 \
--maxCommits 1
{code}
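
Java silently skips classpath entries that do not exist, so a wrong jar path only shows up later as a NoClassDefFoundError. A small pre-flight check can list the missing entries first. This is a sketch assuming a POSIX shell; missing_classpath_entries is a hypothetical helper, not part of Hudi:

```shell
# Hypothetical helper: print every entry of a colon-separated classpath
# that does not exist on disk, one per line.
# The body runs in a subshell "( ... )" so the IFS and globbing changes
# do not leak into the calling shell.
missing_classpath_entries() (
  IFS=:
  set -f                              # do not expand wildcards in entries
  for entry in $1; do
    [ -n "$entry" ] || continue       # ignore empty fields such as "a::b"
    [ -e "$entry" ] || printf '%s\n' "$entry"
  done
)
```

Calling it with the same string that is passed to java -cp prints nothing when every jar is present.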

> Improve documentation for using HiveIncrementalPuller
> -
>
> Key: HUDI-486
> URL: https://issues.apache.org/jira/browse/HUDI-486
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> For using HiveIncrementalPuller, one needs to have a lot of jars in 
> classPath. These jars are not listed anywhere. As a result, one has to keep 
> on adding the jars incrementally to the classPath with every 
> NoClassDefFoundError coming up when executing. 
> We should list down the jars needed so that it becomes easy for a first-time 
> user to use the mentioned tool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-02 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006710#comment-17006710
 ] 

Pratyaksh Sharma commented on HUDI-486:
---

Sure [~lamber-ken]. I am going through the diff of this commit that you 
mentioned above. I will sync with you in some time. 



[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-02 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006653#comment-17006653
 ] 

lamber-ken commented on HUDI-486:
-

Hi [~Pratyaksh], I figured out that commit [1] is what changed the 
hoodie-utilities.jar file. Before that commit, hoodie-utilities.jar was a fat 
jar.

We can work together to figure out what changed. :D

[1][https://github.com/apache/incubator-hudi/commit/7a973a694488930627028b3175a731a23e3d3f32]



[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2020-01-01 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006641#comment-17006641
 ] 

Pratyaksh Sharma commented on HUDI-486:
---

[~vinoth] Let me try doing that and get back to you. 



[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2019-12-31 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006278#comment-17006278
 ] 

Vinoth Chandar commented on HUDI-486:
-

Hmmm.. could we replicate `hudi-hive-bundle`? Seems like this needs to be 
different from utilities-bundle (for obvious reasons). How about moving the 
tool itself to hudi-hive and reusing hudi-hive-bundle? 



[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2019-12-31 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006259#comment-17006259
 ] 

lamber-ken commented on HUDI-486:
-

(y) +1



[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005392#comment-17005392
 ] 

Pratyaksh Sharma commented on HUDI-486:
---

Here is the working command:

java -cp /tmp/jars/commons-io-2.6.jar:/tmp/jars/protobuf-java-2.5.0.jar:/tmp/jars/commons-cli-1.2.jar:/tmp/jars/htrace-core-3.1.0-incubating.jar:/tmp/jars/hadoop-hdfs-2.7.3.jar:/tmp/jars/hive-serde-2.3.1.jar:/tmp/jars/hive-common-2.3.1.jar:/tmp/jars/hive-service-2.3.1.jar:/tmp/jars/hive-service-rpc-2.3.1.jar:/tmp/jars/httpcore-4.3.2.jar:/tmp/jars/libthrift-0.12.0.jar:/tmp/jars/slf4j-api-1.7.16.jar:/tmp/jars/hadoop-auth-2.7.3.jar:/var/hoodie/ws/docker/demo/config/commons-lang-2.6.jar:/var/hoodie/ws/docker/demo/config/commons-configuration-1.6.jar:/var/hoodie/ws/docker/demo/config/commons-collections-3.2.2.jar:/var/hoodie/ws/docker/demo/config/guava-15.0.jar:/var/hoodie/ws/docker/demo/config/commons-logging-1.1.3.jar:/var/hoodie/ws/docker/demo/config/hadoop-common-2.7.3.jar:/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:/var/hoodie/ws/docker/demo/config/antlr-runtime-3.5.2.jar:$HUDI_UTILITIES_BUNDLE \
org.apache.hudi.utilities.HiveIncrementalPuller \
--hiveUrl jdbc:hive2://hiveserver:1 \
--hiveUser hive \
--hivePass hive \
--extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt \
--sourceDb default \
--sourceTable stock_ticks_cow \
--targetDb tmp \
--targetTable tempTable \
--fromCommitTime 0 \
--maxCommits 1

Options:
 # Mention the above command somewhere
 # List the jars needed
 # Create a script similar to the one we have for HiveSyncTool

We can choose any one of the above. [~vinoth] WDYT?
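
The third option could look roughly like the sketch below. It is not an existing Hudi script: build_classpath and run_incremental_puller are hypothetical names, and the jar directory is whatever the environment provides (for example /tmp/jars in the command above). It assumes a POSIX shell and that $HUDI_UTILITIES_BUNDLE is set, as in the demo container.

```shell
# Hypothetical helper: join every *.jar found in the given directories
# into one colon-separated classpath string.
# The body runs in a subshell "( ... )" so its variables stay local.
build_classpath() (
  cp=""
  for dir in "$@"; do
    for jar in "$dir"/*.jar; do
      [ -e "$jar" ] || continue       # glob matched nothing in this dir
      cp="${cp:+$cp:}$jar"            # add ":" only between entries
    done
  done
  printf '%s' "$cp"
)

# Hypothetical entry point: the first argument is the jar directory, the
# remaining arguments are passed to HiveIncrementalPuller unchanged.
run_incremental_puller() {
  jar_dir="$1"
  shift
  java -cp "$(build_classpath "$jar_dir"):$HUDI_UTILITIES_BUNDLE" \
    org.apache.hudi.utilities.HiveIncrementalPuller "$@"
}
```

A call would then shrink to something like run_incremental_puller /tmp/jars --hiveUser hive --hivePass hive followed by the other options from the command above. Collecting jars by directory keeps the command short, at the cost of possibly pulling in jars the tool does not need.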
