[ https://issues.apache.org/jira/browse/HUDI-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
vinoyang closed HUDI-574.
-------------------------

> CLI counts small file inserts as updates
> ----------------------------------------
>
>                 Key: HUDI-574
>                 URL: https://issues.apache.org/jira/browse/HUDI-574
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Vinoth Chandar
>            Assignee: lamber-ken
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.2
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> User report :
>
> I'm trying to understand the {{.commit}} output and how it relates to the
> output from the {{hudi-cli}} tool, and I'm finding it difficult to reconcile
> my findings. Specifically, I want to know the number of
> updates/inserts/deletes across all partitions for a given commit (an upsert).
> From the {{cli}}:
> hudi:exec_unit_ver->commit showpartitions --commit 20200108153617
> ╔════════════════╤═══════════════════╤═════════════════════╤════════════════════════╤═══════════════════════╤═════════════════════╤══════════════╗
> ║ Partition Path │ Total Files Added │ Total Files Updated │ Total Records Inserted │ Total Records Updated │ Total Bytes Written │ Total Errors ║
> ╠════════════════╪═══════════════════╪═════════════════════╪════════════════════════╪═══════════════════════╪═════════════════════╪══════════════╣
> ║ 0              │ 0                 │ 9                   │ 0                      │ 2091                  │ 983.7 MB            │ 0            ║
> ╟────────────────┼───────────────────┼─────────────────────┼────────────────────────┼───────────────────────┼─────────────────────┼──────────────╢
> But in the {{20200108153617.commit}} file for that commit, one of the files in
> the partition "0" has
> "numInserts" : 44448,
> so I'm not sure why {{Total Records Inserted}} is reported as zero. I checked
> that the sum of {{numUpdateWrites}} across all files in the partition matches
> 2091.
> Generally, I think it would be helpful to have {{totalRecordsInserted}}
> {{totalRecordsUpdated}} {{totalRecordsDeleted}} in the commit metadata
> (although it's not a big issue to sum the individual numbers from each file
> in each partition).
>
> [~vinoth]
>
> On the counts, when I checked the code, it's counting the inserts as updates,
> since Hudi packed them onto existing files, to honor target file size ..
>
> for (HoodieWriteStat stat : stats) {
>   if (stat.getPrevCommit().equals(HoodieWriteStat.NULL_COMMIT)) {
>     totalFilesAdded += 1;
>     totalRecordsInserted += stat.getNumWrites();
>   } else {
>     totalFilesUpdated += 1;
>     totalRecordsUpdated += stat.getNumUpdateWrites();
>   }
>   totalBytesWritten += stat.getTotalWriteBytes();
>   totalWriteErrors += stat.getTotalWriteErrors();
> }
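
The loop quoted in this issue decides between "inserted" and "updated" purely on whether the file had a previous commit, so inserts packed into an existing small file never reach the insert total. A minimal sketch of one alternative is below: file counts still depend on the previous commit, but record counts are summed from the per-file insert and update counters. This is an illustration only and not necessarily the change that went into 0.5.2. The names FileStat and summarize, the NULL_COMMIT value, the previous-commit timestamp, and the way the 2091 updates are split across two files are all hypothetical stand-ins; only the 44448 and 2091 figures come from the report.

```java
import java.util.Arrays;
import java.util.List;

class CommitCountsSketch {
    // Stand-in for HoodieWriteStat.NULL_COMMIT; the actual value is an assumption here.
    static final String NULL_COMMIT = "null";

    /** Hypothetical, simplified stand-in for a per-file write stat. */
    static final class FileStat {
        final String prevCommit;
        final long numInserts;
        final long numUpdateWrites;

        FileStat(String prevCommit, long numInserts, long numUpdateWrites) {
            this.prevCommit = prevCommit;
            this.numInserts = numInserts;
            this.numUpdateWrites = numUpdateWrites;
        }
    }

    /** Returns {filesAdded, filesUpdated, recordsInserted, recordsUpdated}. */
    static long[] summarize(List<FileStat> stats) {
        long filesAdded = 0;
        long filesUpdated = 0;
        long recordsInserted = 0;
        long recordsUpdated = 0;
        for (FileStat stat : stats) {
            // File-level counts still depend on whether the file existed before.
            if (NULL_COMMIT.equals(stat.prevCommit)) {
                filesAdded += 1;
            } else {
                filesUpdated += 1;
            }
            // Record-level counts are taken from the per-file counters, so
            // inserts packed into an existing file are still counted as inserts.
            recordsInserted += stat.numInserts;
            recordsUpdated += stat.numUpdateWrites;
        }
        return new long[] {filesAdded, filesUpdated, recordsInserted, recordsUpdated};
    }

    public static void main(String[] args) {
        // Two existing files; the first also received small-file inserts.
        List<FileStat> stats = Arrays.asList(
            new FileStat("20200108120000", 44448L, 100L),
            new FileStat("20200108120000", 0L, 1991L));
        long[] totals = summarize(stats);
        System.out.println("filesAdded=" + totals[0]
            + " filesUpdated=" + totals[1]
            + " recordsInserted=" + totals[2]
            + " recordsUpdated=" + totals[3]);
    }
}
```

With counters summed this way, a partition in which every file already existed can still report a non-zero insert total, which is the behaviour the reporter expected from {{commit showpartitions}}.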