This isn't really an Accumulo problem, but I'd be grateful to know if anybody else has hit and/or solved it. I'm trying to export a table of ~160B key/value pairs using exporttable and distcp, as shown here: https://accumulo.apache.org/1.7/examples/export
The command I'm using is "hadoop distcp -m 50 -update -skipcrccheck -f /export/mytable/distcp.txt file:///mnt/backup"; distcp.txt contains 718 files. The distcp job never even makes it into YARN; it looks like the driver is stuck sorting the file listing for some reason. An example stack trace is:

"main" #1 prio=5 os_prio=0 tid=0x00007f1994015000 nid=0x7dc2 runnable [0x00007f199c735000]
   java.lang.Thread.State: RUNNABLE
        at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3797)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:4250)
        at java.util.regex.Pattern$Curly.match(Pattern.java:4234)
        at java.util.regex.Pattern$Start.match(Pattern.java:3461)
        at java.util.regex.Matcher.search(Matcher.java:1248)
        at java.util.regex.Matcher.find(Matcher.java:637)
        at java.util.regex.Pattern.split(Pattern.java:1209)
        at java.lang.String.split(String.java:2380)
        at java.lang.String.split(String.java:2422)
        at org.apache.hadoop.util.StringUtils.getTrimmedStrings(StringUtils.java:378)
        at org.apache.hadoop.conf.Configuration.getTrimmedStrings(Configuration.java:1900)
        at org.apache.hadoop.io.serializer.SerializationFactory.<init>(SerializationFactory.java:58)
        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1176)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1094)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:273)
        at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.flush(SequenceFile.java:2946)
        at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:2890)
        at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:2788)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2736)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2777)
        at org.apache.hadoop.tools.util.DistCpUtils.sortListing(DistCpUtils.java:364)
        at org.apache.hadoop.tools.CopyListing.validateFinalListing(CopyListing.java:145)
        at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:91)
        at org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:90)
        at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:84)
        at org.apache.hadoop.tools.FileBasedCopyListing.doBuildListing(FileBasedCopyListing.java:70)
        at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:84)
        at org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:382)
        at org.apache.hadoop.tools.DistCp.createAndSubmitJob(DistCp.java:181)
        at org.apache.hadoop.tools.DistCp.execute(DistCp.java:153)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:126)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)

It's been at it for 16 hours. The exact stack trace varies, but it's always within DistCpUtils.sortListing. The Hadoop distro is HDP 2.3.4, Hadoop 2.7.1. HDFS is running with Kerberos and encryption.

Any advice is very welcome!

Thanks,
-Russ
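One workaround I've been considering (a sketch only; the listing contents and chunk size below are made up for illustration, and the real chunks would need to be pushed back into HDFS before being passed to -f) is to split the 718-entry file listing into smaller pieces so each distcp driver only has to sort a short copy listing:

    # Stand-in for /export/mytable/distcp.txt: 718 one-path-per-line entries.
    printf 'hdfs://nn/accumulo/tables/1/default_tablet/F%07d.rf\n' $(seq 1 718) > distcp.txt

    # Split into chunks of 100 entries with numeric suffixes (chunk-00, chunk-01, ...).
    split -l 100 -d distcp.txt chunk-

    # On the real cluster each chunk would be copied into HDFS and run as its
    # own distcp job; here we just print the commands that would be issued.
    for f in chunk-*; do
      echo hadoop distcp -m 50 -update -skipcrccheck -f "$f" file:///mnt/backup
    done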
