We have a very large body of CSV files (well over 1TB) that need to be imported into HBase. For a single 20GB segment, we're looking at pushing easily 100M flowfiles into HBase, and most of the JSON records we generate from the CSV rows are rather small (roughly 20-250 bytes).
It's going very slowly, and I assume that's because the content and provenance repositories are taxing the disk heavily with that many tiny flowfiles. So I'm wondering if anyone has a suggestion for a good NiFiesque way of solving this. Right now, I'm considering two options:

1. Finding a way to inject the HBase controller service into an ExecuteScript processor, so I can handle the data in large chunks myself: splitting the text and building a List<Put> inside the processor, then issuing one huge batched put (a rough sketch of what I mean is below).
2. Writing a library that lets me generate HFiles from within an ExecuteScript processor.

What I really need is something fast within NiFi that would let me generate huge blocks of updates for HBase and push them out. Any ideas?
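To make option 1 concrete, here's roughly what I have in mind, written as plain Java against the standard hbase-client API (untested sketch; the table name, column family, qualifier, and the CSV parsing are all placeholders, and it assumes a reachable cluster with hbase-site.xml on the classpath — inside ExecuteScript this would be the same calls from Groovy):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedHBaseWriter {

        // Placeholder names -- substitute the real table, family, qualifier.
        private static final TableName TABLE = TableName.valueOf("events");
        private static final byte[] CF = Bytes.toBytes("d");
        private static final byte[] QUAL = Bytes.toBytes("json");

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 BufferedMutator mutator = conn.getBufferedMutator(TABLE)) {
                List<Put> batch = new ArrayList<>(10_000);
                // In the real flow, each line of the 20GB segment would be
                // split and converted here instead of this toy input.
                for (String line : new String[] {"row1,{\"a\":1}", "row2,{\"b\":2}"}) {
                    String[] parts = line.split(",", 2);  // rowkey, JSON payload
                    Put put = new Put(Bytes.toBytes(parts[0]));
                    put.addColumn(CF, QUAL, parts[1].getBytes(StandardCharsets.UTF_8));
                    batch.add(put);
                    if (batch.size() >= 10_000) {  // push in big chunks
                        mutator.mutate(batch);
                        batch.clear();
                    }
                }
                mutator.mutate(batch);  // final partial batch
                mutator.flush();
            }
        }
    }

As far as I can tell, BufferedMutator is the right tool here since it buffers and batches mutations to the region servers instead of doing one RPC per flowfile, which is exactly the overhead I'm trying to avoid.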
Thanks,
Mike