I agree with James. The general pattern here is:

Split with Grouping: Take a look at RouteText. It lets you efficiently split line-oriented data into groups based on matching values, whereas SplitText does a line-for-line split.

Merge Grouped Data: The MergeContent processor will do the trick, and you can use its correlation feature to merge only the records that belong to the same group/pattern.

Write to destination: You can write directly to HDFS using PutHDFS, or you can prepare the data and write to Hive.

Thanks
Joe

On Wed, Nov 2, 2016 at 9:01 PM, James Wing wrote:
> This is absolutely possible. A sample sequence of processors might include:
>
> 1. UpdateAttribute - to extract a record date from the flowfile content into
> an attribute, 'recordgroup' for example
> 2. MergeContent - to group related records together, setting the Correlation
> Attribute Name property to use 'recordgroup'
> 3. UpdateAttribute - (optional) to apply the 'recordgroup' attribute to the
> 'path' and/or 'filename' attributes, depending on how you do #4. May be
> useful to get customized filenames with extensions.
> 4. Put* - to write the grouped file to storage (PutFile, PutHDFS,
> PutS3Object, etc.). With PutHDFS for example, use Expression Language in
> the Directory property to apply your grouping - like
> '/tmp/hive/records/${recordgroup}' to get '/tmp/hive/records/2016-01-01'.
>
> In concept, it's that simple. The #2 MergeContent step can be more
> complicated as you consider how many files should be output from the stream,
> how big they should be, how frequently, and how many bins are likely to be
> open collecting files at any one time. You might also consider compressing
> the files.
>
> Thanks,
>
> James
>
> On Wed, Nov 2, 2016 at 5:34 PM, Santiago Ciciliani wrote:
>>
>> I'm trying to split a stream of data into multiple different files based
>> on the content date.
>>
>> So imagine that you are receiving streams of logs and you want to save as
>> a Hive partitioned table so for example all records with date 2016-01-01
>> into directory dt=2016-01-01.
>>
>> Is this even possible?
>>
>> Thanks
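
For anyone who wants to see the idea outside of NiFi, here is a rough sketch in plain Python of what the group-then-write pattern in this thread amounts to. It is not NiFi configuration and does not use any NiFi API. The assumption that each record starts with a YYYY-MM-DD date, the helper names group_by_date and partition_dir, and the sample log lines are all made up for illustration; only the dt=2016-01-01 directory naming and the /tmp/hive/records base path come from the messages above.

```python
import re
from collections import defaultdict

# Assumption: every record begins with an ISO date (YYYY-MM-DD).
# Adjust the pattern to wherever the date lives in your own log format.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})\b")


def group_by_date(lines):
    """Bin lines by their leading date.

    Conceptually this is the 'split with grouping' plus 'merge by
    correlation value' steps rolled into one in-memory pass.
    """
    groups = defaultdict(list)
    for line in lines:
        match = DATE_RE.match(line)
        # Lines without a recognizable date go to a catch-all bin.
        key = match.group(1) if match else "unmatched"
        groups[key].append(line)
    return dict(groups)


def partition_dir(base, record_date):
    """Build a Hive-style partition directory such as <base>/dt=2016-01-01."""
    return f"{base}/dt={record_date}"


# Hypothetical sample records.
sample = [
    "2016-01-01 10:00:01 GET /index.html",
    "2016-01-02 00:00:07 GET /about.html",
    "2016-01-01 23:59:59 POST /login",
]

grouped = group_by_date(sample)
for record_date, records in sorted(grouped.items()):
    print(partition_dir("/tmp/hive/records", record_date), len(records))
# prints:
# /tmp/hive/records/dt=2016-01-01 2
# /tmp/hive/records/dt=2016-01-02 1
```

In a real flow the binning happens incrementally as data streams in, which is where the MergeContent sizing and timing considerations James mentions come into play. The sketch ignores those and simply holds everything in memory.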
