Hello Mark
Thanks for your quick answer.
When I say "merge content based on a temporal window", it can be seen as two cases:
                1: Do the merge on a 5-minute (or any other) period, aligned with period boundaries (the beginning of the minute for seconds, the beginning of the hour for minutes, the beginning of the day for hours...). What I want is to merge the incoming data in bulks of 5 minutes. The only relation between the data and the merge is the arrival date, so this use case is simple as only ordered FlowFiles come in. It looks like an evolution of the merge processor: merge by windows of 5 minutes, with the windows aligned by some rule definition.
                2: Same as use case 1, but here the date to be used comes from the data, and there is potential out-of-order arrival to handle.
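To make the alignment rule in case 1 concrete, here is a minimal sketch in Python (outside of NiFi; the 5-minute window size and UTC alignment are my assumptions):

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60  # 5-minute window, aligned to the epoch

def window_start(ts: datetime, window_seconds: int = WINDOW_SECONDS) -> datetime:
    """Align a timestamp down to the start of its temporal window."""
    epoch = int(ts.timestamp())
    aligned = epoch - (epoch % window_seconds)
    return datetime.fromtimestamp(aligned, tz=timezone.utc)

# Example: 14:03:27 falls into the 14:00-14:05 window
ts = datetime(2017, 7, 20, 14, 3, 27, tzinfo=timezone.utc)
print(window_start(ts).isoformat())  # 2017-07-20T14:00:00+00:00
```

Writing the aligned window start onto each FlowFile as an attribute would then let every record in the same window carry the same correlation value.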

I have thought about using the correlation attribute name (though this won't handle the out-of-order case). It may work for the first case by adding an adjusted timestamp as an attribute. With this I'm sure to output FlowFiles whose content belongs to the same period, but I still don't know how to be sure to get one and only one FlowFile per period (which is a key requirement to do the aggregation using QueryRecord) without inducing some latency (the only way I've found is to set a very large number for min/max FlowFiles and a Max Bin Age larger than the wanted period, but this will induce some latency if I want to be safe). Do you have any other approach to handle that?
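For case 2, the usual pattern is to hold each window open for some allowed lateness before emitting it, which makes the latency trade-off explicit. A rough sketch of that idea (hypothetical, not NiFi code; window size and lateness are assumptions):

```python
from collections import defaultdict

WINDOW = 300    # 5-minute windows, in seconds of event time
LATENESS = 60   # keep a window open this long after it ends

class WindowBuffer:
    """Buffer out-of-order records into aligned windows; emit a window
    exactly once, when the watermark (max event time seen so far)
    passes the window's end plus the allowed lateness."""
    def __init__(self):
        self.windows = defaultdict(list)
        self.watermark = 0

    def add(self, event_time: int, record) -> list:
        self.windows[event_time - event_time % WINDOW].append(record)
        self.watermark = max(self.watermark, event_time)
        closed = [w for w in self.windows
                  if w + WINDOW + LATENESS <= self.watermark]
        # each closed window is popped, so it is emitted only once
        return [(w, self.windows.pop(w)) for w in sorted(closed)]
```

Records arriving after their window has already been emitted would still need a policy (drop, or route to a "late data" path); that is the part no static Min/Max + Max Bin Age setting fully solves.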


For the schema inference, I totally follow you; it does not seem realistic to infer a schema on a real-time stream. I was thinking more of a tool to help develop flows very quickly. It's nice to know you have that in mind, even with a very low priority.


Regards





From: Mark Payne [mailto:[email protected]]
Sent: 20 July 2017 15:00
To: [email protected]
Subject: Re: MergeContent Inquiry

Thierry,

I'm not sure that I understand what you mean when you say "is there a way to 
merge content based on temporal window."
Are you wanting to merge based on a rolling window, or a timestamp in the data? 
Can you explain a bit more about what
you want to do in terms of determining which data should go together?

re: QueryRecord, it is not based on Apache Drill. It is based on Apache 
Calcite. I do believe that Apache Calcite powers
Drill's SQL engine as well, but Calcite is just the SQL engine and does not do 
any sort of schema inference. At present,
you need to provide a schema for the data. If your data is in Avro, you can 
simply use the schema embedded in the data.
If the data is in CSV, you can derive the schema automatically from the header 
line (and assume that all fields are Strings).
Otherwise, you'll probably need to use the Schema Registry.
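The CSV case Mark describes amounts to something like the following (an illustrative sketch of deriving an all-strings, Avro-style schema from the header line; not NiFi's actual implementation):

```python
import csv
import io
import json

def schema_from_csv_header(csv_text: str, name: str = "record") -> dict:
    """Build an Avro-style record schema from a CSV header line,
    assuming every field is a string."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": col, "type": "string"} for col in header],
    }

schema = schema_from_csv_header("device,metric,value\nrtr1,cpu,42\n")
print(json.dumps(schema, indent=2))
```

This only works because a CSV header enumerates every field up front; free-form JSON offers no such guarantee, which is why the schema must come from elsewhere.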

I have considered implementing some sort of schema inference processor, but 
I've not put any sort of priority on it, simply
because in my experience schema inference is convenient when it works, but 
almost always some data will come in that
doesn't adhere properly to the inferred schema and the incorrect inference ends 
up costing more time than it would have
taken to simply create the schema in the first place. Additionally, the schema 
would have to be inferred for every FlowFile,
meaning quite a lot of overhead and inefficiency in doing that. That said, I do 
understand how it would be convenient in some cases,
but I've personally just not been able to prioritize getting something like 
that done. Certainly others in the community are
welcome to look into that.

Thanks
-Mark



On Jul 20, 2017, at 8:37 AM, Thierry Hanot <[email protected]> wrote:

Hello All
Additional question on this subject: is there a way to merge content based on a temporal window? AttributeRollingWindow does not help here.
In my context this would allow building an aggregation layer (it's for telemetry data which come in at different rates, and I need to normalize/aggregate those data). The flow may look like this:
                Receive telemetry data
                Merge content based on the type of data and a temporal window
                Aggregate the bulk of data using QueryRecord: normally this should be efficient as it's done per bulk.
                Then stream the result out (backend / MOM ...)

Of course all the aggregation should be dynamic, merging and generating the query based on attributes qualifying the type of the data and the aggregation which needs to be done.
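Generating the query from FlowFile attributes could look roughly like this (a sketch; the attribute names `metric.field` and `agg.function` are hypothetical, only the `FLOWFILE` table name comes from QueryRecord itself):

```python
ALLOWED = {"AVG", "MIN", "MAX", "SUM", "COUNT"}

def build_query(attributes: dict) -> str:
    """Build a per-bulk aggregation query from FlowFile attributes."""
    field = attributes["metric.field"]
    func = attributes["agg.function"].upper()
    if func not in ALLOWED:
        raise ValueError(f"unsupported aggregation: {func}")
    # FLOWFILE is the table name QueryRecord exposes for the incoming content
    return f"SELECT {func}({field}) AS {field}_{func.lower()} FROM FLOWFILE"

print(build_query({"metric.field": "cpu_load", "agg.function": "avg"}))
# SELECT AVG(cpu_load) AS cpu_load_avg FROM FLOWFILE
```

Whitelisting the aggregation function keeps attribute-driven query generation from turning into SQL injection.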
Additional question: if I understand correctly, QueryRecord is based on Drill, and Drill allows automatically inferring the schema from a JSON file. Is there a way to use this feature without going through the Schema Registry?

Thanks in advance.

Thierry Hanot




From: James McMahon [mailto:[email protected]]
Sent: 20 July 2017 14:04
To: [email protected]<mailto:[email protected]>
Subject: Re: MergeContent Inquiry

Outstanding. Thank you very much Joe.

On Thu, Jul 20, 2017 at 8:00 AM, Joe Witt <[email protected]> wrote:
Yep.  Very common.  Set the desired size or number of object targets
and set the 'Max Bin Age' so that it will kick out whatever you've got
by that time.

On Thu, Jul 20, 2017 at 7:38 AM, James McMahon <[email protected]> wrote:
> Good morning. I have a situation where I have a staging directory into which
> may be dropped a small number or a large multitude of files. My customer
> wants me to package these up - but in a size range. I see that MergeContent
> allows me to set a MinimumGroupSize and a MaximumGroupSize.
>
> If all the files total less than the MinimumGroupSize in MB, would
> MergeContent take no action until enough files arrived to cross the minimum
> threshold - ie, would it just sit and wait? Is it possible to combine the
> size thresholds with a time parameter so that if X time passes and no new
> files appear, the package is created despite falling short of the minimum
> size threshold?
>
> Thanks in advance once again for any insights. -Jim
