Hi,
When I try to index the data using batchSize=150, the default batchTimeout, and
a TimedRotationPolicy set to 30 minutes, Metron creates some JSON files in HDFS
with incomplete data: the last record in the HDFS file contains only a portion
of the JSON record. When I try to read the indexed data through a Hive external
table, it throws an exception because of the partial JSON in the file. So while
the data is still streaming, I am not able to do any operation on the indexed data.
Example indexed file:
{"key1":"value1","key2":"value2","key3":"value3"}
{"key1":"value1","key2":"value2","key3"
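The effect of the truncated last record can be shown outside Hive with a short Python sketch (the two lines are the example records above; any strict JSON reader behaves the same way):

```python
import json

# The two example lines above: one complete record, one truncated mid-record.
lines = [
    '{"key1":"value1","key2":"value2","key3":"value3"}',
    '{"key1":"value1","key2":"value2","key3"',
]

results = []
for line in lines:
    try:
        results.append(json.loads(line))
    except json.JSONDecodeError:
        results.append(None)  # the truncated trailing record cannot be parsed

print(results)
```

The first line parses to a dict; the second raises a parse error, which is the same failure Hive's JSON SerDe hits when it reaches the partial trailing record.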
When I tried to find the root cause of this behavior, I came across the
following observations:
1. Metron flushes the data to HDFS based on a CountSyncPolicy; by default its
count is set to the batchSize.
2. When Metron performs the file rotation, it first closes the current file,
which also results in a flush to HDFS.
3. Regardless of the batchSize, Metron writes the data to HDFS after the
batchTimeout expires.
4. The CountSyncPolicy has no relation to the batchTimeout: even if the
batchTimeout expires and Metron writes the data to HDFS, it does not trigger a
sync; it still waits for the message count to reach the CountSyncPolicy
threshold. Is this behavior intentional? Without the sync, the end user cannot
read the data completely, which defeats the purpose of the batchTimeout.
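Observations 1 and 4 can be summarized with a toy model (this is a simplified illustration of the interaction, not Metron's actual writer code; the class name mirrors the Storm HDFS sync-policy concept but the logic here is my own sketch):

```python
# Simplified model of a count-based sync policy, assuming the semantics
# described in the observations above (illustration only).
class CountSyncPolicy:
    def __init__(self, count):
        self.count = count      # defaults to batchSize, per observation 1
        self.executed = 0

    def mark(self):
        self.executed += 1
        return self.executed >= self.count   # sync only at the threshold

policy = CountSyncPolicy(150)

# Suppose the batchTimeout expires after only 100 messages: the data is
# written to HDFS, but the sync policy never reached its threshold, so no
# sync happens and readers can still see a partial trailing record.
synced = any(policy.mark() for _ in range(100))
print(synced)
```

With batchSize=150 and 100 messages written at timeout, `synced` is False, which matches the partial-record behavior described above.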
Due to the amount of data I am writing, I cannot set the CountSyncPolicy count
to 1, as that would hurt performance.
Currently our indexing directory structure is "yyyy/MM/dd". I need to do some
operations on the newly indexed data based on a sliding window; it is currently
configured with max_window = 1 and a window size of one hour. Every hour I move
the window to current_window_hour + 1. While the data is still streaming, I hit
the JSON format error in Hive.
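For reference, resolving the window against the yyyy/MM/dd layout can be sketched as follows (the base path and helper are hypothetical; the idea of pointing the window at the previous, fully-written hour is only a possible workaround, assuming rotation keeps the open file within the current hour):

```python
from datetime import datetime, timedelta

def window_path(base, window_start):
    # Day partition directory for this window, following the
    # "yyyy/MM/dd" layout described above.
    return f"{base}/{window_start:%Y/%m/%d}"

# Hypothetical workaround: read the previous, fully-written hour rather
# than the current one, so a still-open file is never scanned.
now = datetime(2018, 5, 3, 10, 20)
previous_hour = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
print(window_path("/apps/metron/indexing", previous_hour))
```

This only narrows the exposure; it does not fix the missing sync itself.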
Can you suggest any methods to overcome this issue?