[jira] [Updated] (AIRFLOW-7110) Refactor the way logs are written to/read from Elasticsearch

2020-03-25 Thread Jira


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noël BARDELOT updated AIRFLOW-7110:
---
Description: 
See:

* in the configuration,
  - under section 'core' option named 'remote_logging'
  - under section 'elasticsearch' options named 'host', 'log_id_template',
    'end_of_log_mark', 'frontend', 'write_stdout' and 'json_format'
* in the documentation,
  - https://airflow.apache.org/docs/stable/howto/write-logs.html
* in the Airflow code,
  - BaseOperator in airflow/models/baseoperator.py
  - ElasticsearchTaskHandler in airflow/utils/log/es_task_handler.py
  - the task detail view in airflow/www/views.py

Current behaviour:

If one sets the options as described in the documentation so that task logs are written to the worker's stdout in JSON format and then shipped to Elasticsearch (with ELK or an equivalent stack), the 'log_id' field needed by the handler and the view is not written to the logs by default.

One needs to adapt the log collection from the worker's container stdout in order to provide this 'log_id' field, which aggregates other usual fields (see 'log_id_template', which the handler uses to request logs from Elasticsearch for display in the view).

An example for fluentd:

{code}
# Also add log_id to airflow.task logs
# (the 'airflow.task' match pattern is assumed from the comment above)
<filter airflow.task>
  @type record_transformer
  enable_ruby true
  <record>
    log_id ${record["dag_id"]}_${record["task_id"]}_${record["execution_date"]}_${record["try_number"]}
    offset ${time = Time.now; time.to_i * (10 ** 9) + time.nsec}
  </record>
</filter>
{code}
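
On the read side, ElasticsearchTaskHandler renders the same identifier from 'log_id_template' and queries Elasticsearch with it. A rough Python sketch of that lookup, assuming the underscore-separated template used above; the function name, index pattern and field names are illustrative, not the handler's actual API:

{code}
from elasticsearch import Elasticsearch

# Matches the fluentd example above; the real default template may differ.
LOG_ID_TEMPLATE = "{dag_id}_{task_id}_{execution_date}_{try_number}"


def read_task_log(client: Elasticsearch, dag_id, task_id, execution_date, try_number):
    """Fetch the log lines of one task try, identified by its rendered log_id."""
    log_id = LOG_ID_TEMPLATE.format(
        dag_id=dag_id,
        task_id=task_id,
        execution_date=execution_date,
        try_number=try_number,
    )
    body = {
        "query": {"match_phrase": {"log_id": log_id}},
        "sort": [{"offset": {"order": "asc"}}],  # 'offset' restores line order
    }
    # "fluentd-*" is an assumed index pattern, not an Airflow default.
    response = client.search(index="fluentd-*", body=body, size=1000)
    return [hit["_source"]["message"] for hit in response["hits"]["hits"]]
{code}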

What we expect:

It should be the job of the operator to log the needed 'log_id' field (Airflow reads it and needs it, so Airflow should write it itself). That should be done in BaseOperator, so that it is available to all inheriting subclasses. The field could, for example, be added only when 'remote_logging' is set to True, in order to avoid writing more log data than necessary.
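
As a minimal sketch of that idea, a logging.Filter could stamp 'log_id' onto every record emitted during a task try. The filter class and the attach helper below are hypothetical; only the configuration lookup and the TaskInstance attributes are existing Airflow API:

{code}
import logging

from airflow.configuration import conf


class LogIdFilter(logging.Filter):
    """Add the 'log_id' expected by the Elasticsearch handler to every record,
    so that no external log collector has to synthesize it."""

    def __init__(self, dag_id, task_id, execution_date, try_number):
        super().__init__()
        self.log_id = "{}_{}_{}_{}".format(dag_id, task_id, execution_date, try_number)

    def filter(self, record):
        record.log_id = self.log_id  # picked up by the JSON formatter
        return True


def attach_log_id_filter(task_instance):
    """Hypothetical hook that BaseOperator (or the task runner) would call once
    per try, and only when remote logging is actually enabled."""
    if conf.getboolean("core", "remote_logging"):
        logging.getLogger("airflow.task").addFilter(
            LogIdFilter(
                task_instance.dag_id,
                task_instance.task_id,
                task_instance.execution_date.isoformat(),
                task_instance.try_number,
            )
        )
{code}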

Plus, there is no benefit in making 'log_id_template' configurable. As it stands, no value other than the current one can work, and nobody should ever modify it (if someone did change that value, the view would not work anyway).

Plus, the 'frontend' configuration option should not presume that 'https://' is the protocol to use when building the link to the external log viewer (Kibana or another tool). The 'frontend' option should contain the whole URL, templated with 'log_id'.
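
A sketch of what the view could do if 'frontend' held the whole URL; the section and option names exist in the configuration, while the example value and the helper function are assumptions:

{code}
from airflow.configuration import conf

# Hypothetical airflow.cfg value carrying its own scheme, host and path:
#   [elasticsearch]
#   frontend = http://kibana.example.com:5601/app/kibana#/discover?_a=(query:'log_id:"{log_id}"')


def external_log_url(log_id):
    """Build the external log viewer link from a fully templated 'frontend'
    URL instead of prefixing a hard-coded 'https://'."""
    template = conf.get("elasticsearch", "frontend")
    return template.format(log_id=log_id)
{code}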

Plus, all of this behaviour should be documented much more thoroughly in the 'write-logs' documentation.

Plus, a DAG example should be added to Airflow's existing examples, in order to show how a subclass of BaseOperator should use the logging mechanism to write a log.
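
A hedged sketch of what such an example could look like (this operator and DAG do not exist in Airflow; they only illustrate logging through self.log, which is routed to the configured task handler):

{code}
from airflow import DAG
from airflow.models import BaseOperator
from airflow.utils.dates import days_ago


class GreetOperator(BaseOperator):
    """Toy operator showing that subclasses should log through self.log."""

    def execute(self, context):
        # These lines end up in Elasticsearch when remote logging is enabled.
        self.log.info("Greetings from task %s of dag %s",
                      context["task"].task_id, context["dag"].dag_id)


with DAG(dag_id="example_elasticsearch_logging",
         schedule_interval=None,
         start_date=days_ago(1)) as dag:
    GreetOperator(task_id="greet")
{code}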

Plus, Elasticsearch's 'host' option in Airflow's configuration is misnamed and misdocumented: it is not just a host, but a server address consisting of a host and a port.


