[jira] [Comment Edited] (STORM-2355) Storm-HDFS: inotify support

2017-03-17 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929183#comment-15929183
 ] 

Roshan Naik edited comment on STORM-2355 at 3/17/17 9:23 PM:
-

First of all, thanks for your work on this.

I took a look at the HDFS inotify side of things and also spoke to an HDFS 
committer. These overarching concerns came up (partly alluded to earlier by 
you):

1. Currently inotify is restricted to HDFS admins because it doesn't scale (with 
respect to the namenode). So in its current state it seems unsuitable for the 
HDFS Spout kind of use case, even if we (unrealistically) asked users to run 
the Storm worker as an HDFS admin user.
2. The proposal for scaling inotify and opening it up to end users appears to 
have stalled for some time now. 

Although we are seeing some improvements (that you noted) in resource 
utilization on the Storm side, it does not seem advisable from the HDFS namenode 
perspective. I think this feature in HDFS Spout would be useful once a 
scalable inotify solution is made publicly available by HDFS.

The other option is to get this into Storm now and not use it until HDFS 
implements its scalable inotify. My concern with that is that we can't bet with 
certainty that the final inotify will still work as we now expect it to (although 
the intent is there); it may even change in an incompatible way. 

Either way, it appears to be a feature that can't be used until the scalable 
inotify happens (if it happens).


> Storm-HDFS: inotify support
> ---
>
> Key: STORM-2355
> URL: https://issues.apache.org/jira/browse/STORM-2355
> Project: Apache Storm
>  Issue Type: New Feature
>  Components: storm-hdfs
>Reporter: Tibor Kiss
>Assignee: Tibor Kiss
> Fix For: 2.0.0, 1.1.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> This is a proposal to implement inotify based watch dir monitoring in 
> Storm-HDFS Spout.
> *Motivation*
> Storm-HDFS's HdfsSpout currently polls the Spout’s input directory using 
> Hadoop's {{FileSystem.listFiles()}}. This operation is expensive since it 
> returns the block locations and all stat information of the files inside the 
> watch directory. Moreover, HdfsSpout currently uses only one element of the 
> returned Path list, which is inefficient as the rest of the entries are thrown 
> away without processing.
> The proposed design provides greater efficiency through the inotify interface 
> and also enables easier extension of the original ({{listFiles()}} based) 
> monitoring with buffering (see Further work section below). 
> *High level design*
> The goal is to leverage the [HDFS inotify 
> API|http://hadoop.apache.org/docs/current/api//org/apache/hadoop/hdfs/DFSInotifyEventInputStream.html]
>  to monitor new file arrivals in HdfsSpout's input directory.
> The inotify based monitoring is an addition to the original 
> {{FileSystem.listFiles()}} based implementation; the default behavior of the 
> spout is unchanged by this modification.
> To unify the two monitoring methods and enable buffering, an iterator based 
> class ({{HdfsDirectoryMonitor}}) is created.
> To retain backward compatibility, the HdfsSpout's default monitoring behavior 
> is unchanged; inotify based monitoring can be enabled through a parameter.
> As inotify requires administrative privileges (see Caveat section below), a 
> fallback mechanism is implemented in HdfsSpout to use the original 
> {{listFiles()}} based monitoring if initialization fails for inotify based 
> monitoring.
> *Implementation 
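
The fallback described in the high level design above could look roughly like
the sketch below. It is only an illustration under stated assumptions: the
Monitor interface, the choose() helper and the strategy names are hypothetical
and are not taken from the linked PR or from the Storm or Hadoop APIs.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.Supplier;

/** Sketch only: hypothetical names, not the HdfsSpout implementation. */
final class MonitorFallbackSketch {

    /** Hypothetical marker for whichever monitoring strategy ends up active. */
    interface Monitor { String name(); }

    /**
     * Tries the inotify based monitor when it is enabled and falls back to the
     * listing based monitor if its initialization fails for any reason.
     */
    static Monitor choose(boolean inotifyEnabled,
                          Callable<Monitor> inotifyInit,
                          Supplier<Monitor> listingInit) {
        if (inotifyEnabled) {
            try {
                return inotifyInit.call();
            } catch (Exception e) {
                // For example, the worker is not running as an HDFS administrator.
                System.err.println("inotify unavailable, falling back: " + e.getMessage());
            }
        }
        return listingInit.get();
    }

    public static void main(String[] args) {
        Monitor listing = () -> "listFiles";
        // Simulate an initialization failure of the inotify based monitor.
        Monitor active = choose(true,
                () -> { throw new IOException("access denied"); },
                () -> listing);
        System.out.println(active.name());
    }
}
```

Keeping the decision in one place means the spout's default behavior stays
untouched whenever the parameter is off or initialization fails.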

[jira] [Comment Edited] (STORM-2355) Storm-HDFS: inotify support

2017-02-23 Thread Tibor Kiss (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878186#comment-15878186
 ] 

Tibor Kiss edited comment on STORM-2355 at 2/24/17 6:29 AM:


[~roshan_naik]: Thanks for taking a look at this issue, and sorry for not coming 
back earlier (I was out of office).
As you requested, I've extended the description of the proposal to try to 
address your questions.

While extending it I realized that I should make two improvements to my 
original implementation:
1) {{listFiles()}} should be called during the initialization to ensure that 
leftover files are also processed
-2) The number of calls to inotify's {{poll()}} could be further reduced.-

I'll extend my PR with these improvements soon.

Thanks for your feedback!


> Storm-HDFS: inotify support
> ---
>
> Key: STORM-2355
> URL: https://issues.apache.org/jira/browse/STORM-2355
> Project: Apache Storm
>  Issue Type: New Feature
>  Components: storm-hdfs
>Reporter: Tibor Kiss
>Assignee: Tibor Kiss
> Fix For: 2.0.0, 1.1.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to implement inotify based watch dir monitoring in 
> Storm-HDFS Spout.
> *Motivation*
> Storm-HDFS's HdfsSpout currently polls the Spout’s input directory using 
> Hadoop's {{FileSystem.listFiles()}}. This operation is expensive since it 
> returns the block locations and all stat information of the files inside the 
> watch directory. Moreover, HdfsSpout currently uses only one element of the 
> returned Path list, which is inefficient as the rest of the entries are thrown 
> away without processing.
> The proposed design provides greater efficiency through the inotify interface 
> and also enables easier extension of the original ({{listFiles()}} based) 
> monitoring with buffering (see Further work section below). 
> *High level design*
> The goal is to leverage the [HDFS inotify 
> API|http://hadoop.apache.org/docs/current/api//org/apache/hadoop/hdfs/DFSInotifyEventInputStream.html]
>  to monitor new file arrivals in HdfsSpout's input directory.
> The inotify based monitoring is an addition to the original 
> {{FileSystem.listFiles()}} based implementation; the default behavior of the 
> spout is unchanged by this modification.
> To unify the two monitoring methods and enable buffering, an iterator based 
> class ({{HdfsDirectoryMonitor}}) is created.
> To retain backward compatibility, the HdfsSpout's default monitoring behavior 
> is unchanged; inotify based monitoring can be enabled through a parameter.
> As inotify requires administrative privileges (see Caveat section below), a 
> fallback mechanism is implemented in HdfsSpout to use the original 
> {{listFiles()}} based monitoring if initialization fails for inotify based 
> monitoring.
> *Implementation details*
> As inotify provides only a delta of the filesystem events from a given Tx Id 
> (of the HDFS edit log), a {{FileSystem.listFiles()}} based collection is 
> required during the Spout's initialization to ensure that any leftover 
> files are processed.
> The inotify based implementation uses HdfsAdmin's 
> [{{DFSInotifyEventInputStream.poll()}}|http://hadoop.apache.org/docs/current/api//org/apache/hadoop/hdfs/DFSInotifyEventInputStream.html#poll--]
>  method to fetch the list of new files created since the provided 
> Tx Id and buffer it in the {{newFileList}} buffer.
> During each {{HdfsSpout.nextTuple()}} call, one element is taken from the 
> {{newFileList}} buffer and processed by the spout.
> The {{newFileList}} buffer is extended with the result of the 
> {{DFSInotifyEventInputStream.poll(lastTxId)}} call in every {{nextTuple()}} call.
> Since HdfsSpout is able to create its own {{HdfsAdmin()}} instance, there 
> will be no need for the user to do additional initialization for the spout 
> even if inotify is enabled.
> *Caveat*
> HDFS inotify is currently available to the HDFS administrator user only, but 
> there is an ongoing discussion in the Hadoop community to extend its support 
> to regular users. See: HDFS-8940 
> *Further work*
> 1) The number of calls to {{DFSInotifyEventInputStream.poll(lastTxId)}} could 
> be further reduced if the locking directory is moved away from the 
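
The flow from the Implementation details section above (seed the buffer with
an initial listing, then extend it with the event delta on every call) can be
sketched as follows. This is a minimal sketch under stated assumptions, not
the code from the PR: Lister, EventSource and Created are hypothetical
stand-ins for the listing call and the inotify event stream, so that the
example runs without a Hadoop cluster.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

/** Sketch only: every type here is a hypothetical stand-in, not a Storm or Hadoop API. */
final class InotifyMonitorSketch {

    /** Stand-in for the one-off directory listing done at spout initialization. */
    interface Lister { List<String> list(); }

    /** A "file created" event tagged with the transaction id that produced it. */
    static final class Created {
        final long txId;
        final String path;
        Created(long txId, String path) { this.txId = txId; this.path = path; }
    }

    /** Stand-in for the event stream: creation events with txId greater than sinceTxId. */
    interface EventSource { List<Created> poll(long sinceTxId); }

    private final Queue<String> newFileList = new ArrayDeque<>();
    private final EventSource events;
    private long lastTxId;

    InotifyMonitorSketch(Lister lister, EventSource events, long startTxId) {
        this.events = events;
        this.lastTxId = startTxId;
        // Events only cover changes after startTxId, so seed the buffer with leftover files.
        newFileList.addAll(lister.list());
    }

    /** Meant to be called once per nextTuple(): fetch the delta, then hand out one path. */
    String next() {
        for (Created c : events.poll(lastTxId)) {
            newFileList.add(c.path);
            lastTxId = Math.max(lastTxId, c.txId);
        }
        return newFileList.poll(); // null when nothing is pending
    }

    public static void main(String[] args) {
        Lister lister = () -> Arrays.asList("/in/leftover");
        EventSource source = since -> since < 7
                ? Arrays.asList(new Created(7L, "/in/new"))
                : Collections.<Created>emptyList();
        InotifyMonitorSketch monitor = new InotifyMonitorSketch(lister, source, 0L);
        System.out.println(monitor.next());
        System.out.println(monitor.next());
    }
}
```

The sketch does not deduplicate: a file created between the initial listing
and the starting Tx Id could appear twice, which a real implementation would
have to handle.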

[jira] [Comment Edited] (STORM-2355) Storm-HDFS: inotify support

2017-02-12 Thread Tibor Kiss (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862722#comment-15862722
 ] 

Tibor Kiss edited comment on STORM-2355 at 2/12/17 10:13 AM:
-

The initial implementation for the 1.0.x-branch can be found here: 
https://github.com/tibkiss/storm/commit/d916d6f904ea085ebdaf5ada2a9c0607794d3c50

Note that I needed to lower the Guava version to 14.0.1 to be compatible with HDFS.
I have also bumped the Hadoop version to 2.7.3.

The implementation was tested using UTs and in a three-node dockerized cluster 
using Flux and a simple passthrough topology via Storm-Spout & Storm Bolt. 
With inotify, the load on HDFS was reduced by 15%. Nonetheless, a more precise 
performance measurement in a non-dockerized environment would still be needed.


> Storm-HDFS: inotify support
> ---
>
> Key: STORM-2355
> URL: https://issues.apache.org/jira/browse/STORM-2355
> Project: Apache Storm
>  Issue Type: New Feature
>  Components: storm-hdfs
>Reporter: Tibor Kiss
>Assignee: Tibor Kiss
> Fix For: 2.0.0, 1.1.0
>
>
> This is a proposal to implement inotify based watch dir monitoring in 
> Storm-HDFS Spout.
> *Motivation*
> Storm-HDFS currently polls the input directory using Hadoop's 
> {{FileSystem.listFiles}}. This operation is expensive since it returns the 
> block locations and all stat information of the files inside the watch 
> directory. Storm-HDFS currently uses the Path of only one element of the 
> returned list, which is inefficient.
> *Proposed improvement*
> Provide a way to monitor the input directory through HDFS's inotify API.
> In order to have backward compatibility with the poll based solution, I 
> propose a new class ({{HdfsDirectoryMonitor}}) which implements both the 
> inotify and poll based solutions through an iterator. The user can enable 
> inotify based monitoring through a configuration parameter.
> *Caveat*
> HDFS inotify is currently only available to the HDFS administrator user, but 
> there is an ongoing discussion in the Hadoop community to extend its support 
> to regular users. See: HDFS-8940 
> *Testing related changes*
> The {{TestHdfsSpout}} test case should be parameterized to check both the 
> poll and inotify based solutions.
> *Further work*
> If the design is accepted, the poll based solution could easily be improved 
> through {{HdfsDirectoryMonitor}} to properly use all the returned items from 
> the work directory (similar to the inotify based solution). Such an improvement 
> would reduce the number of calls made to {{FileSystem.listFiles}}.
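
The improvement described under Further work above (use every item of a
listing before listing again) might look like the sketch below. It is an
illustration only: BufferedDirectoryMonitor is a hypothetical name, the
Supplier stands in for a listFiles() style call, and the sketch assumes that
consumed files are moved out of the input directory before the next refill.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.function.Supplier;

/** Sketch only: a hypothetical buffering iterator, not the class from the linked commit. */
final class BufferedDirectoryMonitor implements Iterator<String> {

    private final Supplier<List<String>> listing; // stand-in for a listFiles() style call
    private final Queue<String> buffer = new ArrayDeque<>();
    private int listingCalls = 0;

    BufferedDirectoryMonitor(Supplier<List<String>> listing) {
        this.listing = listing;
    }

    @Override
    public boolean hasNext() {
        if (buffer.isEmpty()) {
            // Refill only after every buffered path has been handed out.
            listingCalls++;
            buffer.addAll(listing.get());
        }
        return !buffer.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no files in the input directory");
        }
        return buffer.remove();
    }

    /** Number of listing calls made so far, exposed to show the saving. */
    int listingCalls() {
        return listingCalls;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("/in/a", "/in/b", "/in/c");
        BufferedDirectoryMonitor monitor = new BufferedDirectoryMonitor(() -> files);
        for (int i = 0; i < files.size(); i++) {
            System.out.println(monitor.next());
        }
        System.out.println("listing calls: " + monitor.listingCalls());
    }
}
```

With three files in the directory, the three next() calls share a single
listing instead of triggering one listing each.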



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)