[jira] [Updated] (CONNECTORS-1364) Better bin naming in the Shared Drive Connector

2017-01-13 Thread Aeham Abushwashi (JIRA)

[ https://issues.apache.org/jira/browse/CONNECTORS-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aeham Abushwashi updated CONNECTORS-1364:
-
Attachment: CONNECTORS-1364.git.v2.patch

It’s a fair comment. In my use case, a client application talks to ManifoldCF 
through the API, so I have to implement this logic either way. I figured it’d 
be useful for others too, but perhaps other advanced users would prefer to use 
their own bin naming convention. 
I could see a future use for passing share and root folder in to the repo 
connector, but I think it’d be better to introduce those as first-class 
citizens, not optional parameters, should the need for them ever arise.

Here’s an updated patch.

> Better bin naming in the Shared Drive Connector
> ---
>
> Key: CONNECTORS-1364
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1364
> Project: ManifoldCF
> Issue Type: Improvement
> Components: JCIFS connector
> Affects Versions: ManifoldCF 1.9
> Reporter: Aeham Abushwashi
> Assignee: Karl Wright
> Fix For: ManifoldCF 2.7
>
> Attachments: CONNECTORS-1364.git.patch, CONNECTORS-1364.git.v2.patch
>
>
> Hello and happy new year!
> Bin naming in the Shared Drive Connector makes assumptions that are not 
> always valid. 
> As I understand it, Manifold uses bins to prevent overloading data sources. 
> In the SDC, server name is designated as bin name. All jobs created against a 
> particular server will be treated as one unit when documents are prioritised, 
> which can severely disadvantage some jobs (e.g. late starters). 
> Moreover, this is incompatible with some common enterprise server topologies. 
> In Windows DFS, which is widely used in large enterprises, what the SDC 
> thinks of as a server name isn’t actually a physical resource. It’s a 
> namespace that can span many servers and shares. In this case, it doesn’t 
> make sense to throttle simply on the root ‘server’ name. In other 
> environments, a powerful storage server can be more than capable of handling 
> high crawl load; overzealous throttling can end up limiting/hurting 
> Manifold’s performance there.
> I’m struggling to find a single solution that fits all, so I’m leaning towards 
> passing in to the repo connection config some sort of server topology flag or 
> throttling depth flag as a hint that SharedDriveConnector#getBinNames can use 
> to decide whether the bin name should be server, server+share or 
> server+share+root_folder. Share and root_folder would need to be explicitly 
> passed in the repo config too or extracted from the documentIdentifier arg in 
> getBinNames (assuming it's reliable).
> Thoughts?
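
The throttling-depth idea in the description can be sketched roughly as below. This is only an illustration, not the code in the attached patches: the class name, the `binName` helper, the integer `depth` flag and the assumption that document identifiers look like `smb://server/share/folder/file` are all hypothetical. It also assumes that a final path segment without a trailing slash is a file rather than a folder.

```java
import java.util.ArrayList;
import java.util.List;

public class BinNameSketch {

    /**
     * Hypothetical helper: derive a bin name from a document identifier.
     * depth 1 = server, 2 = server+share, 3 = server+share+root_folder.
     * Assumes identifiers of the form smb://server/share/folder/file.
     */
    public static String binName(String documentIdentifier, int depth) {
        String path = documentIdentifier;
        int scheme = path.indexOf("://");
        if (scheme >= 0) {
            path = path.substring(scheme + 3);
        }

        // Collect the non-empty path segments: server, share, folders, file.
        List<String> segments = new ArrayList<>();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }

        // Assumption: below the share, a last segment without a trailing
        // slash is a file name, so it must not be used as a root folder.
        if (!path.endsWith("/") && segments.size() > 2) {
            segments.remove(segments.size() - 1);
        }

        // Clamp the requested depth to what the identifier actually contains.
        int wanted = Math.min(Math.max(1, depth), segments.size());
        return String.join("/", segments.subList(0, wanted));
    }

    public static void main(String[] args) {
        String id = "smb://fileserver/projects/alpha/report.doc";
        for (int depth = 1; depth <= 3; depth++) {
            System.out.println(depth + " -> " + binName(id, depth));
        }
    }
}
```

With a deeper bin name, jobs that target different shares or root folders under the same DFS namespace would fall into different bins instead of being prioritised as one unit.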



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CONNECTORS-1364) Better bin naming in the Shared Drive Connector

2017-01-12 Thread Aeham Abushwashi (JIRA)

[ https://issues.apache.org/jira/browse/CONNECTORS-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aeham Abushwashi updated CONNECTORS-1364:
-
Attachment: CONNECTORS-1364.git.patch

Patch attached. 
In addition to configurable bin names in the JCIFS connection, I’ve made the 
number of docs requested by the priority thread configurable. This was 
previously hard-coded to 1000.
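
For readers without the patch to hand, a configurable batch size with the old value as the fallback might look something like the sketch below. The property name and the class are made up for illustration; the actual property introduced by the patch is not shown in this thread, and the sketch uses plain `java.util.Properties` rather than ManifoldCF's own configuration classes.

```java
import java.util.Properties;

public class PriorityBatchSizeSketch {

    // The previously hard-coded value, kept as the default.
    static final int DEFAULT_BATCH_SIZE = 1000;

    // Hypothetical property name, not taken from the patch.
    static final String PROPERTY_NAME = "example.prioritythread.batchsize";

    /** Returns the configured batch size, or the default if unset or invalid. */
    public static int batchSize(Properties props) {
        String raw = props.getProperty(PROPERTY_NAME);
        if (raw == null) {
            return DEFAULT_BATCH_SIZE;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Reject zero and negative sizes rather than stalling the thread.
            return value > 0 ? value : DEFAULT_BATCH_SIZE;
        } catch (NumberFormatException e) {
            return DEFAULT_BATCH_SIZE;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(batchSize(props));   // falls back to the default
        props.setProperty(PROPERTY_NAME, "250");
        System.out.println(batchSize(props));   // uses the configured value
    }
}
```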

> Better bin naming in the Shared Drive Connector
> ---
>
> Key: CONNECTORS-1364
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1364
> Project: ManifoldCF
> Issue Type: Improvement
> Components: JCIFS connector
> Affects Versions: ManifoldCF 1.9
> Reporter: Aeham Abushwashi
> Assignee: Karl Wright
> Fix For: ManifoldCF 2.7
>
> Attachments: CONNECTORS-1364.git.patch
>
>


[jira] [Updated] (CONNECTORS-1364) Better bin naming in the Shared Drive Connector

2017-01-06 Thread Karl Wright (JIRA)

[ https://issues.apache.org/jira/browse/CONNECTORS-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1364:

Fix Version/s: ManifoldCF 2.7

> Better bin naming in the Shared Drive Connector
> ---
>
> Key: CONNECTORS-1364
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1364
> Project: ManifoldCF
> Issue Type: Improvement
> Components: JCIFS connector
> Affects Versions: ManifoldCF 1.9
> Reporter: Aeham Abushwashi
> Assignee: Karl Wright
> Fix For: ManifoldCF 2.7
>
>


[jira] [Updated] (CONNECTORS-1364) Better bin naming in the Shared Drive Connector

2017-01-06 Thread Aeham Abushwashi (JIRA)

[ https://issues.apache.org/jira/browse/CONNECTORS-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aeham Abushwashi updated CONNECTORS-1364:
-
Description: 
Hello and happy new year!

Bin naming in the Shared Drive Connector makes assumptions that are not always 
valid. 

As I understand it, Manifold uses bins to prevent overloading data sources. In 
the SDC, server name is designated as bin name. All jobs created against a 
particular server will be treated as one unit when documents are prioritised, 
which can severely disadvantage some jobs (e.g. late starters). 
Moreover, this is incompatible with some common enterprise server topologies. 
In Windows DFS, which is widely used in large enterprises, what the SDC thinks 
of as a server name isn’t actually a physical resource. It’s a namespace that 
can span many servers and shares. In this case, it doesn’t make sense to 
throttle simply on the root ‘server’ name. In other environments, a powerful 
storage server can be more than capable of handling high crawl load; 
overzealous throttling can end up limiting/hurting Manifold’s performance there.

I’m struggling to find a single solution that fits all, so I’m leaning towards 
passing in to the repo connection config some sort of server topology flag or 
throttling depth flag as a hint that SharedDriveConnector#getBinNames can use to 
decide whether the bin name should be server, server+share or 
server+share+root_folder. Share and root_folder would need to be explicitly 
passed in the repo config too or extracted from the documentIdentifier arg in 
getBinNames (assuming it's reliable).

Thoughts?

  was:
Hello and happy new year!

Bin naming in the Shared Drive Connector makes assumptions that are not always 
valid. 

As I understand it, Manifold uses bins to prevent overloading data sources. In 
the SDC, server name is designated as bin name. All jobs created against a 
particular server will be treated as one unit when documents are prioritised, 
which can severely disadvantage some jobs (e.g. late starters). 
Moreover, this is incompatible with some common enterprise server topologies. 
In Windows DFS, which is widely used in large enterprises, what the SDC thinks 
of as a server name isn’t actually a physical resource. It’s a namespace that 
can span many servers and shares. In this case, it doesn’t make sense to 
throttle simply on the root ‘server’ name. In other environments, a powerful 
storage server can be more than capable of handling high crawl load; 
overzealous throttling can end up limiting/hurting Manifold’s performance there.

I’m struggling to find a single solution that fits all, so I’m leaning towards 
passing in to the repo connection config some sort of server topology flag or 
throttling depth flag as a hint that SharedDriveConnector#getBinNames can use to 
decide whether the bin name should be server, server+share or 
server+share+root_folder. Share and root_folder would need to be explicitly 
passed in the repo config too.

Thoughts?


> Better bin naming in the Shared Drive Connector
> ---
>
> Key: CONNECTORS-1364
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1364
> Project: ManifoldCF
> Issue Type: Improvement
> Components: JCIFS connector
> Affects Versions: ManifoldCF 1.9
> Reporter: Aeham Abushwashi
>