[jira] [Comment Edited] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-06-18 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510788#comment-16510788
 ] 

Sebastian Nagel edited comment on NUTCH-2578 at 6/18/18 4:19 PM:
-

Merged. I'll test this fix and also TIKA-2645 (using tika-core 1.19-SNAPSHOT) 
during the next two days. Thanks, [~yossi] and [~talli...@apache.org]!


was (Author: wastl-nagel):
Merged. I'll test this fix and also TIKA-2658 (using tika-core 1.19-SNAPSHOT) 
during the next two days. Thanks, [~yossi] and [~talli...@apache.org]!

> Avoid lock by MimeUtil in constructor of protocol.Content
> -
>
> Key: NUTCH-2578
> URL: https://issues.apache.org/jira/browse/NUTCH-2578
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new 
> MimeUtil object. That's not cheap as it always creates a new Tika object and 
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x7f70523c3800 nid=0x1de2 waiting for monitor entry [0x7f70193a8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
>         - waiting to lock <0x0005e0285758> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:240)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>         at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
>         at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
>         at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
>         at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
>         at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
>         at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
>         at java.util.Collections.list(Collections.java:5239)
>         at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
>         at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
>         at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
>         at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
>         at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
>         at org.apache.tika.Tika.<init>(Tika.java:116)
>         at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
>         at org.apache.nutch.protocol.Content.<init>(Content.java:83)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads, this may cause a significant bottleneck: 
> running a Fetcher with 120 threads, I've found up to 50 threads waiting for 
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
>   | grep -A25 'waiting to lock' \
>   | grep -F 'org.apache.tika.Tika.' \
>   | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika 
> detector|https://www.mail-archive.com/user@tika.apache.org/msg00296.html], 
> the best solution seems to be to cache the MimeUtil object in the actual 
> protocol implementation, as is done in Nutch 2.x ([lib-http HttpBase, line 
> #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
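
The caching idea described above can be sketched roughly as follows. This is only an illustration, not the actual Nutch patch: ExpensiveDetector, ContentSketch and ProtocolSketch are hypothetical stand-ins for MimeUtil, protocol.Content and a protocol implementation such as HttpBase, and the counter exists only to make the number of constructions visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive, thread-safe helper such as MimeUtil. */
final class ExpensiveDetector {
  // Counts constructions so the effect of caching is visible.
  static final AtomicInteger CONSTRUCTED = new AtomicInteger();

  ExpensiveDetector() {
    // A real helper would read its configuration here (the costly, lock-prone part).
    CONSTRUCTED.incrementAndGet();
  }

  /** Toy detection by suffix; thread-safe because it keeps no mutable state. */
  String detect(String url) {
    return url.endsWith(".xml") ? "application/xml" : "text/html";
  }
}

/** Hypothetical stand-in for protocol.Content: receives the helper instead of creating it. */
final class ContentSketch {
  final String url;
  final String contentType;

  ContentSketch(String url, ExpensiveDetector detector) {
    this.url = url;
    this.contentType = detector.detect(url);
  }
}

/** Hypothetical protocol implementation: one instance is shared by all fetcher threads. */
final class ProtocolSketch {
  // Created once per protocol instance and reused for every fetched document.
  private final ExpensiveDetector detector = new ExpensiveDetector();

  ContentSketch fetch(String url) {
    return new ContentSketch(url, detector);
  }
}

public class CachedDetectorDemo {
  public static void main(String[] args) throws InterruptedException {
    ProtocolSketch protocol = new ProtocolSketch();
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        for (int n = 0; n < 1000; n++) {
          protocol.fetch("http://example.org/doc" + n + ".xml");
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("detectors constructed: " + ExpensiveDetector.CONSTRUCTED.get());
  }
}
```

With this arrangement the helper is constructed once per protocol instance, regardless of how many threads call fetch() or how many documents they process. The approach relies on the helper being thread-safe, as stated for MimeUtil above.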



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480867#comment-16480867
 ] 

Sebastian Nagel edited comment on NUTCH-2578 at 5/24/18 2:30 PM:
-

Hi [~yossi], got it: definitely a good idea to keep the Tika instance in the 
object cache. Nevertheless, since we know that MimeUtil is thread-safe and 
ObjectCache getters are synchronized (since NUTCH-1606), I would also hold the 
reference in the protocol implementation. The protocol instance is already 
cached by ProtocolFactory and we avoid extra access of the object cache. Does 
this make sense?
Briefly, some background on how this issue was detected: while testing 
whether the connection pool of okhttp (see NUTCH-2576) causes any locks, I 
found that other locks appeared much more often in the stack dumps: NUTCH-2579, 
TIKA-2645, and some more I need to investigate.
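
To illustrate the point about synchronized getters, here is a minimal sketch assuming a cache whose accessor is synchronized. SyncCacheSketch, DetectorStub and HoldingProtocolSketch are hypothetical names, and getOrCreate is not the ObjectCache API; it only models a synchronized lookup. The protocol object performs one lookup when it is set up and afterwards uses its own final field.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical cache with a synchronized accessor (not the real ObjectCache API). */
final class SyncCacheSketch {
  private final Map<String, Object> objects = new HashMap<>();
  // Counts how often callers had to go through the synchronized accessor.
  final AtomicInteger lookups = new AtomicInteger();

  @SuppressWarnings("unchecked")
  synchronized <T> T getOrCreate(String key, Supplier<T> factory) {
    lookups.incrementAndGet();
    Object value = objects.get(key);
    if (value == null) {
      value = factory.get();
      objects.put(key, value);
    }
    return (T) value;
  }
}

/** Hypothetical stateless, thread-safe helper standing in for MimeUtil. */
final class DetectorStub {
  String detect(String url) {
    return url.endsWith(".xml") ? "application/xml" : "text/html";
  }
}

/** Hypothetical protocol implementation that holds the reference itself. */
final class HoldingProtocolSketch {
  private final DetectorStub detector;

  HoldingProtocolSketch(SyncCacheSketch cache) {
    // One synchronized lookup at setup time.
    this.detector = cache.getOrCreate("detector", DetectorStub::new);
  }

  String contentType(String url) {
    // Per-document work uses the held reference and never touches the cache.
    return detector.detect(url);
  }
}

public class HeldReferenceDemo {
  public static void main(String[] args) {
    SyncCacheSketch cache = new SyncCacheSketch();
    HoldingProtocolSketch protocol = new HoldingProtocolSketch(cache);
    for (int i = 0; i < 10_000; i++) {
      protocol.contentType("http://example.org/" + i);
    }
    System.out.println("cache lookups: " + cache.lookups.get());
  }
}
```

The cache still guarantees a single shared helper across protocol instances, while the field keeps the per-document path free of any monitor.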


was (Author: wastl-nagel):
Hi [~yossi], got it: definitely a good idea to keep the Tika instance in the 
object cache. Nevertheless, since we know that MimeUtil is thread-safe and 
ObjectCache getters are synchronized (since NUTCH-1606), I would also hold the 
reference in the protocol implementation. The protocol instance is already 
cached by ProtocolFactory and we avoid extra access of the object cache. Does 
this make sense?
Shortly about the background how this issue has been detected: while testing 
whether the connection pool of okhttp (see NUTCH-2576) causes any locks, I've 
found that other locks appeared much more often in the stacks: NUTCH-2578, 
TIKA-2645, and some more I need to investigate.


[jira] [Comment Edited] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482879#comment-16482879
 ] 

Tim Allison edited comment on NUTCH-2578 at 5/21/18 6:38 PM:
-

Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see [this 
comment|https://issues.apache.org/jira/browse/TIKA-2645?focusedCommentId=16482862=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482862]

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...or if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!
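
For readers unfamiliar with the idea of reusing SAXParsers, the following is a minimal sketch of a fixed-size parser pool built only on the JDK. It is not Tika's XMLReaderUtils implementation; SaxParserPoolSketch and its methods are hypothetical, and unlike a real root detector it parses the whole document rather than stopping at the first element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical fixed-size pool of SAXParser instances shared by worker threads. */
final class SaxParserPoolSketch {
  private final BlockingQueue<SAXParser> pool;

  SaxParserPoolSketch(int size) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
    pool = new ArrayBlockingQueue<>(size);
    for (int i = 0; i < size; i++) {
      pool.add(factory.newSAXParser()); // parsers are created once, up front
    }
  }

  /** Returns the qualified name of the root element, borrowing a pooled parser. */
  String rootElement(byte[] xml) throws Exception {
    SAXParser parser = pool.take(); // blocks while all parsers are in use
    try {
      RootHandler handler = new RootHandler();
      parser.parse(new ByteArrayInputStream(xml), handler);
      return handler.root;
    } finally {
      try {
        parser.reset(); // restore the original configuration before reuse
      } catch (UnsupportedOperationException e) {
        // Some implementations do not support reset(); the parser is reused as-is.
      }
      pool.offer(parser); // always succeeds: capacity equals the number of parsers
    }
  }

  private static final class RootHandler extends DefaultHandler {
    String root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
      if (root == null) {
        root = qName; // remember only the first element seen
      }
    }
  }
}

public class SaxParserPoolDemo {
  public static void main(String[] args) throws Exception {
    SaxParserPoolSketch pool = new SaxParserPoolSketch(4);
    byte[] xml = "<feed><entry/></feed>".getBytes(StandardCharsets.UTF_8);
    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        try {
          for (int n = 0; n < 100; n++) {
            pool.rootElement(xml);
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("root element: " + pool.rootElement(xml));
  }
}
```

In this sketch a pool smaller than the number of worker threads makes the surplus threads wait in take(), which is presumably why the quoted advice ties the pool size to the thread count.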


was (Author: talli...@mitre.org):
Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see [this 
comment|https://issues.apache.org/jira/browse/TIKA-2645?focusedCommentId=16482862=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482862]

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...of if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!


[jira] [Comment Edited] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482879#comment-16482879
 ] 

Tim Allison edited comment on NUTCH-2578 at 5/21/18 6:37 PM:
-

Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see [this 
comment|https://issues.apache.org/jira/browse/TIKA-2645?focusedCommentId=16482862=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16482862]

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...of if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!


was (Author: talli...@mitre.org):
Based on [~wastl-nagel]'s observation, I updated Apache Tika to reuse 
SAXParsers. I also added a multithreaded test a) for .xml files in our test 
suite and b) all files in our test suite to confirm that Tika.detect() is 
thread-safe.

The speedup for specifically xml root detection is impressive: see this comment.

As I wrote over on the Tika issue:
{quote}make sure to call XMLReaderUtils.setPoolSize(numThreads) to set an 
appropriate sized pool size for your needs...of if you can recommend a way for 
us to autosize, that'd be even better.
{quote}
 

If you are able to grab a nightly build and test on your machines/framework, 
let us know if you find any surprises!
