Hi Claudio,

On 2 July 2010 12:10, Claudio Martella <[email protected]> wrote:
> Thanks for the info. I didn't have this problem before Nutch 1.1, where
> my biggest change was the introduction of parse-tika.
>
> 4 out of 5 threads are RUNNABLE in this situation:
>
> Stack trace:
> org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
>
> which is actually pretty weird and difficult for me to understand. After
> the fetcher aborts with N hung threads, they just stay there hanging. In
> jconsole I can still see the same stack trace of these threads in
> FLVParser.parse(). They are not downloading (I can see that from the
> traffic on the network) and they should stop at a certain point (I have
> the content limit at 50MB).
> Maybe the problem is that FLVParser.parse() doesn't like truncated
> content for FLV streams. I'll try to crawl-urlfilter-ignore the FLV files.
>

That is indeed likely to be the source of the problem; see
https://issues.apache.org/jira/browse/TIKA-448
I gave a workaround for this issue, but adding .flv to the URL filter is
likely to do the trick.

> Another possibility I thought of is that the threads were dead
> garbage-collecting until Hadoop recalls them as dead tasks. That would
> make sense because that "zombie" situation lasts precisely 5 minutes all
> the time.
>
> Does it make sense to you?
>
> reinhard schwab wrote:
> > connect with jconsole to the java vm of nutch and look at the stack
> > traces of the threads.
> > you will get more info there.
> >
> > Claudio Martella wrote:
> >
> >> Hello,
> >>
> >> I'm using nutch 1.1 (with crawl command) to crawl an intranet document
> >> archive via webdav.
> >> At the end of each fetch phase the fetcher hangs
> >> like this:
> >>
> >> -activeThreads=5, spinWaiting=0, fetchQueues.totalSize=250
> >>
> >> From my analysis of network traffic, nothing is passing by. The logs
> >> show:
> >>
> >> 2010-06-30 13:38:35,335 INFO fetcher.Fetcher - fetching
> >> https://192.168.10.10/data/public/50.90_In_Bearbeitung/Stefano%20P/normen2010/normen2010.indd
> >> 2010-06-30 13:38:35,381 INFO auth.AuthChallengeProcessor - basic
> >> authentication scheme selected
> >> 2010-06-30 13:38:35,819 INFO fetcher.Fetcher - -activeThreads=5,
> >> spinWaiting=0, fetchQueues.totalSize=249
> >> 2010-06-30 13:38:36,824 INFO fetcher.Fetcher - -activeThreads=5,
> >> spinWaiting=0, fetchQueues.totalSize=250
> >>
> >> which I guess means it finishes downloading the specified file and then
> >> hangs until:
> >>
> >> 2010-06-30 13:43:35,963 INFO fetcher.Fetcher - -activeThreads=5,
> >> spinWaiting=0, fetchQueues.totalSize=250
> >> 2010-06-30 13:43:35,963 WARN fetcher.Fetcher - Aborting with 5 hung
> >> threads.
> >>
> >> so basically 5 minutes without doing anything.
> >>
> >> This is my configuration in nutch-site.xml related to the fetcher:
> >>
> >> <property>
> >>   <name>fetcher.server.delay</name>
> >>   <value>0.0</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.server.min.delay</name>
> >>   <value>0.0</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.threads.fetch</name>
> >>   <value>5</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.threads.per.host</name>
> >>   <value>5</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.threads.per.host.by.ip</name>
> >>   <value>false</value>
> >> </property>
> >>
> >> Any idea why this is happening?
> >>
> >> Thanks
> >>
> >> Claudio
> >>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> [email protected] http://www.tis.bz.it
>
> Short information regarding use of personal data. According to Section 13
> of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we
> process your personal data in order to fulfil contractual and fiscal
> obligations and also to send you information regarding our services and
> events. Your personal data are processed with and without electronic means
> and by respecting data subjects' rights, fundamental freedoms and dignity,
> particularly with regard to confidentiality, personal identity and the right
> to personal data protection. At any time and without formalities you can
> write an e-mail to [email protected] in order to object to the processing of
> your personal data for the purpose of sending advertising materials and also
> to exercise the right to access personal data and other rights referred to
> in Section 7 of Decree 196/2003. The data controller is TIS Techno
> Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the
> complete information on the web site www.tis.bz.it.
>

--
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
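
P.S. For the URL-filter route mentioned above, a rule along the lines of
`-\.(flv|FLV)$`, placed before the final accept rule in
conf/crawl-urlfilter.txt, should keep .flv files from being fetched and
parsed. The Java sketch below only illustrates how such a rule behaves. It is
a simplified stand-in, not Nutch's own filter class: the class name, the
two-rule list and the example URLs are assumptions made for illustration, and
it assumes the first matching rule decides.

```java
import java.util.regex.Pattern;

public class FlvFilterSketch {
    // Hypothetical rules in the style of crawl-urlfilter.txt:
    // a leading '-' rejects, a leading '+' accepts, first match wins.
    private static final String[] RULES = {
        "-\\.(flv|FLV)$", // skip Flash video so FLVParser is never invoked
        "+."              // accept everything else
    };

    public static boolean accept(String url) {
        for (String rule : RULES) {
            boolean allow = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            // find() matches anywhere in the URL, so the rule needs its own '$' anchor
            if (pattern.matcher(url).find()) {
                return allow;
            }
        }
        return false; // no rule matched: reject
    }

    public static void main(String[] args) {
        // Example URLs are made up for this sketch.
        System.out.println(accept("https://192.168.10.10/data/public/demo/clip.flv"));   // false
        System.out.println(accept("https://192.168.10.10/data/public/demo/report.pdf")); // true
    }
}
```

Note that a rule anchored with `$` would not catch a URL such as
clip.flv?download=1; if such URLs exist on the server, the pattern would need
adjusting.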

