Hi Claudio

On 2 July 2010 12:10, Claudio Martella <[email protected]> wrote:

> Thanks for the info. I didn't have this problem before Nutch 1.1, where
> my biggest change was the introduction of parse-tika.
>
> 4 out of 5 threads are RUNNABLE in this situation:
>
> Stack trace:
> org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:878)
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
>
> which is actually pretty weird and difficult for me to understand. After
> the fetcher aborts with N hung threads, they just stay there, hung. In
> jconsole I can still see the same stack trace for these threads, in
> FLVParser.parse(). They are not downloading (I can see that from the
> traffic on the network) and they should stop at some point (I have a
> content limit of 50MB).
> Maybe the problem is that FLVParser.parse() doesn't like truncated
> content for FLV streams. I'll try to exclude the FLV files via
> crawl-urlfilter.
>

That is indeed likely to be the source of the problem; see

https://issues.apache.org/jira/browse/TIKA-448

I gave a workaround for this issue there, but adding .flv to the URL filter
is likely to do the trick.
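
For reference, a minimal sketch of such a rule, assuming the regex-based
filter file used by the crawl command (conf/crawl-urlfilter.txt, the file you
mention above) and its usual syntax, where a leading '-' excludes any URL
matching the pattern. The pattern itself is only an illustration: it should
appear before any '+' rule that would accept the URL, and it will not catch
.flv URLs that carry a query string.

```text
# skip Flash video files so they never reach the parser (illustrative rule)
-\.(flv|FLV)$
```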


>
> Another possibility I thought of is that the threads were dead,
> garbage-collecting, until Hadoop reclaims them as dead tasks. That would
> make sense because that "zombie" situation lasts precisely 5 minutes
> every time.
>
>
> Does that make sense to you?
>
>
> reinhard schwab wrote:
> > Connect with jconsole to the Java VM of Nutch and look at the stack
> > traces of the threads.
> > You will get more info there.
> >
> > Claudio Martella wrote:
> >
> >> Hello,
> >>
> >> I'm using Nutch 1.1 (with the crawl command) to crawl an intranet
> >> document archive via WebDAV. At the end of each fetch phase the fetcher
> >> hangs like this:
> >>
> >> -activeThreads=5, spinWaiting=0, fetchQueues.totalSize=250
> >>
> >> From my analysis of the network traffic, nothing is passing through.
> >> The logs show:
> >>
> >> 2010-06-30 13:38:35,335 INFO  fetcher.Fetcher - fetching https://192.168.10.10/data/public/50.90_In_Bearbeitung/Stefano%20P/normen2010/normen2010.indd
> >> 2010-06-30 13:38:35,381 INFO  auth.AuthChallengeProcessor - basic
> >> authentication scheme selected
> >> 2010-06-30 13:38:35,819 INFO  fetcher.Fetcher - -activeThreads=5,
> >> spinWaiting=0, fetchQueues.totalSize=249
> >> 2010-06-30 13:38:36,824 INFO  fetcher.Fetcher - -activeThreads=5,
> >> spinWaiting=0, fetchQueues.totalSize=250
> >>
> >> which I guess means it finishes downloading the specified file and then
> >> it hangs until:
> >>
> >> 2010-06-30 13:43:35,963 INFO  fetcher.Fetcher - -activeThreads=5,
> >> spinWaiting=0, fetchQueues.totalSize=250
> >> 2010-06-30 13:43:35,963 WARN  fetcher.Fetcher - Aborting with 5 hung
> >> threads.
> >>
> >> So basically 5 minutes without doing anything.
> >>
> >> This is my fetcher-related configuration in nutch-site.xml:
> >>
> >> <property>
> >>   <name>fetcher.server.delay</name>
> >>   <value>0.0</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.server.min.delay</name>
> >>   <value>0.0</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.threads.fetch</name>
> >>   <value>5</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.threads.per.host</name>
> >>   <value>5</value>
> >> </property>
> >>
> >> <property>
> >>   <name>fetcher.threads.per.host.by.ip</name>
> >>   <value>false</value>
> >> </property>
> >>
> >> Any idea why this is happening?
> >>
> >>
> >> Thanks
> >>
> >>
> >> Claudio
> >>
> >>
> >>
> >
> >
> >
>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> [email protected] http://www.tis.bz.it
>
>
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
