Another option is <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>, which uses Tika, and Tika parses PDF.
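For example, the override in conf/nutch-site.xml would then look something like the following (just a sketch using the value above; the description text is only a placeholder):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Plugins to include; parse-(html|tika) lets Tika handle PDF and other binary formats.</description>
</property>

Note that in recent Nutch releases parse-tika should replace the old per-format parse plugins (parse-pdf, parse-msword, etc.), so those names can be dropped from the list.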
On Tue, May 22, 2012 at 1:00 PM, Tolga <[email protected]> wrote:
> Hi again,
>
> I am getting this error: org.apache.nutch.parse.ParseException: parser
> not found for contentType=application/pdf. I googled and found out that I
> have to add a plugin.includes line to include the pdf extension. However, I
> already have that line. Actually, the whole <property> block looks like
> this:
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> <description>Some long description</description>
> </property>
>
> However, I still get that error.
>
> What am I missing?
>
> Thanks,
>
>
> On 5/22/12 12:44 PM, Lewis John Mcgibbney wrote:
>
>> Yes, well then you should either set this property to -1 (which is a
>> safeguard to ensure that you definitely crawl and parse all of your
>> PDFs) or to a safe, responsible value that reflects the size of the
>> PDFs or other documents you expect to obtain during your crawl. The
>> first option has the downside that on occasion the parser can choke
>> on rather large files...
>>
>> On Tue, May 22, 2012 at 10:36 AM, Tolga <[email protected]> wrote:
>>
>>> What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
>>>
>>> On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
>>>
>>>> Yes, I know.
>>>>
>>>> If your PDFs are larger than this then they will either be truncated
>>>> or may not be crawled. Please look thoroughly at your log output...
>>>> you may wish to use the http.verbose and fetcher.verbose properties
>>>> as well.
>>>>
>>>> On Tue, May 22, 2012 at 10:31 AM, Tolga <[email protected]> wrote:
>>>>
>>>>> The value is 65536.
>>>>>
>>>>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>>>>>
>>>>>> Try your http.content.limit and also make sure that you haven't
>>>>>> changed anything within the Tika mimeType mappings.
>>>>>>
>>>>>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
>>>>>>
>>>>>>> Sorry, I forgot to also add my original problem. PDF files are not
>>>>>>> crawled. I even modified -topN to be 10.
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>> Subject: PDF not crawled/indexed
>>>>>>> Date: Tue, 22 May 2012 10:48:15 +0300
>>>>>>> From: Tolga <[email protected]>
>>>>>>> To: [email protected]
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am crawling my website with this command:
>>>>>>>
>>>>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>>>>>>> http://localhost:8983/solr/ -depth 20 -topN 5
>>>>>>>
>>>>>>> Is it a good idea to modify the directory name? Should I always
>>>>>>> delete indexes prior to crawling and stick to the same directory
>>>>>>> name?
>>>>>>>
>>>>>>> Regards,
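As for the content limit discussed in the quoted thread above, that override also goes in conf/nutch-site.xml; this is only a sketch (http.content.limit is in bytes, so the default 65536 is 64 KB and a 4.7 MB PDF would be truncated; -1 removes the limit, at the cost of the parser occasionally choking on very large files):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum length of downloaded content in bytes; -1 disables truncation.</description>
</property>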

