Hi Piet,

We will hopefully be pushing 1.5 in the next few days, so please watch this space.
Thanks

On Tue, May 22, 2012 at 11:43 AM, Piet van Remortel <[email protected]> wrote:
> Ok thanks, that property seems the right solution indeed, but it's not
> part of the 1.4 release that I currently use. The current source trunk
> does include it, though.
>
> On Tue, May 22, 2012 at 12:31 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> Well, the value is in bytes, so anything above the default (~65000) is
>> truncated.
>> Ferdy also introduced a parser.skip.truncated property, which is set to
>> true by default. The justification for this is that parsing can sometimes
>> take extremely high levels of CPU, which then leads to the parser
>> choking.
>>
>> On Tue, May 22, 2012 at 10:47 AM, Piet van Remortel <[email protected]> wrote:
>>> I have been dealing with the exact same issues, and I wonder what
>>> happens to PDFs that exceed the file size limit: are they cropped (and
>>> partly parsed?) or fully ignored? I seem to observe parsing problems in
>>> PDFs since using a file size limit. Setting the limit to -1 indeed
>>> caused consistent choke errors on large pages/files, so setting a hard
>>> limit seemed the only option.
>>>
>>> thanks
>>>
>>> Piet
>>>
>>> On Tue, May 22, 2012 at 11:44 AM, Lewis John Mcgibbney <[email protected]> wrote:
>>>
>>>> Yes, well then you should either set this property to -1 (which is a
>>>> safeguard to ensure that you definitely crawl and parse all of your
>>>> PDFs) or, as a safeguard, to a responsible value that reflects the
>>>> size of the PDFs or other documents you envisage obtaining during
>>>> your crawl. The first option has the downside that on occasion the
>>>> parser can choke on rather large files...
>>>>
>>>> On Tue, May 22, 2012 at 10:36 AM, Tolga <[email protected]> wrote:
>>>>> What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
>>>>>
>>>>> On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
>>>>>>
>>>>>> Yes, I know.
>>>>>>
>>>>>> If your PDFs are larger than this then they will either be truncated
>>>>>> or may not be crawled. Please look thoroughly at your log output;
>>>>>> you may wish to use the http.verbose and fetcher.verbose properties
>>>>>> as well.
>>>>>>
>>>>>> On Tue, May 22, 2012 at 10:31 AM, Tolga <[email protected]> wrote:
>>>>>>>
>>>>>>> The value is 65536.
>>>>>>>
>>>>>>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>>>>>>>>
>>>>>>>> Try your http.content.limit and also make sure that you haven't
>>>>>>>> changed anything within the Tika mimeType mappings.
>>>>>>>>
>>>>>>>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Sorry, I forgot to also add my original problem: PDF files are
>>>>>>>>> not crawled. I even modified -topN to be 10.
>>>>>>>>>
>>>>>>>>> -------- Original Message --------
>>>>>>>>> Subject: PDF not crawled/indexed
>>>>>>>>> Date: Tue, 22 May 2012 10:48:15 +0300
>>>>>>>>> From: Tolga <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am crawling my website with this command:
>>>>>>>>>
>>>>>>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>>>>>>>>> http://localhost:8983/solr/ -depth 20 -topN 5
>>>>>>>>>
>>>>>>>>> Is it a good idea to modify the directory name? Should I always
>>>>>>>>> delete indexes prior to crawling and stick to the same directory
>>>>>>>>> name?
>>>>>>>>>
>>>>>>>>> Regards,

--
Lewis
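
For reference, a minimal sketch of how the properties discussed above might look in conf/nutch-site.xml. The values are illustrative only (the 10 MB content limit is an assumption, not a recommendation), and parser.skip.truncated is only available from 1.5/current trunk onwards:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- download limit in bytes; the default is 65536, -1 removes the limit.
         10485760 (10 MB) is an illustrative value, not a recommendation. -->
    <value>10485760</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <!-- skip parsing of documents cut off by http.content.limit; defaults to true -->
    <value>true</value>
  </property>
  <property>
    <name>http.verbose</name>
    <!-- extra protocol-level logging, useful when PDFs silently go missing -->
    <value>true</value>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <!-- extra fetcher logging -->
    <value>true</value>
  </property>
</configuration>

Whether -1 or a hard limit is the better trade-off depends, as discussed in the thread, on how large your documents are and how well the parser copes with them.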

