HtmlParseFilter and tika metadata

2013-01-29 Thread webdev1977
Hello all.. Probably a silly question but I can't find it for the life of me. During the execution of a custom HtmlParseFilter, can I access the Tika-determined metadata for my document? If so, what object does it live in: Content, ParseResult, HTMLMetaTags? I can't find it and I want to che
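
A minimal illustrative sketch (not from the thread), assuming the Nutch 1.x HtmlParseFilter and ParseData APIs: the ParseResult maps the page URL to a Parse, and its ParseData exposes both the parse metadata (what the parser, e.g. Tika, extracted) and the content metadata. The class name is made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.metadata.Metadata;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class MetadataInspectingFilter implements HtmlParseFilter {
      private Configuration conf;

      public ParseResult filter(Content content, ParseResult parseResult,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        // The Parse for the current URL carries the metadata the parser produced.
        Parse parse = parseResult.get(content.getUrl());
        Metadata parseMeta = parse.getData().getParseMeta();     // parser-extracted metadata
        Metadata contentMeta = parse.getData().getContentMeta(); // protocol/content metadata
        for (String name : parseMeta.names()) {
          System.out.println(name + " = " + parseMeta.get(name));
        }
        return parseResult;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }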

MoreIndexingFilter last-modified time from protocol-file docx

2012-12-11 Thread webdev1977
Using Nutch 1.4 and Solr 3.6, I see the bug that was submitted for the indexing filter not recognizing dates in the format yyyy-MM-dd'T'HH:mm:ss'Z', but I am still having issues with it. This only happens with any office documents with the "x"
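
For reference, a small sketch of parsing that timestamp pattern with SimpleDateFormat; the sample value is made up, and this is only an illustration of the format in question, not the filter's actual code.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class LastModifiedParseDemo {
      public static void main(String[] args) throws ParseException {
        // ISO-8601 style timestamp of the kind reported for Office "x" documents (e.g. docx).
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // the literal 'Z' means the value is UTC
        Date lastModified = fmt.parse("2012-12-11T09:30:00Z");
        System.out.println(lastModified.getTime()); // epoch millis, ready for an index field
      }
    }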

RE: Relative urls - outlinks

2012-09-18 Thread webdev1977
NOOOo!!! Just kidding! :-) So maybe you can clear something up for me. In the future while building a new crawldb, if I only wanted to accept urls from the following: http://myhost:81/site1/test.php?id=1234 http://myhost:81/site1/list.php?page=1234&count=21 http://myhost:81/site1/view.php
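
One hedged way to enforce such a whitelist would be a custom URLFilter plugin that accepts only those patterns and rejects everything else; the class name and regex below are illustrative, and the same effect is usually achieved with regex-urlfilter.txt rules rather than code.

    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    public class WhitelistUrlFilter implements URLFilter {
      // Accept only the three site1 PHP pages mentioned above (illustrative pattern).
      private static final Pattern ACCEPT = Pattern.compile(
          "^http://myhost:81/site1/(test|list|view)\\.php\\?.*");

      private Configuration conf;

      public String filter(String urlString) {
        // Returning the URL keeps it; returning null drops it from the crawl.
        return ACCEPT.matcher(urlString).matches() ? urlString : null;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }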

Relative urls - outlinks

2012-09-18 Thread webdev1977
Is there any way to keep Nutch from generating outlinks for any RELATIVE urls? I basically don't want to use ANY relative urls that I find.. Then the next question is how do I get them out of my crawldb :-)

RE: Cached page (like google) with hits highlighted

2012-08-28 Thread webdev1977
PDF2XHTML is already being loaded by the PDF parser. Something is not adding it to the DocumentFragment, however; I can't seem to find out where. * any other ideas? * I don't want to run Tika separately during the parse step to get the XHTML (seems silly) but I will if I absolutely have to.

Re: Cached page (like google) with hits highlighted

2012-08-16 Thread webdev1977
Thanks Julien and Markus for all your help. I poked around the code some more yesterday and it seems like the markup is just not getting into the DocumentFragment. All I get (for Word and PDF) is just one html tag with the text of the document in between. Maybe something is not using parse-tika pr

RE: Cached page (like google) with hits highlighted

2012-08-15 Thread webdev1977
tika-app (the GUI) gives me back the XHTML just fine.. not sure what is going on here.. maybe it is not stored properly in the DocumentFragment upon parsing?

RE: Cached page (like google) with hits highlighted

2012-08-15 Thread webdev1977
Does the 1.4 version of Nutch have tika-app? Also.. maybe I am not using the DocumentFragment object properly? Below is a summary version of my code: public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) { for (int x = 0; x
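
A hedged sketch of what a loop over the DocumentFragment might look like, using plain org.w3c.dom traversal to print whatever structure the parser actually produced; the helper class is illustrative, not the code from the post.

    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class DomDumper {
      // Recursively print the node tree so you can see what the parser produced.
      public static void dump(Node node, int depth) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) indent.append("  ");
        if (node.getNodeType() == Node.TEXT_NODE) {
          String text = node.getNodeValue().trim();
          if (!text.isEmpty()) System.out.println(indent + "#text: " + text);
        } else {
          System.out.println(indent + node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          dump(children.item(i), depth + 1);
        }
      }

      // Inside the filter you would call: dump(doc, 0);
      public static void inspect(DocumentFragment doc) { dump(doc, 0); }
    }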

RE: Cached page (like google) with hits highlighted

2012-08-15 Thread webdev1977
Thanks Markus! So after some testing and walking the DocumentFragment, I see that all I get is one node containing "some content here and here". I guess I expected to see more from a PDF/Word document (like H1 tags, etc.) that would help make the XHTML format more readable. Am I missing something? Do I have

Cached page (like google) with hits highlighted

2012-08-15 Thread webdev1977
Hello Everyone! I am up and running with my Nutch 1.4 / Solr 3.3 architecture and am looking to add a few new features. My users want the ability to view their Solr results as XHTML with the hits highlighted in the document. So a Word document/PDF would become an XHTML version first. I see th

Deleting file: urls from crawldb that give 404 status

2012-06-19 Thread webdev1977
I am having an issue with removing deleted file: URLs on subsequent crawls. They stay with a status of db_unfetched and don't seem to want to use the 404 (db_gone) status. This means that I can't run solrclean to get rid of the old file: URLs. I poked around in the protocol-file code and made

Relative urls, interpage href anchors

2012-03-27 Thread webdev1977
I am seeing an issue with crawling HTML pages that have relative URLs embedded in them. I know there is an ongoing issue related to relative URLs that begin with a ?, but this seems to be a different issue. In regex-normalize.xml there is a rule with pattern #.*?(\?|&|$) and substitution $1. Here is my
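
For illustration, a small standalone sketch of what that normalizer rule does (strip an in-page anchor while keeping the ?, & or end-of-string that follows it); the URLs are made up.

    import java.util.regex.Pattern;

    public class AnchorStripDemo {
      public static void main(String[] args) {
        // Same rule as the regex-normalize.xml entry quoted above:
        // pattern "#.*?(\?|&|$)", substitution "$1" -- drops an in-page anchor
        // but keeps the character that follows it.
        Pattern anchor = Pattern.compile("#.*?(\\?|&|$)");
        String[] urls = {
            "http://myhost/page.html#section2",
            "http://myhost/page.html#section2?id=5"
        };
        for (String url : urls) {
          System.out.println(url + " -> " + anchor.matcher(url).replaceFirst("$1"));
        }
      }
    }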

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-26 Thread webdev1977
I think I may have figured it out.. but I don't know how to fix it :-( I have many PDFs and HTML files that have relative links in them. They are not from the originally hosted site, but are re-hosted. Nutch/Tika is trying to prepend the relative URLs it encounters with the URL that contained th

Re: Older plugin in Nutch 1.4

2012-03-26 Thread webdev1977
I believe it is complaining about this: public void addIndexBackendOptions(Configuration conf) { LuceneWriter.addFieldOptions(MP3_TRACK_TITLE, LuceneWriter.STORE.YES, LuceneWriter.INDEX.TOKENIZED, conf); LuceneWriter.addFieldOptions(MP3_ALBUM, LuceneWriter.STORE.YES, LuceneWriter.
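
The LuceneWriter-based addIndexBackendOptions hook belongs to the old Lucene indexing backend and appears to be gone by 1.4, which is presumably what the build is complaining about. A hedged sketch of how the same two MP3 fields might instead be contributed by a 1.4-style IndexingFilter, where the filter just adds values to the NutchDocument and storage/analysis is handled by the Solr schema; the field names and metadata keys are assumptions carried over from the snippet above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class Mp3IndexingFilter implements IndexingFilter {
      private static final String MP3_TRACK_TITLE = "trackTitle";
      private static final String MP3_ALBUM = "album";

      private Configuration conf;

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Field storage/analysis is configured on the Solr side (schema.xml),
        // so the filter only adds the values it pulled out of the parse metadata.
        String title = parse.getData().getParseMeta().get(MP3_TRACK_TITLE);
        if (title != null) doc.add(MP3_TRACK_TITLE, title);
        String album = parse.getData().getParseMeta().get(MP3_ALBUM);
        if (album != null) doc.add(MP3_ALBUM, album);
        return doc;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }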

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-26 Thread webdev1977
I guess I STILL don't understand the topN setting. Here is what I thought it would do: Seed: file:myfileserver.com/share1. share1 dir listing: file1.pdf ... file300.pdf, dir1 ... dir20. Running the following in a never-ending shell script: {generate crawl/crawldb crawl/segments -topN 1000 f

db_unfetched large number, but crawling not fetching any longer

2012-03-23 Thread webdev1977
I was under the impression that setting topN for crawl cycles would limit the number of items each iteration of the crawl would fetch/parse. However, eventually after continuously running crawl cycles it would get ALL the urls. My continuous crawl has stopped fetching/parsing and the stats from c

Re: crawl and update one url already in crawldb

2012-03-22 Thread webdev1977
I just tried it out and so far so good.. Not a near-instant solution, but it works ;-) One last question.. If I am running a bunch of bin/nutch commands from the same directory I seem to be having an issue. I am assuming it is with the mapred system and various tmp files (running in local mode

Re: crawl and update one url already in crawldb

2012-03-22 Thread webdev1977
Thanks for the quick response Markus! How would that fit into this continuous crawling scenario (I am trying to get the updates as quickly as possible into Solr :-) If I am doing the generate --> fetch $SEGMENT --> parse $SEGMENT --> updatedb crawldb $segment --> solrindex --> solrdedup cycle a

crawl and update one url already in crawldb

2012-03-22 Thread webdev1977
I have created an application that can detect when files are created/modified/deleted in one of our Windows Share drives and I would like to know if it is possible upon notification of this to crawl just a single URL in the crawldb? I think it is possible to run individual new crawls for each url

Hostnames changed for lots of URLS in crawldb, solr index, how to change?

2012-03-12 Thread webdev1977
How would one go about changing the hostnames that a large number of urls point to in both the crawldb as well as the solr index? I tried running the updatedb with the -normalize switch on. I added a regular expression in regex-normalize.xml. Then I ran the solrindex command, but nothing seemed to

Re: Optimizing crawling for small number of domains/sites (aka. intranet crawling)

2012-03-12 Thread webdev1977
Well, running it with 200 fetcher threads and no delay works for about 20 minutes.. then the file server crashed. So I think that the DNS queries are the issue. I am not able to set up my own DNS server, but I did find this setting in java.security: networkaddress.cache.ttl. Since I am usin
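
For reference, the same JVM-wide DNS cache TTL can also be set programmatically via java.security.Security before any lookups happen; a hedged sketch, with arbitrary values.

    import java.security.Security;

    public class DnsCacheConfig {
      public static void main(String[] args) {
        // Cache successful DNS lookups for an hour so repeated fetches of the same
        // hosts don't hammer the resolver (-1 would mean "cache forever").
        Security.setProperty("networkaddress.cache.ttl", "3600");
        // Optionally also cap how long failed lookups are remembered.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
      }
    }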

Optimizing crawling for small number of domains/sites (aka. intranet crawling)

2012-03-06 Thread webdev1977
Is there a guide to optimizing Nutch/Hadoop for crawling intranet sites? Most of what I need to crawl are large stores of data (databases exposed through HTML), share drive content, etc. I have a very, very small number of "sites" to crawl (two DBs and one share drive). The file share crawling is

Re: Large Shared Drive Crawl

2012-02-28 Thread webdev1977
What is a reasonable number of threads? What about memory? Where is the best place to set that: in the nutch script, or in one of the config files? I abandoned using distributed mode (10 slaves); it was taking WAY too long to crawl the web and share drives in my enterprise, not to mention I am

Re: Large Shared Drive Crawl

2012-02-28 Thread webdev1977
OH.. forgot to say.. no I am not parsing while fetching. I had more problems with that so I turned it off.

Re: Large Shared Drive Crawl

2012-02-28 Thread webdev1977
Thanks for the reply! I guess I don't mind using topN as long as I can be assured that I will get ALL of the URLs crawled eventually. Do you know if that is a true statement?

Large Shared Drive Crawl

2012-02-27 Thread webdev1977
I am attempting to crawl a very large intranet file system using Nutch and I am having some issues. At one point in the crawl cycle I get a Java heap space error during fetching. I think it is related to the number of URLs listed in the segment to be fetched. I do want to crawl/index EVERYTHING

RE: Crawling Local Files within Cygwin

2012-02-21 Thread webdev1977
I am having the same issue with 1.4! It was working fine in 1.3 and 1.2. Any ideas what specific config changes made the difference?

Question regarding NutchHadoopTutorial

2012-02-16 Thread webdev1977
In this tutorial: http://wiki.apache.org/nutch/NutchHadoopTutorial the following is stated: "every node you wish to include within your cluster e.g. both Nutch and Hadoop packages should be installed in every machine." I am curious as to why every node must contain a Nutch distributio

Re: Stylesheet in plugin not found when run in distributed mode

2012-02-16 Thread webdev1977
As I suspected, based on the code changes I combed through from 1.3 to 1.4, upgrading to 1.4 did not fix the issue. I still cannot complete solrindex. All other phases work fine. It is still trying to find my stylesheet in a place that does not exist (see original post). Any other ideas?

Re: Stylesheet in plugin not found when run in distributed mode

2012-02-15 Thread webdev1977
I am wondering if this is actually a bug that has not been discovered/fixed yet. The problem only occurs in the solrindex phase of the crawl. All other phases (inject, generate, fetch, parse, invertlinks & updatedb) work fine.

Re: Stylesheet in plugin not found when run in distributed mode

2012-02-13 Thread webdev1977
What if I told you that it isn't easy in any way, shape, or form to update my Nutch version? Is there a patch I could apply?

Re: Stylesheet in plugin not found when run in distributed mode

2012-02-13 Thread webdev1977
1.3

Stylesheet in plugin not found when run in distributed mode

2012-02-13 Thread webdev1977
Hello All: I am running into an interesting issue that I first thought was related to MAPREDUCE-967 (https://issues.apache.org/jira/browse/MAPREDUCE-967), where the crawl could not find the plugin directory because it was not unpacked properly. I tried the suggestions listed and for some phases of
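
One hedged workaround for path problems like this is to load the stylesheet from the plugin's classpath instead of an absolute filesystem path, so it is found wherever Hadoop happens to unpack the job; the resource name below is illustrative.

    import java.io.InputStream;
    import javax.xml.transform.Templates;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamSource;

    public class StylesheetLoader {
      // Load my.xsl from the plugin jar's classpath rather than an absolute path,
      // so it is found no matter where the job is unpacked on a task tracker.
      public static Templates loadTemplates() throws Exception {
        InputStream in = StylesheetLoader.class.getClassLoader()
            .getResourceAsStream("my.xsl");
        if (in == null) {
          throw new IllegalStateException("my.xsl not found on the plugin classpath");
        }
        try {
          return TransformerFactory.newInstance().newTemplates(new StreamSource(in));
        } finally {
          in.close();
        }
      }
    }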

Re: Java out of memory error

2012-02-08 Thread webdev1977
I am curious as to whatever came of this? I am having the exact same issue with Nutch 1.3.

RE: Crawling Local Files within Cygwin

2012-02-07 Thread webdev1977
I have done this using a similar setup, and you have to play around with the number of slashes that you use after the "file:". I don't remember off the top of my head the correct number, but try up to six. (Crazy, I know!)

Nutch 1.3 Fetching where does this happen?

2011-10-05 Thread webdev1977
Hello All! When using Nutch 1.3 in fully distributed mode, where does the fetching occur? Does each node get a list of URLs to fetch? What property in hadoop/mapreduce, etc. decides how many URLs a node gets to fetch? I am worried about memory on my nodes. Some of the files in our enterpri

Finally got hadoop + nutch 1.3 + cygwin cluster working! ? now

2011-09-29 Thread webdev1977
I finally got a three-machine cluster working with Nutch 1.3, Hadoop 0.20.0 and Cygwin! I have a few questions about configuration. I am only going to be crawling a few domains and I need this cluster to be very fast. Right now it is slower using Hadoop in distributed mode than using just the l

Re: protocol-httpclient

2011-09-28 Thread webdev1977
Are there any plans to fix the protocol-httpclient plugin? I do not have the time nor the expertise necessary to upgrade it myself. I mean I COULD do it, but it would take me eons :-)

Re: Nutch and Hadoop not working proper

2011-09-21 Thread webdev1977
I found a workaround to the exact issue and error message encountered by the OP. I think this only applies to running Hadoop (0.20.2) in a Windows (Cygwin) environment. Add the following setting to mapred-site.xml: hadoop.job.history.user.location = none (final = true). I have not tried it with this

Re: Nutch and Hadoop not working proper

2011-09-19 Thread webdev1977
I know this is old, but were you ever able to resolve this issue? I am having the same problem. I have traced it back in the code to an init method that uses the value of hadoop.log.history.location. I have tried setting this in both core-site as well as hdfs-site to no avail.

Re: Nutch 1.3 + Cygwin + hadoop + paths

2011-09-19 Thread webdev1977
I was afraid of this :-( I can't believe that no one has tried this configuration yet?

Nutch 1.3 + Cygwin + paths

2011-09-14 Thread webdev1977
I am having a hard time getting Nutch 1.3 to run in pseudo-distributed mode on Windows Server 2008 SP2. I spent a week messing with Hadoop version 0.20.203.0 and I have come to the conclusion that it is not possible to start the task tracker due to an issue in RawLocalFileSystem.java(515). It

Re: SSHD for Nutch 1.3 in Pseudo Distributed mode

2011-09-01 Thread webdev1977
Is it generally not recommended that Cygwin is used to run Hadoop? There is no way I am getting a Linux box :-(

Re: SSHD for Nutch 1.3 in Pseudo Distributed mode

2011-09-01 Thread webdev1977
I FINALLY got sshd to work. Turns out I had a bum installation of Cygwin and OpenSSH. I figured as much when I would run ssh and all it would do is give me the usage statement! Now if I could just get the job to run in Hadoop :-(. It is stuck.. and has been on this: INFO mapred.JobClient: map

SSHD for Nutch 1.3 in Pseudo Distributed mode

2011-08-29 Thread webdev1977
Do I NEED SSHD for Nutch 1.3 in pseudo-distributed mode? I am running on a Windows server using Cygwin (obviously :-) I cannot get Hadoop/Nutch to run in deploy mode and I am not sure if it has something to do with ssh or not. When I run start-all.sh it gives me some ssh usage errors and also

Re: force recrawl

2011-08-19 Thread webdev1977
I was going to ask the SAME question :-) I think it is a PITA that you can't force a recrawl. Wonder if it could be accomplished by altering the codebase?

Re: Is running nutch in pseudo-distributed mode really worth it?

2011-08-18 Thread webdev1977
The tutorial that exists on the Nutch wiki is for versions < 1.3. Does it still generally apply to Nutch 1.3?

Is running nutch in pseudo-distributed mode really worth it?

2011-08-15 Thread webdev1977
I have been looking at the pros and cons of running Nutch locally in pseudo-distributed mode. I have a very large machine with lots of processors and memory (16 GB). I am not able to get more machines to set up a proper Hadoop cluster. Is it worth the overhead to set up Hadoop in pseudo-distributed

Re: protocol-httpclient

2011-08-02 Thread webdev1977
Thanks for your reply! I had not seen any weird exceptions before using it in v1.2. In this version I am able to fetch the first page from an HTTPS HTML page, but then it doesn't find any outlinks. I tried the ParserChecker and got the same results. So it stops after this first round. I have tr

Re: Fetched pages has no content

2011-08-02 Thread webdev1977
Both are in the list, but I guess since parse-html is listed first, it wins..

protocol-httpclient

2011-08-01 Thread webdev1977
I have just recently learned that it is recommended not to use protocol-httpclient due to problems with the underlying Commons HTTP library. I am very disappointed to learn this, as about half of my domains to crawl use HTTPS and require certs. Does anyone know how much of an effort it w

Re: Fetched pages has no content

2011-08-01 Thread webdev1977
I had protocol-httpclient working in 1.2 and sending certificates for a group of sites. I moved the plugin over to the 1.3 environment and it won't work.. I am having the same issue as the OP.. no content parsed for the seed url. I see it come in on debug.wire... https://domain.com/test.php?

Re: Fetched pages has no content

2011-08-01 Thread webdev1977
So I am not crazy, the protocol-httpclient IS broken!? I have been wondering for a week or two what has changed between 1.2 and 1.3 that would have caused such a problem. Is there a JIRA open for the issue?

Re: How to build nutch 1.3 without an internet connection

2011-07-12 Thread webdev1977
I won't when I start making code changes :-) How can I compile/deploy just my plugin changes without internet? I don't want to make any changes to the core code, but I have had to do that in the past. And doing it "off the network" seems like it might be a PITA?

How to build nutch 1.3 without an internet connection

2011-07-12 Thread webdev1977
I need the ability to build Nutch 1.3 with Ant without being connected to the internet (looks like Ivy is used to download dependent libs). Is this possible? What do I have to modify to make this happen? Thanks!!

Re: Going Beyond the Prototype

2011-05-25 Thread webdev1977
I sure have! And I cranked it up at one point sooo much that I crashed the TNS listener on our DB!! The fetching part didn't take the longest; it was the mapreduce that took forever..

Re: Going Beyond the Prototype

2011-05-25 Thread webdev1977
Any ideas on how (even if it requires code changes) to speed up the mapreduce portion for a vertical crawl with a very small number of sites (three right now)?

Re: Going Beyond the Prototype

2011-05-16 Thread webdev1977
Julien Nioche-4 wrote:
>> I was saying that based on what the previous poster stated. Also the fact that I have read through quite a bit of posts stating that the problem with crawling in a vertical environment has to do with the way fetcher2 was built. The fetches are grouped

Re: Going Beyond the Prototype

2011-05-12 Thread webdev1977
I was saying that based on what the previous poster stated. Also the fact that I have read through quite a few posts stating that the problem with crawling in a vertical environment has to do with the way fetcher2 was built. The fetches are grouped by domain name, and if you have a lot of urls

Re: Going Beyond the Prototype

2011-05-12 Thread webdev1977
Dare I ask... What sorts of crawlers are meant for use with vertical search systems?

Re: Going Beyond the Prototype

2011-05-10 Thread webdev1977
Thanks for your reply.. I was kind of afraid someone was going to say that :-( I have invested so much time into developing plugins for Nutch that I am deathly afraid of moving on to something else. To answer your questions: 1) What kind of documents/repositories are you trying to provide search f

Going Beyond the Prototype

2011-05-10 Thread webdev1977
I have been working on and off for about a year now on developing a prototype for Enterprise Search using Nutch and Solr. I have also incorporated a plugin using the hive-mrc Google Code project for automatic tagging based on a custom taxonomy that my customer uses. I have been slowly migrating up the cha

Re: https authentication

2011-04-14 Thread webdev1977
Is it possible to set up a Java keystore with a certificate and then pass that info as parameters on the java runtime call? So, add javax.net.ssl.xxx for each parameter?
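
A hedged sketch of the standard JSSE system properties involved; they can be set in code as below or passed as -D flags on the java command line. Paths and passwords are placeholders.

    public class SslKeystoreSetup {
      public static void main(String[] args) {
        // Client certificate (keystore) the crawler presents to HTTPS sites.
        System.setProperty("javax.net.ssl.keyStore", "/path/to/client-keystore.jks");
        System.setProperty("javax.net.ssl.keyStorePassword", "changeit");
        // CA certificates the crawler trusts when validating the servers.
        System.setProperty("javax.net.ssl.trustStore", "/path/to/truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "changeit");
        // Equivalent command-line form:
        //   java -Djavax.net.ssl.keyStore=... -Djavax.net.ssl.keyStorePassword=... <main class>
      }
    }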

Re: https authentication

2011-04-11 Thread webdev1977
I have this same question! Anyone? Anyone at all?

HtmlParseFilter custom Plugin, How to extract more then one tag on page.

2011-03-14 Thread webdev1977
All of the examples of custom implementations of HtmlParseFilter seem to be suited to matching one pattern for one tag in an HTML page.. For instance, this code snippet below (thanks to http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html for the example). Ho

HtmlParseFilter custom Plugin, How to extract more then one tag on page.

2011-03-14 Thread webdev1977
Any ideas would be greatly appreciated! All of the examples of custom implementations of HtmlParseFilter seem to be suited to matching one pattern for one tag in an HTML page.. For instance, this code snippet below (thanks to: http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-an
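
A hedged sketch of extracting several different tags in a single walk over the DocumentFragment, stashing what is found into the parse metadata; the tag names and metadata keys are illustrative, not from the thread.

    import org.apache.nutch.metadata.Metadata;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class MultiTagExtractor {
      // Walk the whole fragment once, recording the text of every h1, h2 and meta tag.
      public static void extract(DocumentFragment doc, Metadata parseMeta) {
        walk(doc, parseMeta);
      }

      private static void walk(Node node, Metadata parseMeta) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
          String tag = node.getNodeName().toLowerCase();
          if (tag.equals("h1") || tag.equals("h2")) {
            parseMeta.add("heading", node.getTextContent().trim());
          } else if (tag.equals("meta")) {
            Element meta = (Element) node;
            parseMeta.add("meta." + meta.getAttribute("name"), meta.getAttribute("content"));
          }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          walk(children.item(i), parseMeta);
        }
      }
    }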

NUTCH-844 back port to 1.2??

2011-02-01 Thread webdev1977
I am trying to use the stable version of Nutch, 1.2, but I am encountering memory leaks that need to be fixed. I think this patch will fix the issues, but I see it is only for v2.0. I tried to apply the patches to 1.2 and obviously there are lots of issues (specifically with Crawl.java). https

Re: fetcher.store.content and fetcher.parse

2010-10-07 Thread webdev1977
So how is it that one is able to crawl huge websites with the crawl script and not use parse=false? You would have to have enormous amounts of disk space to run the parse later. I am not even able to run with fetcher.parse=false and fetcher.store.content=true. I get an out of memory er

fetcher.store.content and fetcher.parse

2010-10-07 Thread webdev1977
Could someone please clarify the relationship between these two properties? I have been reading that it is not wise to set fetcher.parse to true, but if you set it to false and then set fetcher.store.content to false you get an error during the crawl: Exception in thread "main" org.apache.had

Re: Nutch on file system and web

2010-10-06 Thread webdev1977
That is a very good question! I am currently only crawling my local file system, but am about to add an HTTP URL; I would love to know the answer. Have you given it a try yet?

Re: Not getting all documents

2010-10-01 Thread webdev1977
Good Morning.. I was wondering if you ever found a solution to your problem? I am facing the same problem. I am missing about 300,000 fetched files. I can't for the life of me figure out why it is not getting all the URLs.

Stack Trace from Crawling filesystem - OutOfMemoryError: PermGen Space

2010-09-23 Thread webdev1977
I would appreciate any help anyone could lend. A very deep crawl of a file system using release candidate 1.2 #4 produces an OutOfMemory error after about two hours of running. I am parsing html/text/tika/pdf/zip. Any ideas? "FetcherThread" daemon prio=6 tid=0x0559dc00 nid=0xd38 runnable [

Re: Tika Excel parsing causing out of memory

2010-08-19 Thread webdev1977
The above stack trace is related to the same issue that this person is having. The merger task in mapred is trying to load too much into memory at one time. Anyone know if there is a mapred property that controls the number of bytes the merger tries to handle at one time? I suspect this would

Re: Tika Excel parsing causing out of memory

2010-08-18 Thread webdev1977
For more info, below is the dump from the OutOfMemoryError:
"Thread-347" prio=5 tid=390 RUNNABLE
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
    Local Variable: org.apache.hadoop.io.Data

Re: Tika Excel parsing causing out of memory

2010-08-18 Thread webdev1977
In theory, if I find and get rid of the "bad" Excel files, which I may have done, to overcome this hurdle, is it even possible to crawl a large file system (approx. 350,000 files) with such a small box? Here is my config: Windows XP SP3, Core 2 Duo CPU @ 3.00GHz, 1.95 GB of RAM. If it is p

Re: Have yet to complete a very large filesystem crawl

2010-08-11 Thread webdev1977
Claudio Martella wrote:
> personally, i solved this by applying this patch: https://issues.apache.org/jira/browse/NUTCH-696
> it will kill the hangup threads. this is not ideal, but it will avoid eating up all your memory.
I am using the tagged 1.2 version from SVN.. I thou

Re: Have yet to complete a very large filesystem crawl

2010-08-11 Thread webdev1977
Doğacan Güney-3 wrote:
> On Wed, Aug 11, 2010 at 18:23, webdev1977 wrote:
>> I am using tika... should I not be? The problem is that this shared drive has such a diverse set of documents, I was trying to include as many document ty

Re: Have yet to complete a very large filesystem crawl

2010-08-11 Thread webdev1977
I am using tika... should I not be? The problem is that this shared drive has such a diverse set of documents, I was trying to include as many document types as possible. There are some really, really old office documents that can't be opened by the newer versions of Office. I was having problems in n

Re: Have yet to complete a very large filesystem crawl

2010-08-11 Thread webdev1977
Some more info. Seems to be hung on the MapReduce task. Console output:
finishing thread FetcherThread, activeThreads=9
finishing thread FetcherThread, activeThreads=8
finishing thread FetcherThread, activeThreads=7
activeThreads=7, spinWaiting=0, fetchQueues.totalSize=0
finishing thread Fetch

Re: Have yet to complete a very large filesystem crawl

2010-08-11 Thread webdev1977
That would make sense, but I am pretty sure this is not the issue. In this config, I am running with 1024mb of memory. I kind of thought that nutch was able to run on this amount of memory? It would just take much longer. I tried to run the same crawl using the SMB plugin on a Linux machine wi

Re: Question about plugin protocol-smb

2010-08-05 Thread webdev1977
Well.. this is strange, but one thing that I have found is that you have to end your URL in the seed list with a "/". So for instance: smb:///homes/users/ not smb:///homes/users. Not sure why, but it made a difference for me. In the smb.properties file, you need to set the smb user and pas

Re: Question about plugin protocol-smb

2010-08-05 Thread webdev1977
You also have to put the jcifs library in the jre/lib/ext folder. And you also have to add this to the Nutch crawl script: NUTCH_OPTS="$NUTCH_OPTS -Djava.protocol.handler.pkgs=jcifs" Hope this helps you some!

Re: File System Crawling

2010-07-14 Thread webdev1977
Here is some more info on the slowdown: the SMB protocol plugin and the fetcher seem to be getting caught up on an empty directory listing.

Re: File System Crawling

2010-07-14 Thread webdev1977
Thanks for all your help.. I applied that patch and I also added the property that Brad described. I am not receiving an out of memory error:
Reading content of SMB directory: 19A475BB-A31E-473A-BD05-62FA081F20F7/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread,

Re: File System Crawling

2010-07-13 Thread webdev1977
Of course I just deleted my src directory from the Nutch 1.1 binary distro :-( .. Does this class end up in the nutch-1.1.jar file once compiled? I am just thinking I might download the src again, apply this patch, build the distro, and copy the proper jar over to my working copy. Do you think tha

File System Crawling

2010-07-13 Thread webdev1977
Hello List! I am trying to find a combination of the best settings for topN and depth for running the crawl script on a very large internal filesystem. I have tried setting the depth to a very high number (1000), but I fail to complete the crawl. The main reason for this is the number of "bad"