Re: crawl/index/search

2006-09-24 Thread Richard Braman
requirements and what tools are available. - Original Message - From: Richard Braman [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, September 20, 2006 12:55 PM Subject: Re: crawl/index/search Getting other information out of the page requires parsing. In this case

Re: crawl/index/search

2006-09-19 Thread Richard Braman
Getting other information out of the page requires parsing. In this case you have to come up with some pretty complicated regular expressions, unless the information you want, like the company name, is in the same place on each site. I don't know how to tackle this problem
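Where the target field does sit in a predictable place, even a one-line substitution can pull it out. A minimal sketch, assuming (hypothetically) that the company name always appears in the page's title tag before a " - " separator — the page fragment and pattern below are made up for illustration:

```shell
# Hypothetical page fragment; the pattern assumes the company name sits
# in the <title> tag, before a " - " separator.
page='<html><head><title>Acme Corp - Home</title></head></html>'
echo "$page" | sed -n 's/.*<title>\(.*\) - .*<\/title>.*/\1/p'
```

In practice each site needs its own pattern, which is exactly the maintenance burden described above.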

Re: Stemming and Synonyms

2006-09-19 Thread Richard Braman
I don't think it should be 7.2 before we get some natural language processing, especially if there is public collaboration between the nutch community and the folks at http://opennlp.sourceforge.net/ :-0 Tomi NA wrote: On 9/19/06, Gonçalo Gaiolas [EMAIL PROTECTED] wrote: Hi everyone! I'm using

Re: Nutch running on FC5 - No search results yet

2006-09-19 Thread Richard Braman
. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 19, 2006 5:00 AM To: nutch-user@lucene.apache.org Subject: Nutch running on FC5 - No search results yet Thanks Andrzej, when I installed Fedora Core 5 it had an option for Java development kit

fetch failed with null metadata

2006-09-18 Thread Richard Braman
While fetching I encountered this error in my logs; it seems if there are enough of these, then the fetcher stops. fetch of http://res.alaskacruises.com/travel/cruise/sailplan.rvlx?CruiseItineraryID=69717 failed with: java.lang.IllegalArgumentException: null metadata Any ideas? This is on

Nutch running on FC5 - No search results yet

2006-09-18 Thread Richard Braman
Thanks Andrzej, when I installed Fedora Core 5 it had an option for a Java development kit, which I incorrectly assumed was the JDK. I was able to get the JDK up and running on Fedora using JPackage for Sun Compat. There are some good instructions here for others who may need to get nutch up on FC5.

restarting fetch

2006-08-24 Thread Richard Braman
If you get a stop error in the middle of a fetch, should you refetch the segment, or just do another generate and fetch the newly generated segment? Likewise, if you have an error during index, can you just rerun the index command
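One practical detail when refetching: segment directories are named by a creation timestamp, so the most recently generated one sorts last. A sketch under that assumption — the directory names are made up for the demo, and the actual fetch call is shown only as a comment since it needs a live Nutch install:

```shell
# Demo of picking the newest segment by its timestamp-derived name.
# The demo_segments directories are fabricated for illustration.
mkdir -p demo_segments/20060824113000 demo_segments/20060824120500
latest=$(ls -d demo_segments/* | sort | tail -1)
echo "$latest"
# then, on a real install:  bin/nutch fetch "$latest"
```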

.8 Searching

2006-03-25 Thread Richard Braman
Is searcher.dir still a valid property in nutch-site.xml? I am using nutch on cygwin with Tomcat 5.5, and nutch trunk. I have done a few segments but am getting no results. <property> <name>searcher.dir</name> <value>T:\nutch-trunk\taxcrawl</value> </property> Richard Braman mailto:[EMAIL
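The archive has flattened that property; for anyone copying it, a well-formed nutch-site.xml entry sits inside the configuration wrapper like this (the T:\ path is the poster's own cygwin-side value, kept only as an example — searcher.dir should point at the directory holding the index and segments):

```xml
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>T:\nutch-trunk\taxcrawl</value>
  </property>
</configuration>
```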

RE: .8 Searching

2006-03-25 Thread Richard Braman
, and the instructions are different, maybe we need to have two different Tutorials on the WIKI after all. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Saturday, March 25, 2006 5:33 PM To: nutch-user@lucene.apache.org Subject: .8 Searching Is searcher.dir still valid property

RE: Can't index Japanese PDF

2006-03-22 Thread Richard Braman
I would forward this to [EMAIL PROTECTED] -Original Message- From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 21, 2006 12:23 PM To: nutch-user@lucene.apache.org Subject: Can't index Japanese PDF In my quick experiments, Nutch 0.7.1 (with bundled PDFBox which I

.job file

2006-03-22 Thread Richard Braman
Getting back to nutch after doing some more legwork on pdf parsing, I got nutch from HEAD and built it. I noticed that there is a .job file created by the build. Is this something new in 0.8? Can you run nutch as a scheduled task now? Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002

.08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310) at org.apache.nutch.crawl.Injector.inject(Injector.java:114) at org.apache.nutch.crawl.Injector.main(Injector.java:138) Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org http

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
I am not trying to use hadoop dfs, this is just a single nutcher and a single searcher on a single server configuration. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 22, 2006 11:41 PM To: nutch-user@lucene.apache.org Subject: .08

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
ot the Nutch 0.8 tutorial. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Thursday, March 23, 2006 12:38 AM To: nutch-user@lucene.apache.org; [EMAIL PROTECTED] Subject: RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults: I

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
-Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Thursday, March 23, 2006 1:02 AM To: nutch-user@lucene.apache.org; [EMAIL PROTECTED] Subject: RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults: Hadoop-site.xml from trunk

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
I copied in hadoop-default.xml and mapred-default.xml from hadoop trunk into conf and still get the same error. What input directory is it looking for? -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Thursday, March 23, 2006 1:11 AM To: nutch-user@lucene.apache.org

RE: try to parse pdf

2006-03-13 Thread Richard Braman
That error is actually not from the http content limit, but I would recommend setting the content limit to -1. For some reason this error seems to happen sometimes even after you add the pdf parsing plug-in like you did. I think nutch must cache the plug-in properties in nutch-default. It will
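For reference, the two settings discussed in this thread are normally overridden in nutch-site.xml rather than edited in nutch-default.xml. A sketch — the plugin list shown is illustrative only, since the stock value of plugin.includes differs between versions:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value> <!-- -1 disables truncation of fetched content -->
</property>
```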

FW: about nutch

2006-03-12 Thread Richard Braman
-Original Message- From: Alen [mailto:[EMAIL PROTECTED] Sent: Monday, March 13, 2006 1:42 AM To: rbraman Subject: about nutch Dear rbraman, First of all, thank you for reading this email. I have some problems with nutch; could you give me some advice? I

RE: URL containing ?, and =

2006-03-10 Thread Richard Braman
Whoa! If you want to include all urls, don't change it to +, as that will make all urls with ?= get fetched, ignoring all of your other filters; just comment the line out. -Original Message- From: Vertical Search [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 8:27 AM To: nutch-user
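For anyone finding this thread later: the line in question is the default rule in crawl-urlfilter.txt that skips any URL containing query characters. The exact pattern varies by version, but the commented-out form looks roughly like this:

```text
# skip URLs containing certain characters (commented out so that
# ?name=value URLs are fetched, subject to the remaining filters)
# -[?*!@=]
```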

RE: URL containing ?, and =

2006-03-10 Thread Richard Braman
Just to be clear, what Marko said [EMAIL PROTECTED] is correct. Comment the line out. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 8:50 AM To: nutch-user@lucene.apache.org Subject: RE: URL containing ?, and = Whoa! If you want

RE: Indexing a web site over HTTPS using username/passwd

2006-03-09 Thread Richard Braman
I don't know, Dan, but it's something on my list too. I kind of doubt that this is a feature in nutch, because generally this is thought of as a specialized intelligent agent (IA) capability rather than more general spidering/indexing technology. Certainly it is possible to do, but there are two

RE: writing a metadata content tag:use case example

2006-03-09 Thread Richard Braman
I am following this thread as I have a similar issue to deal with in my coming developments. Howie thanks for your insights into this as I think this may solve my problem. I am trying to index Title 26 of the US Code http://www.access.gpo.gov/uscode/title26/title26.html The problem is I don't

RE: project vitality? / less documentation is more!

2006-03-07 Thread Richard Braman
+1 -Original Message- From: Franz Werfel [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 3:01 AM To: nutch-user@lucene.apache.org Subject: Re: project vitality? / less documentation is more! Hello, Just my 2 cents: the Intranet crawl functionality is VERY confusing. If it

RE: project vitality? / less documentation is more!

2006-03-07 Thread Richard Braman
+1 -Original Message- From: Franz Werfel [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 07, 2006 10:11 AM To: nutch-user@lucene.apache.org Subject: Re: project vitality? / less documentation is more! Hello, single site crawling wouldn't address the confusion that results from the fact

retry later

2006-03-07 Thread Richard Braman
maintained somewhere? Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org Free Open Source Tax Software

still not so clear to me

2006-03-07 Thread Richard Braman
on the seed urls, then on the links found on that page (for each subsequent iteration), then on the links on those pages, and so on until the entire domain is crawled, if you limit the domains with a filter. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http

going deeper, lost segment

2006-03-05 Thread Richard Braman
close. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org Free Open Source Tax Software

RE: NullPointerException

2006-03-05 Thread Richard Braman
-string] in shell/cmd? I guess it works.. /Jack You fetched one single website, I think On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: Gentlemen: On 05/03/06, Richard Braman [EMAIL PROTECTED] wrote: This sounds like your crawl didn't get anything. I have seen that happen when the url

RE: [Nutch-general] Re: project vitality?

2006-03-05 Thread Richard Braman
I'll take part in your forum. Just added first post. -Original Message- From: Greg Boulter [mailto:[EMAIL PROTECTED] Sent: Sunday, March 05, 2006 6:33 PM To: nutch-user@lucene.apache.org Subject: Re: [Nutch-general] Re: project vitality? Hello again. OK - first of all I hate mailing

RE: how can i go deep?

2006-03-05 Thread Richard Braman
'nutch-site.xml' (these configuration parameters do things like include the crawl-urlfilter.txt, pays attention to internal links, tries to not kill your host, and so on...) Steven Richard Braman wrote: Stefan, I think I know what you're saying. When you are new to nutch and you read

RE: project vitality?

2006-03-04 Thread Richard Braman
release or at least an RC before releasing 1.0. I am a newbie, so let me know about ideas on releasing 0.8. Thanks Sudhi Doug Cutting [EMAIL PROTECTED] wrote: Richard Braman wrote: I think it is still very much at proof of concept stage. I think it is close, but as you have

RE: project vitality?

2006-03-04 Thread Richard Braman
with the way Mr. Richard Braman expresses his views. I have tried Nutch since version 0.3 and I could not make the 0.8 release work (Nutch is becoming a little bit complicated, with all that map reduce, hadoop, and so on, that I can't deal with). I understand, however, that if a product is not finished

RE: how can i go deep?

2006-03-04 Thread Richard Braman
Try using depth=n when you do the crawl. Post crawl I don't know, but I have the same question. How to make the index go deeper when you do your next round of fetching is still something I haven't figured out. -Original Message- From: Peter Swoboda [mailto:[EMAIL PROTECTED] Sent:
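For the first part of that answer, the depth flag belongs to the one-step crawl command. A sketch of the invocation as it appeared in the 0.7-era tutorial — the file and directory names are placeholders:

```text
bin/nutch crawl urls.txt -dir crawldir -depth 5
# each additional unit of -depth fetches one more round of links
# out from the seed urls
```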

Moving tutorial link to wiki

2006-03-04 Thread Richard Braman
Maybe we should move the tutorial to the wiki so it can be commented on. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org Free Open Source Tax Software

RE: project vitality?

2006-03-04 Thread Richard Braman
PROTECTED] Sent: Saturday, March 04, 2006 10:54 AM To: nutch-user@incubator.apache.org Subject: RE: project vitality? I really can not agree with the way Mr. Richard Braman expresses his views. I have tried Nutch since version 0.3 and I could not make the 0.8 release work (Nutch

RE: project vitality?

2006-03-04 Thread Richard Braman
I really do think nutch is great, but I echo Matthias's comments that the community needs to come together and contribute more back. And that comes with the requirement of making sure volunteers are given access to make their contributions part of the project. Also, if you use nutch you should

RE: how can i go deep?

2006-03-04 Thread Richard Braman
, try the whole web tutorial but change the url filter in a manner that it only crawls your webpage. This will go as deep as the number of iterations you run. Stefan In case you like to Am Mar 4, 2006 um 9:09 PM schrieb Richard Braman: Try using depth=n when you do the crawl. Post crawl I don't know

RE: url shown instead of title.

2006-03-04 Thread Richard Braman
the title of the page is not shown, but instead the url is shown. The pages do have titles. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Saturday, March 04, 2006 8:43 PM To: 'nutch-user@lucene.apache.org' Subject: Any idea why Richard Braman mailto:[EMAIL

RE: project vitality?

2006-03-03 Thread Richard Braman
I think it is still very much at proof of concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. I have tried to get the tutorial and faqs updated, but I haven't heard back. -Original

RE: truncation despite 0

2006-03-02 Thread Richard Braman
@lucene.apache.org Subject: Re: truncation despite 0 I had my truncation problem resolved by setting them to -1. I also set indexer.max.tokens to a larger number 5. Jay Jiang Richard Braman wrote: I am still getting content truncated even though I set the size to 0 (no truncate) for http, ftp and file
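The setting mentioned in that reply lives in nutch-site.xml. A sketch — the value below is a placeholder, since the number in the archived message is truncated:

```xml
<property>
  <name>indexer.max.tokens</name>
  <value>50000</value> <!-- placeholder value; raise from the default for long documents -->
</property>
```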

Index aborted crawl.

2006-02-28 Thread Richard Braman
, but if it's impossible, it's impossible. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org Free Open Source Tax Software

Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Richard Braman
use. private String formatDate(Calendar date) { String retval = null; if(date != null) { SimpleDateFormat formatter = new SimpleDateFormat(); retval = formatter.format(date.getTime()); } return retval; } } Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http

PDF Parse Error

2006-02-28 Thread Richard Braman
(2,0): Can't be handled as pdf document. java.io.IOException: You do not have permission to extract text I have a number of errors like this in my log, mostly the content truncated one. The thing is, these files all open fine in Acrobat. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002

RE: urlfilter-db plugin usage...

2006-02-28 Thread Richard Braman
I was wondering... If instead, I did a whole web crawl using the full dmoz content file, but filtered it using the urlfilter-db plugin, using my 14k urls in mysql, would I obtain similar results? My gut tells me this has to be slower. I would put the urls in the url db, the less urls you

truncation despite 0

2006-02-28 Thread Richard Braman
I am still getting content truncated even though I set the size to 0 (no truncate) for http, ftp and file. Some of them are getting truncated at 2602 bytes. Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voice) http://www.taxcodesoftware.org Free Open

RE: nutch-extensionpoints 0.71

2006-02-27 Thread Richard Braman
duplicates. -Original Message- From: Hasan Diwan [mailto:[EMAIL PROTECTED] Sent: Monday, February 27, 2006 6:45 PM To: nutch-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: nutch-extensionpoints 0.71 Mr. Braman (or anyone else): On 27/02/06, Richard Braman [EMAIL PROTECTED

injecting new urls

2006-02-26 Thread Richard Braman
in the url filter. Are these commands what I should do? bin/nutch inject db newurls bin/nutch generate db segments bin/nutch fetch segments/latest_segment Should I remove irs.gov from the filter so it doesn't get done again, because some of the other new urls link back to irs.gov? Richard
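Those three commands match the 0.7-era whole-web cycle, with one step missing: updatedb, which folds the fetch results back into the web db. A sketch — the segment name is a placeholder, and the note about refetching reflects my understanding of the web db rather than anything stated in the thread:

```text
bin/nutch inject db newurls             # add the new seed urls to the web db
bin/nutch generate db segments          # write a fetchlist into a new segment
bin/nutch fetch segments/<latest>       # fetch that segment
bin/nutch updatedb db segments/<latest> # fold the results back into the db
```

If the db already records irs.gov pages as fetched, generate should not list them again until their refetch interval expires, so the filter arguably need not change just for that.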