Re: Why did my crawl fail?

2009-07-27 Thread Paul Tomblin
Unfortunately I blew away those particular logs when I fetched the svn
trunk.  I just tried it again (well, I started it again at noon and it just
finished) and this time it worked fine, so it seems kind of heisenbug-like.
Maybe it has something to do with whether some of the pages are of types it can't handle?
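
For anyone who hits the same InvalidInputException below: the LinkDb step reads
parse_data from every segment under crawl.blog/segments, so a single segment left
over from an earlier aborted run is enough to kill the job.  A small throwaway
helper along these lines (my own sketch, not part of Nutch; the default path just
mirrors the layout in the trace) can list the incomplete segments so they can be
deleted or re-parsed before invertlinks runs:

import java.io.File;

/** Sketch: list crawl segments that are missing a parse_data directory. */
public class FindIncompleteSegments {
  public static void main(String[] args) {
    // Default path mirrors the crawl layout quoted below; pass another path to override.
    File segments = new File(args.length > 0 ? args[0] : "crawl.blog/segments");
    File[] dirs = segments.listFiles();
    if (dirs == null) {
      System.err.println("No such directory: " + segments);
      return;
    }
    for (File seg : dirs) {
      if (seg.isDirectory() && !new File(seg, "parse_data").exists()) {
        System.out.println("incomplete segment: " + seg.getPath());
      }
    }
  }
}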

On Mon, Jul 27, 2009 at 11:27 AM, xiao yang  wrote:

> Hi, Paul
>
> Can you post the error messages in the log file
> (file:/Users/ptomblin/nutch-1.0/logs)?
>
> On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin wrote:
> > Actually, I got that error the first time I used it, and then again when I
> > blew away the downloaded nutch and grabbed the latest trunk from Subversion.
> >
> > On Mon, Jul 27, 2009 at 1:11 AM, xiao yang wrote:
> >
> >> You must have crawled several times, and some of those crawls failed
> >> before the parse phase, so the parse data was not generated.
> >> You'd better delete the whole directory
> >> file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the
> >> output will show you the exact reason why it failed in the parse phase.
> >>
> >> Xiao
> >>
> >> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin wrote:
> >> > I installed nutch 1.0 on my laptop last night and set it running to crawl my
> >> > blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
> >> > it was still running strong when I went to bed several hours later, and this
> >> > morning I woke up to this:
> >> >
> >> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> > -activeThreads=0
> >> > Fetcher: done
> >> > CrawlDb update: starting
> >> > CrawlDb update: db: crawl.blog/crawldb
> >> > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> >> > CrawlDb update: additions allowed: true
> >> > CrawlDb update: URL normalizing: true
> >> > CrawlDb update: URL filtering: true
> >> > CrawlDb update: Merging segment data into db.
> >> > CrawlDb update: done
> >> > LinkDb: starting
> >> > LinkDb: linkdb: crawl.blog/linkdb
> >> > LinkDb: URL normalize: true
> >> > LinkDb: URL filter: true
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> >> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> >> > Input path does not exist:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> >> > at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >> > at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >> > at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> >> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> >> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> >> >
> >> >
> >> > --
> >> > http://www.linkedin.com/in/paultomblin
> >> >
> >>
> >
> >
> >
> > --
> > http://www.linkedin.com/in/paultomblin
> >
>



-- 
http://www.linkedin.com/in/paultomblin


Support needed

2009-07-27 Thread sf30098

I need someone with substantial knowledge of Nutch, Java and Lucene who has
customised the system before. In particular, the work relates to image
indexing and geo-positioning (ideally both, but experience with either one is
fine as well).

The role will involve providing support and advice on how to go about
implementing such a system.

This includes:
1. answering questions and providing guidance during implementation
2. reviewing code and suggesting improvements.

Please let me know if you're interested.
-- 
View this message in context: 
http://www.nabble.com/Support-needed-tp24688172p24688172.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Using Nutch (w/custom plugin) to crawl vs. custom Lucene app

2009-07-27 Thread ohaya
Hi,

I've been familiarizing myself with Nutch in preparation for putting together 
a proof-of-concept (POC) that we want to build.  Basically, we have some files of a 
proprietary file type, and we want to be able to search on specific "fields" 
within these files.  The files are physically stored on the local filesystem.

Thus far, I've gotten an initial Nutch instance working, and also a second Nutch 
instance configured for crawling the local filesystem.  These test instances 
just use the out-of-the-box Nutch and Nutch plugins, e.g., the PDF plugin, to 
allow me to get familiar with the Nutch software.

Having done that, my original idea was to write some Nutch plugins that could 
be used with a Nutch crawl.

However, we already have some previously built apps that basically "crawl" the 
local filesystem (i.e., they do a recursive directory search) and find all of 
these files.  These are Java apps that we previously built for various purposes.

So, I'm wondering if it might make more sense (and, I think, be easier) to take 
one of those existing apps and basically just enhance it to build Lucene 
indexes, which could then be used by the Nutch web app (as a web-based search 
front end)?
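
(To make that option concrete, here is a very rough sketch of what "enhance an
existing app to write a Lucene index" could look like, using the Lucene 2.x-era
API that ships with Nutch 1.0.  The paths and field names are placeholders, and
whether a hand-built index by itself is enough for the Nutch web app, which, as
far as I understand, also reads segment data for things like hit summaries, is
exactly the open question I'm asking below.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class ProprietaryFileIndexer {
  public static void main(String[] args) throws Exception {
    // "myindex" is a placeholder output directory for the index.
    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("myindex"),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    // In the real app these would come from the recursive directory walk
    // and from whatever parses the proprietary format.
    String url = "file:///data/reports/example.dat";              // placeholder
    String extractedText = "text pulled out of the proprietary file";

    Document doc = new Document();
    doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", extractedText, Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    writer.optimize();
    writer.close();
  }
}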

As I said, I'm really new to Nutch, and also to Lucene, but from what I've 
researched so far, it *looks like* it'd be fairly easy to extend some of our 
existing apps to generate Lucene indexes, and I have some questions:

- If my custom Java app can be extended to "just" build indexes using Lucene, 
is that all it needs to do in order for those indexes to work with the Nutch 
web app?

- Am I underestimating the effort needed to build the Lucene indexes that the 
Nutch web app could use?

I was wondering if anyone here has had to go through a similar situation 
(Nutch plugin for a custom file type vs. a custom crawl app that builds Lucene 
indexes the Nutch web app can use)?

Any other thoughts on all of this from the Nutch/Lucene experts here would be 
greatly appreciated!

Thanks,
Jim



Re: question

2009-07-27 Thread reinhard schwab
i believe it can.
check your configuration files, nutch-site.xml and nutch-default.xml.

you will find something like


<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|swf|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>


add "msword" to the parsers:
change
parse-(text|html|swf|pdf)
to
parse-(text|html|swf|pdf|msword)

there is a plugin in the plugins folder
which parses ms word documents:
parse-msword

i have not tried it so far.

Jair Piedrahita Vargas wrote:
> Can Nutch search inside the content of an msword file? I've tried, but it 
> says "parser not found for contentType=application/msword"
> What can I do to correct this error?
>
> Thanks
>
> JAIR PIEDRAHITA VARGAS
> Gerencia de Investigación y Nuevas Tecnologías
> Teléfono: 404   Ext 41632
> Av. los Industriales Cra 48 # 26-85 piso 6B
> BANCOLOMBIA S.A
>
>



question

2009-07-27 Thread Jair Piedrahita Vargas
Can Nutch search inside the content of an msword file? I've tried, but it says 
"parser not found for contentType=application/msword"
What can I do to correct this error?

Thanks

JAIR PIEDRAHITA VARGAS
Gerencia de Investigación y Nuevas Tecnologías
Teléfono: 404   Ext 41632
Av. los Industriales Cra 48 # 26-85 piso 6B
BANCOLOMBIA S.A



This communication (including all attachments) may contain information that is 
private, confidential and privileged. If you have received this communication 
in error; please notify the sender immediately, delete this communication from 
all data storage devices and destroy all hard copies. Any use, dissemination, 
distribution, copying or disclosure of this message and any attachments, in 
whole or in part, by anyone other than the intended recipient(s) is strictly 
prohibited. This message has been checked with an antivirus software; 
accordingly, the sender is not liable for the presence of any virus in 
attachments that causes or may cause damage to the recipient's equipment or 
software.


Re: Why did my crawl fail?

2009-07-27 Thread xiao yang
Hi, Paul

Can you post the error messages in the log file
(file:/Users/ptomblin/nutch-1.0/logs)?

On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin wrote:
> Actually, I got that error the first time I used it, and then again when I
> blew away the downloaded nutch and grabbed the latest trunk from Subversion.
>
> On Mon, Jul 27, 2009 at 1:11 AM, xiao yang  wrote:
>
>> You must have crawled several times, and some of those crawls failed
>> before the parse phase, so the parse data was not generated.
>> You'd better delete the whole directory
>> file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the
>> output will show you the exact reason why it failed in the parse phase.
>>
>> Xiao
>>
>> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin wrote:
>> > I installed nutch 1.0 on my laptop last night and set it running to crawl my
>> > blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
>> > it was still running strong when I went to bed several hours later, and this
>> > morning I woke up to this:
>> >
>> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> > -activeThreads=0
>> > Fetcher: done
>> > CrawlDb update: starting
>> > CrawlDb update: db: crawl.blog/crawldb
>> > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
>> > CrawlDb update: additions allowed: true
>> > CrawlDb update: URL normalizing: true
>> > CrawlDb update: URL filtering: true
>> > CrawlDb update: Merging segment data into db.
>> > CrawlDb update: done
>> > LinkDb: starting
>> > LinkDb: linkdb: crawl.blog/linkdb
>> > LinkDb: URL normalize: true
>> > LinkDb: URL filter: true
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
>> > LinkDb: adding segment:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> >> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> >> > Input path does not exist:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> >> > at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >> > at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >> > at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>> >
>> >
>> > --
>> > http://www.linkedin.com/in/paultomblin
>> >
>>
>
>
>
> --
> http://www.linkedin.com/in/paultomblin
>


Re: Nutch crawling status

2009-07-27 Thread caezar

I've found the script here:
http://wiki.apache.org/nutch/MonitoringNutchCrawls. But I'm not sure how I
can use it when Hadoop runs on a farm of 15 machines. Maybe I should somehow
use the Hadoop tasktracker instead of this script?
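
Something along these lines is what I have in mind, as a rough, untested sketch
against the old org.apache.hadoop.mapred client API that Nutch 1.0 bundles (the
jobtracker address below is a placeholder); the JobTracker web UI (usually on
port 50030) would be the zero-code alternative:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class CrawlJobStatus {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // Placeholder: point this at your JobTracker (host:port from your Hadoop config).
    conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
    JobClient client = new JobClient(conf);

    // Jobs that are still running or queued; the fetch/parse/updatedb phases
    // show up here under their job names.
    JobStatus[] pending = client.jobsToComplete();
    for (JobStatus status : pending) {
      RunningJob job = client.getJob(status.getJobID());
      if (job == null) continue;
      System.out.println(job.getJobName()
          + "  map " + (int) (job.mapProgress() * 100) + "%"
          + "  reduce " + (int) (job.reduceProgress() * 100) + "%");
    }
  }
}

Fetched-URL and error counts would presumably be a matter of reading
job.getCounters() for the fetch job, but I have not checked which counter
names the Nutch fetcher publishes.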

caezar wrote:
> 
> Hi All,
> 
> Is there a way to retrieve the Nutch crawling status at runtime? Let me
> describe what I mean: for instance, if the fetch job is currently running, I
> want to see that fetch is running, how many URLs have already been fetched,
> and how many errors have occurred. A Hadoop farm is used.
> 
> Thanks for any ideas.
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-crawling-status-tp24681707p24681949.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Nutch crawling status

2009-07-27 Thread caezar

Hi All,

Is there a way to retrieve the Nutch crawling status at runtime? Let me
describe what I mean: for instance, if the fetch job is currently running, I
want to see that fetch is running, how many URLs have already been fetched,
and how many errors have occurred. A Hadoop farm is used.

Thanks for any ideas.
-- 
View this message in context: 
http://www.nabble.com/Nutch-crawling-status-tp24681707p24681707.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: How to index other fields in solr

2009-07-27 Thread Doğacan Güney
On Mon, Jul 27, 2009 at 09:34, Saurabh Suman wrote:
>
> I am using Solr for searching. I used the class SolrIndexer, but I can
> search on content only. I want to search on author also. How do I index on author?

You need to write your own query plugin. Take a look at query-basic
plugin under src/plugin.
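
If it helps to see the shape of it: on the indexing side (getting an "author"
field into the documents that SolrIndexer sends), the usual hook is an
indexing-filter plugin. Below is a rough, untested sketch; the method
signatures are from memory of the 1.0 sources, so check src/plugin/index-basic
for the authoritative interface. The "author" parse-metadata key is
hypothetical (your parser has to put it there), and for Solr the field also has
to be added to schema.xml.

// Rough sketch only -- signatures as I recall them from Nutch 1.0;
// compare with src/plugin/index-basic before relying on this.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class AuthorIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // "author" is a hypothetical metadata key your parse plugin would set;
    // adjust it to whatever your parser actually produces.
    String author = parse.getData().getParseMeta().get("author");
    if (author != null) {
      doc.add("author", author);
    }
    return doc;
  }

  public void addIndexBackendOptions(Configuration conf) {
    // For the Lucene backend you would register field options for "author"
    // here (see index-basic for examples); for Solr the field only needs
    // to exist in schema.xml.
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}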

> --
> View this message in context: 
> http://www.nabble.com/How-to-index-other-fields-in-solr-tp24674208p24674208.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>



-- 
Doğacan Güney


Re: Why did my crawl fail?

2009-07-27 Thread Paul Tomblin
Actually, I got that error the first time I used it, and then again when I
blew away the downloaded nutch and grabbed the latest trunk from Subversion.

On Mon, Jul 27, 2009 at 1:11 AM, xiao yang  wrote:

> You must have crawled several times, and some of those crawls failed
> before the parse phase, so the parse data was not generated.
> You'd better delete the whole directory
> file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the
> output will show you the exact reason why it failed in the parse phase.
>
> Xiao
>
> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin wrote:
> > I installed nutch 1.0 on my laptop last night and set it running to crawl my
> > blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
> > it was still running strong when I went to bed several hours later, and this
> > morning I woke up to this:
> >
> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl.blog/crawldb
> > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > LinkDb: starting
> > LinkDb: linkdb: crawl.blog/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> > LinkDb: adding segment:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> > Input path does not exist:
> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> > at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> > at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> > at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> >
> >
> > --
> > http://www.linkedin.com/in/paultomblin
> >
>



-- 
http://www.linkedin.com/in/paultomblin


Re: How to index other fields in solr

2009-07-27 Thread Paul Tomblin
Wouldn't that be using facets, as per
http://wiki.apache.org/solr/SimpleFacetParameters ?
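
To make that concrete, a SolrJ query against a hypothetical "author" field
might look like the sketch below. It assumes the field has been added to the
Solr schema and actually gets populated at index time, which, as far as I know,
Nutch won't do for you out of the box.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AuthorSearch {
  public static void main(String[] args) throws Exception {
    // Placeholder Solr URL; the "author" field must exist in schema.xml
    // and be filled at index time for any of this to return results.
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("content:nutch");
    query.addFilterQuery("author:\"Jane Doe\"");   // restrict results by author
    query.setFacet(true);                          // and/or facet on the field
    query.addFacetField("author");

    QueryResponse response = server.query(query);
    System.out.println("hits: " + response.getResults().getNumFound());
  }
}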


On Mon, Jul 27, 2009 at 2:34 AM, Saurabh Suman wrote:

>
> I am using Solr for searching. I used the class SolrIndexer, but I can
> search on content only. I want to search on author also. How do I index on author?
> --
> View this message in context:
> http://www.nabble.com/How-to-index-other-fields-in-solr-tp24674208p24674208.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
http://www.linkedin.com/in/paultomblin


Re: crawl-tool.xml

2009-07-27 Thread reinhard schwab
it's not only confusing me,
it's also confusing the author, FrankMcCown, of the nutch tutorial:

http://wiki.apache.org/nutch/NutchTutorial


Crawl Command: Configuration

To configure things for the crawl command you must:

  * Create a directory with a flat file of root urls. For example, to
    crawl the nutch site you might start with a file named urls/nutch
    containing the url of just the Nutch home page. All other Nutch
    pages should be reachable from this page. The urls/nutch file
    would thus contain:

      http://lucene.apache.org/nutch/

  * Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
    with the name of the domain you wish to crawl. For example, if you
    wished to limit the crawl to the apache.org domain, the line
    should read:

      +^http://([a-z0-9]*\.)*apache.org/

    This will include any url in the domain apache.org.

  * Until someone can explain this... When I use the file
    crawl-urlfilter.txt the filter doesn't work; instead of it, use the file
    conf/regex-urlfilter.txt and change the last line from "+." to "-."


reinhard schwab wrote:
> i have tried the recrawl script from susam pal and wondered why
> url filtering no longer works.
> http://wiki.apache.org/nutch/Crawl
>
> the mystery is
>
> only Crawl.java adds crawl-tool.xml to the NutchConfiguration.
>
> Configuration conf = NutchConfiguration.create();
> conf.addResource("crawl-tool.xml");
>
> Fetcher.java and all the other tools which filter the outlinks do not
> add this.
> this is really confusing me and i have spent some time to figure this out.
>
> regards
> reinhard