RE: Document Classification - indexing question

2007-05-08 Thread Armel T. Nene
Bastian,

When trying to classify documents using the dynamic classification approach,
Nutch can take a while to parse the data, depending on the file type. While
working with Nutch I have encountered some null pointer exceptions during
parsing. These were due to a Hadoop setting that is not exposed in the
nutch-default.xml file. The setting should allow you to increase the time
Hadoop waits before marking a process as inactive.
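
If memory serves, the relevant property lives in hadoop-site.xml and looks
something like this (the name and value are from memory, so double-check
them against your hadoop-default.xml):

<property>
  <name>mapred.task.timeout</name>
  <!-- milliseconds a task may be silent before being marked as failed; example value -->
  <value>1200000</value>
</property>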

Some questions you should investigate: how will your classification process
handle failed parses, and what happens if the data is not parsed into a text
format (i.e. an unsupported file type)? What happens to the index being
created if the classification fails; does it get corrupted? In a
multithreaded environment such as Nutch, what happens to concurrent
classification processes; can data get mixed up? I have a problem with Nutch
now: it seems unable to generate dynamic fields based on documents while
using more than a single thread. The index becomes corrupted with data from
different files mixed into the wrong Lucene document. There are many other
questions once you start to work on your classification project.

Best regards

Armel

-Original Message-
From: Bastian Preindl [mailto:[EMAIL PROTECTED] 
Sent: 08 May 2007 13:38
To: nutch-dev@lucene.apache.org
Subject: Re: Document Classification - indexing question

Hi Armel,

thanks for your quick reply!

 I have been working on a similar project for the last couple of months but
 I am taking a slightly different approach. Because fetching - parsing -
 indexing can be time consuming and in my case, I also need the unclassified
 indexes. Using a classification algorithm and the Lucene API, I build
 classified indexes by using the first index as corpus.

This is definitely a good idea and a somewhat different approach, as it moves
the classification task out of Nutch and into Lucene. Are there any
frameworks/plugins already available for applying document classification
within Lucene? The much faster parsing and indexing within Nutch when no
online classification takes place stands against the disk space consumption,
which is some thousand times greater when indexing all parsed documents
instead of only the positively classified ones.
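
Just to make sure I understand the idea, here is a minimal sketch of what I
think you mean, using plain Lucene (the Classifier class and the stored
"content" field are assumptions on my part):

// Hypothetical sketch: build a classified index from an existing Lucene
// index used as corpus; Classifier stands in for any classification algorithm.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class ClassifyIndex {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");      // unclassified corpus
    IndexWriter writer = new IndexWriter("crawl/index-classified",
        new StandardAnalyzer(), true);                         // new classified index
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;
      Document doc = reader.document(i);
      String text = doc.get("content");                        // assumes content is stored
      if (text != null && Classifier.isRelevant(text)) {       // hypothetical classifier
        writer.addDocument(doc);                               // keep only the positives
      }
    }
    writer.optimize();
    writer.close();
    reader.close();
  }
}

Is that roughly it?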

 Maybe we should discuss together on Skype or MSN; let me know. My skype is
 etapix.

That would be really nice, thanks for the offer! I'll let you know my
MSN number after I've created an account.

Best regards

Bastian





Nutch ERROR parse.OutlinkExtractor - getOutlinks

2007-04-17 Thread Armel T. Nene
Hi guys,

 

I have been running successfully recently with most of the plug-ins enabled.
Lately, I have been trying to index some xml files which contain strings of
the form ftawi:xyz.

 

Nutch version 0.8.2-dev on MS Windows Server 2003

 

During outlink extraction I get the following errors:

 

2007-04-17 21:52:51,598 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: ftawi
        at java.net.URL.<init>(Unknown Source)
        at java.net.URL.<init>(Unknown Source)
        at java.net.URL.<init>(Unknown Source)
        at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
        at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
        at org.apache.nutch.parse.stellent.StellentParser.getParse(StellentParser.java:53)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)

 

I get the same error with all the parser plug-ins when running over the same
xml files. Can you let me know if there is a way, using regular expressions,
to tell the application which kinds of urls should be included in the
outlinks? Also, Nutch should not crash if the url in an outlink is not
valid. Is there any other HTML parser in Nutch that I can try?
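
What I have in mind is something along these lines (a sketch only, not a
patch against the actual OutlinkExtractor):

// Sketch: skip outlink candidates whose protocol java.net.URL cannot handle,
// instead of letting the whole parse fail.
import java.net.MalformedURLException;
import java.net.URL;

public class SafeOutlinks {
  /** Returns true if the candidate string parses as a URL with a known protocol. */
  public static boolean isValidUrl(String candidate) {
    try {
      new URL(candidate);   // throws for unknown protocols such as ftawi:
      return true;
    } catch (MalformedURLException e) {
      return false;         // log and drop the outlink instead of aborting
    }
  }
}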

 

Awaiting your kind reply.

 

Regards,

 

Armel

 

===

Armel T. Nene

iDNA Solutions LTD

Tel: +44 (20) 7257 6124

Mobile: +44 (7886)950 483 

Web: http://www.idna-solutions.com

Blog: http://blog.idna-solutions.com

 



Nutch java.io.exception

2007-04-10 Thread Armel T. Nene
crawl.Injector - Injector: done

2007-04-05 16:35:34,439 INFO  crawl.Generator - topN: 100

2007-04-05 16:35:34,439 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:67)
        at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
        at org.apache.nutch.crawl.Generator.main(Generator.java:416)
        at com.idna.nutch.launcher.CrawlerManager.autoGenSegList(CrawlerManager.java:80)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:211)

2007-04-05 16:35:34,443 INFO  conf.Configuration - parsing jar:file:/E:/iDna-nutch-RC1/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/hadoop-default.xml
2007-04-05 16:35:34,450 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-default.xml
2007-04-05 16:35:34,462 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-site.xml
2007-04-05 16:35:34,468 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/hadoop-site.xml
2007-04-05 16:35:35,470 INFO  crawl.Generator - Generator: starting
2007-04-05 16:35:35,470 INFO  crawl.Generator - Generator: segment: test/segments/20070405163535
2007-04-05 16:35:35,470 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.

2007-04-05 16:35:35,471 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:86)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:97)
        at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:26)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:309)
        at org.apache.nutch.crawl.Generator.main(Generator.java:417)
        at com.idna.nutch.launcher.CrawlerManager.autoGenSegList(CrawlerManager.java:80)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:211)

 

===

Armel T. Nene

iDNA Solutions LTD

Tel: +44 (20) 7257 6124

Mobile: +44 (7886)950 483 

Web: http://www.idna-solutions.com

Blog: http://blog.idna-solutions.com

 



RE: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-13 Thread Armel T. Nene
Dennis

I was wondering if this patch could fix my problem, which is, if not the
same, very similar to this one. I am using Nutch 0.8.2-dev; I made a
checkout from SVN a while ago but never updated again. I was able to crawl
1 xml files before with no errors whatsoever. These are the errors that I
get when I'm fetching:

INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught: java.lang.NullPointerException

One of the problems is that my hadoop version says the following:
hadoop-0.4.0-patched. I don't know if that means I am running the 0.4.0
version; it seems a little bit confusing. Once you clarify that for me, I
will be able to apply the patch to my version.

Best Regards,

Armel

-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: 13 February 2007 21:09
To: nutch-dev@lucene.apache.org
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

Actually I take it back.  I don't think it is the same problem but I do 
think it is the right solution.

Dennis Kubes

Dennis Kubes wrote:
 This has to do with HADOOP-964.  Replace the jar files in your Nutch 
 versions with the most recent versions from Hadoop.  You will also need 
 to apply NUTCH-437 patch to get Nutch to work with the most recent 
 changes to the Hadoop codebase.
 
 Dennis Kubes
 
 Gal Nitzan wrote:
 Hi,

 Does anybody use Nutch trunk?

 I am running nutch 0.9 and unable to fetch.

 after 50-60K urls I get NPE in
 org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time.

 I was wondering if anyone has a workaround, or maybe something is wrong
 with my setup.

 I have opened a new issue in jira
 http://issues.apache.org/jira/browse/hadoop-1008 for this.

 Any clue?

 Gal






Nutch error messages

2007-02-06 Thread Armel T. Nene
Hi guys,

I wrote a parser for parsing proprietary file formats. The plugin used to work 
until recently. Now when I try to parse simple CAD files I get the following 
error messages:

INFO  fetcher.Fetcher - fetching file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/(00)E9~161394764(1).PDF
WARN  fetcher.Fetcher - Error parsing: file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/(00)E9~161394764(1).PDF: failed(2,200): java.lang.NullPointerException

There are some debug lines in the parser, but they don't get logged in the
log file. Also, when I set the log level to DEBUG, I get the following messages:

INFO  fetcher.Fetcher - fetching file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg
DEBUG file.File - fetching file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg
DEBUG parse.ParserFactory - Could not clean the content-type [], Reason is [org.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty]. Using its raw version...
DEBUG parse.ParserFactory - ParserFactory: No parse plugins mapped or enabled for contentType
DEBUG parse.ParseUtil - Parsing [file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg] with [EMAIL PROTECTED]
WARN  fetcher.Fetcher - Error parsing: file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg: failed(2,200): java.lang.NullPointerException

If anybody can make sense of the errors, please guide me on this. Also, I
have disabled most of Nutch's parsers in favour of my custom parser, as it
can parse many formats. I am awaiting any help from the community.

Regards,

Armel



RE: Fetcher2

2007-01-25 Thread Armel T. Nene
Kauu,

The url for Fetcher2 is: https://issues.apache.org/jira/browse/NUTCH-339

Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com
-Original Message-
From: kauu [mailto:[EMAIL PROTECTED] 
Sent: 25 January 2007 09:31
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

please give us the url, thx

On 1/25/07, chee wu [EMAIL PROTECTED] wrote:

 Just appended the portion for .81  to NUTCH-339

 - Original Message -
 From: Armel T. Nene [EMAIL PROTECTED]
 To: nutch-dev@lucene.apache.org
 Sent: Thursday, January 25, 2007 8:06 AM
 Subject: RE: Fetcher2


  Chee,
 
  Can you make the code available through Jira.
 
  Thanks,
 
  Armel
 
  -
  Armel T. Nene
  iDNA Solutions
  Tel: +44 (207) 257 6124
  Mobile: +44 (788) 695 0483
  http://blog.idna-solutions.com
 
  -Original Message-
  From: chee wu [mailto:[EMAIL PROTECTED]
  Sent: 24 January 2007 03:59
  To: nutch-dev@lucene.apache.org
  Subject: Re: Fetcher2
 
  Thanks! I successfully ported Fetcher2 to Nutch .81, it's pretty easy... I
  can share the code, if anyone wants to use it...
  - Original Message -
  From: Andrzej Bialecki [EMAIL PROTECTED]
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 23, 2007 12:09 AM
  Subject: Re: Fetcher2
 
 
  chee wu wrote:
  Fetcher2 should be a great help for me, but it seems it can't integrate
  with Nutch81. Any advice on how to use it based on .81?
 
 
  You would have to port it to Nutch 0.8.1 - e.g. change all Text
  occurrences to UTF8, and most likely make other changes too ...
 
  --
  Best regards,
  Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
 
 
 
 
 




-- 
www.babatu.com



Modified date in crawldb

2007-01-25 Thread Armel T. Nene
Hi guys,

 

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not
actually save the last modified date of files. I have run a crawl on my
local file system and on the web. When I dumped the content of the crawldb
for both crawls, the modified date of the files was set to 01-Jan-1970
01:00:00. I don't know if it's intended to be that way or if it's a bug.
Therefore my question is:

 

* How does the generator know which file to crawl again?

o Is it looking at the fetch time?

o The modified date, as this can be misleading?

 

There is a modified date returned in most http headers, and files on a file
system all have a last modified date. How come it's not stored in the
crawldb?

 

Here is an extract from my 2 crawls:

 

http://dmoz.org/Arts/   Version: 4

Status: 2 (DB_fetched)

Fetch time: Thu Feb 22 12:45:43 GMT 2007

Modified time: Thu Jan 01 01:00:00 GMT 1970

Retries since fetch: 0

Retry interval: 30.0 days

Score: 0.013471641

Signature: fe52a0bcb1071070689d0f661c168648

Metadata: null

 

file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml
Version: 4

Status: 2 (DB_fetched)

Fetch time: Sat Feb 24 10:31:44 GMT 2007

Modified time: Thu Jan 01 01:00:00 GMT 1970

Retries since fetch: 0

Retry interval: 30.0 days

Score: 1.1035091E-4

Signature: 57254d9ca2988ce1bf7f92b6239d6ebc

Metadata: null

 

Looking forward to your reply.

 

Regards,

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 



RE: Modified date in crawldb

2007-01-25 Thread Armel T. Nene
Chee,

Have you successfully applied Nutch-61 to Nutch 0.8.1? I worked on that
version and was able to apply the patch fully, but I was not entirely
successful in running it with the XML parser plugin. If you have applied it
successfully, let me know.

Regards,

Armel 
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com

-Original Message-
From: chee wu [mailto:[EMAIL PROTECTED] 
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It
seems the modified date will be used by Nutch-61; you can find details at
the link below:
 http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA, and just wrote a simple function to fulfill
this:
1. Retrieve all the date information contained in the page content; a
regular expression is used to identify the date information.
2. Choose the newest date found as the page modified date.
3. Call the setModifiedTime() method of the CrawlDatum object in
FetcherThread.output().
Maybe you can use a parse filter to separate this function from the core
code (a rough sketch follows below).
I am also new to Nutch; if anything is wrong, please feel free to point it
out.
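
Something like this is the shape of steps 1-3 (the dd/MM/yyyy pattern and
the call site are just examples):

// Rough sketch of the heuristic above: scan page content for date strings,
// keep the newest, and store it as the page's modified time.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ModifiedTimeGuess {
  private static final Pattern DATE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");
  private static final SimpleDateFormat FMT = new SimpleDateFormat("dd/MM/yyyy");

  /** Returns the newest date found in the content, or 0 if none. */
  public static long newestDate(String content) {
    long newest = 0L;
    Matcher m = DATE.matcher(content);
    while (m.find()) {
      try {
        Date d = FMT.parse(m.group());
        if (d.getTime() > newest) newest = d.getTime();
      } catch (ParseException ignored) {
        // matched digits that aren't a real date; keep scanning
      }
    }
    return newest;
  }
}
// In FetcherThread.output(...): datum.setModifiedTime(newestDate(text));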


- Original Message - 
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb






threads-safe methods in Nutch

2007-01-25 Thread Armel T. Nene
Hi guys,

 

I know it's me again. I have been testing Nutch robustly lately, and here
are some threading issues that I found.

I am running version 0.8.2-dev. When Nutch is initially run (either from the
script or ANT), it defaults to 10 threads for the fetcher. This is actually
good for performance, as a large number of urls can be indexed fast enough.
The problem is that some plugins are not thread safe (or is it the fetcher
that's not thread-safe?).

 

I am running the parse-xml plugin (Nutch-185) and have some issues:

 

When running multiple threads, such as the default 10 threads, I have some
inconsistency in the stored fields and values. I found that the first 6
documents will be indexed without problem, then 4 with errors, then 4
correct, then x number with errors, and so forth. At first I couldn't see
where the problem was, and after several debugging sessions I realized that
it could be a threading issue. I ran Nutch with the minimum of 1 thread and
the fields were stored without any issues.

 

I don't know how to conclude this, but I think that the methods Nutch uses
for threading are not thread safe. I could be wrong, therefore I am
awaiting any reply.
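
To illustrate the kind of bug I suspect (this is only a guess at the cause,
not a diagnosis of the actual plugin code): if one parser instance is shared
by all fetcher threads and keeps per-document state in an instance field,
concurrent documents will overwrite each other's fields.

// Illustration only: shared mutable state of this shape would produce
// exactly the mixed-field symptom described above.
import java.util.HashMap;
import java.util.Map;

public class ParserStateExample {
  // BROKEN: one map shared by all fetcher threads; fields from different
  // documents get interleaved here.
  private Map sharedFields = new HashMap();

  // SAFER: keep per-document state local to the call.
  public Map parse(String content) {
    Map local = new HashMap(); // one map per document
    // ... populate local from content ...
    return local;
  }
}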

 

Regards,

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 



RE: Fetcher2

2007-01-24 Thread Armel T. Nene
Chee,

Can you make the code available through Jira.

Thanks,

Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com

-Original Message-
From: chee wu [mailto:[EMAIL PROTECTED] 
Sent: 24 January 2007 03:59
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

Thanks! I successfully ported Fetcher2 to Nutch .81, it's pretty easy... I
can share the code, if anyone wants to use it...
- Original Message - 
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 23, 2007 12:09 AM
Subject: Re: Fetcher2


 chee wu wrote:
 Fetcher2 should be a great help for me, but it seems it can't integrate
 with Nutch81. Any advice on how to use it based on .81?
 
 You would have to port it to Nutch 0.8.1 - e.g. change all Text
 occurrences to UTF8, and most likely make other changes too ...
 
 -- 
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 




How to modify crawldb values

2007-01-23 Thread Armel T. Nene
Hi guys,

 

I want to extend Nutch to do real-time indexing on a local file system. I
have been through the source code to find ways to modify values stored in
the CrawlDB. The idea is simple:

I have an external program (or a script) which checks for changes in a
directory (a url injected into the crawldb). When new changes are recorded,
the program will update the status in the crawldb and generate a new fetch
list for the fetcher to fetch. I do not want to make great changes to the
nutch source code, as I want the program to be compatible with future
releases. Now, I know the crawldatum is saved in the crawldb with the url. I
am not too sure, but I think the url is the key to retrieve the crawldatum.
For my program to work successfully, I need to know the following:

 

* How to read data from the crawldb; what data structure does it use
and how do I reference it?

* How to write back to the crawldb; updating information in the
crawldb, or probably creating a new one with changed and unchanged values.

 

This is an extract from the crawldb:

 

http://some-url.com/    Version: 4

Status: 2 (DB_fetched)

Fetch time: Thu Feb 22 12:44:05 GMT 2007

Modified time: Thu Jan 01 01:00:00 GMT 1970

Retries since fetch: 0

Retry interval: 30.0 days

Score: 1.0323955

Signature: f4c14c46074b66aad8829b8aa84cd636

Metadata: null

 

How can I get this information with an external program and modify/update
it? Once I know how to implement that part, I can call nutch in the usual
way of generate - fetch - updatedb - updatelinkdb - index - etc., so
generate will have the new values that I want re-indexed. This will stop the
fetcher from fetching a long list of urls (changed or unchanged, but needing
fetching because their next_fetch_time is due). The program gets its updates
from the underlying OS, which notifies it about any changes to the files and
folders being monitored. Once the program is working with sufficient tests,
I will be willing to share the source code; it's written in java and doesn't
need any script to launch nutch.

 

I will be looking forward to your kind support.

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 



RE: How to modify crawldb values

2007-01-23 Thread Armel T. Nene
Thanks for the reply. I'll try this, and if I encounter any problem I'll
send another email. This will be a good feature to have, and will probably
keep the project from branching into different subprojects.

Regards,

Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com
-Original Message-
From: Doğacan Güney [mailto:[EMAIL PROTECTED] 
Sent: 23 January 2007 15:06
To: nutch-dev@lucene.apache.org
Subject: Re: How to modify crawldb values

Hi,

Armel T. Nene wrote:
 Hi guys,

  

 I want to extend Nutch to use real-time indexing on local file system. I
 have been through the source code to find out ways to modify values stored
 in CrawlDB. The idea is simple:

  

 I have an external program (or a script) which checks for changes in a
 directory (url injected in the crawldb). When there are new changes
 recorded, the program will update the status in the crawldb and generate a
 new fetch list for the fetcher to fetch. I do not want to make great
changes
 to the nutch source code as I want the program to be compatible with
future
 releases. Now, I know the crawldatum is saved in the crawldb with the url.
I
 am not too sure but I think the url is the key to retrieve the crawldatum.
 For my program to work successfully, I need to know the following:

  

 * How to read data from the crawldb; what data structure does it
use
 and how to referenced to it?
   

Crawldb is essentially a list of <url, CrawlDatum> pairs and is stored
as a MapFile. So you can read it with MapFile.Reader.get.
 * How to write back to the crawldb; updating information back to the
 crawldb or probably creating a new one with changed and unchanged values.

Current FS implementation is write-once, so you can't modify it. But you
can read it one-by-one (possibly with MapFile.Reader.next) and then write
a new one with MapFile.Writer.
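
A minimal sketch of both halves against the hadoop-0.4-era API (the part
file path and the UTF8 key type are assumptions; check your own crawldb):

// Stream <url, CrawlDatum> pairs out of a crawldb part file and write a new
// MapFile with (possibly modified) values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.UTF8;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbRewrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader =
        new MapFile.Reader(fs, "crawl/crawldb/current/part-00000", conf);
    MapFile.Writer writer =
        new MapFile.Writer(fs, "crawl/crawldb-new/part-00000",
                           UTF8.class, CrawlDatum.class);
    UTF8 url = new UTF8();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {  // read entries one by one
      // ... adjust datum here, e.g. datum.setFetchTime(...) ...
      writer.append(url, datum);       // copy into the new crawldb
    }
    writer.close();
    reader.close();
  }
}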

  






is crawldb format in Nutch 0.8 compatible with Nutch0.7

2007-01-23 Thread Armel T. Nene
Hi guys,

 

I am running into some nightmares when trying to iterate over values in the
Nutch 0.8.2 crawldb. I am getting a hadoop exception such as the following:

 

07/01/23 18:33:56 INFO conf.Configuration: parsing jar:file:/C:/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/hadoop-default.xml
07/01/23 18:33:56 INFO conf.Configuration: parsing jar:file:/C:/nutch-0.8.2-dev/nutch-0.8.2-dev.jar!/nutch-default.xml
07/01/23 18:33:56 INFO conf.Configuration: parsing jar:file:/C:/nutch-0.8.2-dev/nutch-0.8.2-dev.jar!/nutch-site.xml
Exception in thread "main" java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:33)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:88)
        at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:321)

Therefore, if I can iterate over the values contained in the crawldb using
the Nutch 0.7 API, I should think this will fix the issue. So the question
is:

is Nutch 0.8 backward compatible with Nutch 0.7.2?

 

Thanks,

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 



java.lang.IllegalStateException

2007-01-19 Thread Armel T. Nene
Hi guys,

 

I am using Nutch 0.8.1; for the past 2 days I have been getting the
following exception: java.lang.IllegalStateException. The exception started
after I implemented the Nutch-61 patch (Adaptive Re-crawl Interval). In
short, this happens:

I am trying to crawl XML files (locally, and remotely on a web server); once
crawled, the fetcher sends the files to their processing parsers. This is
where the exception is thrown, as the parsers launch but do not perform any
activity on the files. If anybody has dealt with this type of error, please
let me know how to get rid of it. Below is an extract from my log file.

 

2007-01-18 14:16:16,371 INFO  parse.xml - XMLParser config path : ..

2007-01-18 14:16:16,371 INFO  parse.xml - XMLParser config path : ..

2007-01-18 14:16:16,371 WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

2007-01-18 14:16:16,371 WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_11.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

2007-01-18 14:16:16,387 INFO  parse.xml - XMLParser config path : ..

2007-01-18 14:16:16,403 WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_13.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

2007-01-18 14:16:16,403 INFO  parse.xml - XMLParser config path : ..

2007-01-18 14:16:16,403 WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_14.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

2007-01-18 14:16:16,418 INFO  parse.xml - XMLParser config path : ..

2007-01-18 14:16:16,418 WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_10.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

2007-01-18 14:16:17,887 INFO  fetcher.Fetcher - Fetcher: done

 

If a root element is not set within an XML file, I would expect a null
pointer exception to be thrown, not an IllegalStateException. Can anyone
shed some light on this error?
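
For what it's worth, "Root element not set" looks like the message JDOM's
Document.getRootElement() throws when the document was never populated; a
defensive check like this might at least turn it into a clean parse failure
(assuming the plugin builds a JDOM Document):

// Sketch: avoid the IllegalStateException from an empty JDOM document.
import org.jdom.Document;
import org.jdom.Element;

public class RootElementGuard {
  public static Element rootOrNull(Document doc) {
    if (doc == null || !doc.hasRootElement()) {
      return null;   // caller can record a parse failure instead of throwing
    }
    return doc.getRootElement();
  }
}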

 

Thanks.

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 



RE: Next Nutch release

2007-01-17 Thread Armel T. Nene
Hi guys,

 

I have been working on NUTCH-61 (Adaptive re-fetch interval; detecting
unmodified content), applying it to Nutch 0.8.1. Here are some points:

 

1. This feature is great for Nutch to have, as it differentiates between
modified and unmodified content, and therefore does not index a document
twice even if its fetch time has arrived.

a. There are some performance issues here. Even with this patch, Nutch
still fetches the content and then checks its status against the last
modified time in the database. If it has to check 1000 files before
indexing the following 10 files, this will cause a real problem for those
who are after real-time indexing.

 

2. Since I applied this patch to Nutch 0.8.1, when I try to parse xml
files with our modified version of the xml parser/indexer plugin, the
fetcher throws the following exception:

 

WARN  fetcher.Fetcher - Error parsing:
file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200):
java.lang.IllegalStateException: Root element not set

 

The system does not hang or crash, but the xml file will be indexed without
any generated fields. The plugin works fine without the patch. I have
another parser, for graphics and other formats, that also fails when used
with the patch. So far this problem occurs when using the file protocol.

 

3. The patch works fine when indexing web sites using the http protocol.

 

I am willing to work with Andrzej to make it stable, as I understand he is
the architect of this patch. I have the possibility of testing it in a mixed
environment in our computer lab. This patch can be the stepping stone for
other features, such as real-time indexing and a fetch queue for index
updating, as opposed to creating a new index each time.

 

Best Regards,

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: 17 January 2007 15:39
To: nutch-dev@lucene.apache.org
Subject: Re: Next Nutch release

 

Sami Siren wrote:
 2007/1/17, Enis Soztutar [EMAIL PROTECTED]:

  Hi all, for NUTCH-251:

  I suppose that NUTCH-251 is relatively a significant issue by the votes.
  Stafan has written a good plugin for the admin gui and I have updated it
  to work with nutch-0.8, hadoop 0.4.

 Good to hear someone is working on that! Why not target it to the
 trunk version of Nutch?

It is targeted to the trunk already. The previous version was targeted to
nutch-0.8, hadoop 0.4, since back then those versions were the latest in
the trunk.

  - a web server to serve plugin jsp's

 Why not make it a regular war? Also please consider making a clean
 separation of view/logic when you implement the web ui.

As Stafan's version used an embedded Jetty server, I continued this way.
But I will consider that possibility also.

 --
 Sami Siren



protocol-smb: a new protocol plugin for Windows Shares

2007-01-05 Thread Armel T. Nene
Hi guys,

 

We've developed a plugin: http://issues.apache.org/jira/browse/NUTCH-427

This plugin allows you to crawl MS Windows Shares. It uses a property file
to read user credentials.

 

 

We'd appreciate community feedback on this plugin, and possible inclusion
in future versions.

 

 

 

Best regards,

 

Armel T. Nene

 



Nutch site crawling

2006-12-07 Thread Armel T. Nene
Hi,

 

Is it possible to let Nutch crawl a set of documents at a time?

 

I have set-up Nutch with the following option:

 

topN 20

 

depth 2

 

Therefore I wanted Nutch to crawl my directory, and only as deep as 2 links
from the root directory. Now, the root directory itself contains more than
20 files, but my understanding of topN is that it makes the crawler fetch 20
documents and then index them. At the next crawl, it chooses another 20
files from the directory and fetches and indexes them.

 

My problem is that when Nutch crawls, it keeps on fetching the same files
over and over again. That is a severe issue in my case, because I have to
run Nutch on directories with more than 100 GB of data. It is more efficient
to crawl and index a small set of files at a time than to try to fetch all
the data before indexing. Can you suggest a workaround for this? Or just let
me know what I am doing wrong.

 

Thanks in advance.

 

Regards,

 

Armel



Nutch Re-crawl same file over and over again

2006-12-06 Thread Armel T. Nene
Hi,

I have set up Nutch to crawl my local filesystem. I set topN to 20 and depth
to 2. But when Nutch re-crawls, it re-crawls the same files over and over
again. The directory doesn't contain any sub-directories; can someone tell
me what might be the cause? There are more than 20 files in the directory,
so why is nutch only getting the same twenty files?

Thanks,

Armel


-Original Message-
From: Michael Stack [mailto:[EMAIL PROTECTED] 
Sent: 06 December 2006 16:04
To: Shay Lawless
Cc: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org;
[EMAIL PROTECTED]
Subject: Re: [Archive-access-discuss] Full List of Metadata Fields

Hey Shay.

Some friendly advice.  Cross-posting a question will make you unpopular
fast.  It's best to start on the most appropriate-seeming list and only
move on from there if you are getting no satisfaction.  The below
question looks most at home over on the archive-access list.  Let me
have a go at answering it there.

Yours,
St.Ack 


Shay Lawless wrote:
 Hi all,

 I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version 
 0.5.0-200611082313) to Index a collection of ARC files generated by a 
 web crawl using the Heritrix web crawler (Version 1.4.0).

 When I check the metadata tag on the wera front-end the following list 
 of tags are displayed

 ARC Identifier
 URL
 Time of Archival
 Last Modified Time
 Mime-Type
 File Status
 Content Checksum
 HTTP Header

 When I click on the explain link in the NutchWax front-end the 
 following list of tags are displayed

 Segment
 Digest
 Date
 ARCDate
 Encoding
 Collection
 ARCName
 ARCOffset
 ContentLength
 PrimaryType
 subType
 URL
 Title
 Boost

 Is there a full list of the metadata fields that NutchWax/Nutch
 creates when indexing? I'm particularly interested in tags relating to
 the actual content of each page, i.e. content type, description, etc.
 When searching, does NutchWax/Nutch search across such tags or just
 across the parsed text of each page for occurrences of keywords?

 Any help you can provide would be greatly appreciated!

 Shay
  
 






RE: Indexing and Re-crawling site

2006-12-05 Thread Armel T. Nene
Lukas,

 

I was wondering about running Nutch as a Windows service. I was able to
implement it as follows:

1. Create a java program that acts as a Nutch launcher and re-crawler.

2. Download JavaService from http://javaservice.objectweb.org/

3. Follow the tutorial to turn your java program into a Windows service.
 

I then tested it on Windows Server 2003 and XP, and it works fine. If you
want me to post the code, let me know; maybe others can use it too. A
bare-bones sketch of the launcher idea is below.
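
The interval, crawl arguments and stop hook here are examples only:

// Bare-bones launcher that JavaService can wrap as a Windows service.
// Crawl arguments and interval are illustrative, not my production values.
public class NutchServiceLauncher {
  private static volatile boolean running = true;

  public static void main(String[] args) throws Exception {
    long intervalMs = 60L * 60L * 1000L; // re-crawl every hour (example)
    while (running) {
      // call Nutch's own entry point instead of the shell script
      org.apache.nutch.crawl.Crawl.main(
          new String[] { "urls", "-dir", "crawl", "-depth", "2", "-topN", "20" });
      Thread.sleep(intervalMs);
    }
  }

  /** JavaService can be configured to call a stop method like this one. */
  public static void stop(String[] args) {
    running = false;
  }
}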

 

Regards,

 

Armel

 

-Original Message-
From: Lukas Vlcek [mailto:[EMAIL PROTECTED] 
Sent: 04 December 2006 22:12
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing and Re-crawling site

 

Hi,

I will try to use my out-dated knowledge to answer (or confuse you on) your
items:

 1.  Why does nutch have to create a new index every time when indexing,
 while it could just merge with the old existing index? I tried changing the
 value in the IndexMerger class to 'false' while creating an index, so that
 Lucene doesn't recreate a new index each time it is indexing. The problem
 with this is, I keep on having some exception when it tries to merge the
 indexes. There is a lock timeout exception that is thrown by the
 IndexMerger, and consequently the index doesn't get created. Is it possible
 to let nutch index by merging with an existing index? I have to crawl about
 100Gb of data, and if there are only a few documents that have changed, I
 don't want nutch to recreate a new index because of that, but to update the
 existing index by merging it with the new one. I need some light on this.

This is more one for the Nutch experts, but to me it seems that creating a
new index is reasonable. Besides other things, it means that the original
index is still searchable while the new index is being created (creating a
new index can take a long time, depending on your settings). Updating one
document at a time in a large index is not a very optimal approach, I think.
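
That said, if you do want to merge rather than recreate, plain Lucene can do
it along these lines - a minimal sketch with example paths; the lock timeout
you describe is typical of two writers touching the same index directory:

// Sketch: merge a freshly built index into an existing one with plain Lucene,
// roughly what Nutch's IndexMerger automates. Paths are examples.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIntoExisting {
  public static void main(String[] args) throws Exception {
    // false = open the existing index for appending, don't recreate it
    IndexWriter writer = new IndexWriter("crawl/index",
        new StandardAnalyzer(), false);
    Directory[] fresh = { FSDirectory.getDirectory("crawl/new-index", false) };
    writer.addIndexes(fresh); // merge the new documents in
    writer.optimize();
    writer.close();
  }
}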

 

 2.  What is the best way to make nutch re-crawl? I have implemented a
 class that loops the crawl process; it has a crawl interval, which is set
 in a property file, and a running status. The running status is a Boolean
 variable which is set to true if the re-crawl process is ongoing, or false
 if it should stop. But with this approach, it seems that the index is not
 being fully generated: the values in the index cannot be queried. The
 re-crawl is in java, which calls an underlying ant script to run nutch. I
 know most re-crawls are written as batch scripts, but can you tell me which
 one you recommend? A batch script or a loop-based java program?

I used to use batch and was happy with it.

 3.  What is the best way of implementing nutch as a windows service or
 unix daemon?

Sorry - what do you mean by this?

Regards,

Lukas



Indexing and Re-crawling site

2006-11-28 Thread Armel T. Nene
Hi guys,

 

I have a few questions regarding the way nutch indexes and the best way a
re-crawl can be implemented.

 

1.  Why does nutch have to create a new index every time when indexing,
while it could just merge with the old existing index? I tried changing the
value in the IndexMerger class to 'false' while creating an index, so that
Lucene doesn't recreate a new index each time it is indexing. The problem
with this is, I keep on having some exception when it tries to merge the
indexes. There is a lock timeout exception that is thrown by the
IndexMerger, and consequently the index doesn't get created. Is it possible
to let nutch index by merging with an existing index? I have to crawl about
100Gb of data, and if there are only a few documents that have changed, I
don't want nutch to recreate a new index because of that, but to update the
existing index by merging it with the new one. I need some light on this.

 

2.  What is the best way to make nutch re-crawl? I have implemented a
class that loops the crawl process; it has a crawl interval, which is set in
a property file, and a running status. The running status is a Boolean
variable which is set to true if the re-crawl process is ongoing, or false
if it should stop. But with this approach, it seems that the index is not
being fully generated: the values in the index cannot be queried. The
re-crawl is in java, which calls an underlying ant script to run nutch. I
know most re-crawls are written as batch scripts, but can you tell me which
one you recommend? A batch script or a loop-based java program?

 

3.  What is the best way of implementing nutch as a windows service or
unix daemon?

 

Thanks,

 

Armel



RE: [jira] Created: (NUTCH-408) Plugin development documentation

2006-11-25 Thread Armel T. Nene
I agree with you that documentation is vital, not just for extending the
current version but also for any plugins and patches created. I have been
spending almost two weeks trying to adapt nutch to my project, but I spend
more time reading code and trying to understand what it does before I can
even start to fix problems. Come on guys, documentation is good coding
practice; we can't read your minds to know exactly what you were trying to
achieve by just looking at the implementation code.

This is just good constructive criticism.

:) Armel

-Original Message-
From: nutch.newbie (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: 25 November 2006 03:45
To: nutch-dev@lucene.apache.org
Subject: [jira] Created: (NUTCH-408) Plugin development documentation

Plugin development documentation


 Key: NUTCH-408
 URL: http://issues.apache.org/jira/browse/NUTCH-408
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1
 Environment: Linux Fedora
Reporter: nutch.newbie


Documentation is rare! But it is vital for extending the current (0.9)
nutch. The current docs on the wiki for 0.7 plugin development were good,
but they don't apply to 0.9, and new developers who are joining directly at
0.9 find the 0.7 documentation not enough. A more practical plugin-writing
document for 0.9 is desired, one that also explains the plugin principles in
practical terms, i.e. extension points and libs etc. Furthermore it would be
good to provide some best-practice examples, i.e.:

look for the lib you are planning to use, in case it is already in the lib
folder and that version of the external lib is good enough for the plugin
dev, rather than pulling in another version; things like that.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






RE: Nutch folder configuration

2006-11-21 Thread Armel T. Nene
Also, can Nutch be run as a Windows service? Let me know, so that I don't
waste my time trying to code something that won't work.

-Original Message-
From: Armel T. Nene [mailto:[EMAIL PROTECTED] 
Sent: 21 November 2006 21:56
To: nutch-dev@lucene.apache.org
Subject: Nutch folder configuration

Hi all,

 

I want to configure Nutch so that I can have its various folders, such as
conf, crawldb and index, stored on different drives. So far, it keeps on
giving me the following error:

ERROR mapred.JobClient: Input directory C:/omittted/omitted/testcrawl/urls
in local is invalid.

Is Nutch always looking for folders in its current directory? I am also
writing a java client to be able to launch Nutch without the script, so that
it can be wrapped as a Windows service. I am having problems with the Nutch
classpath; can you wise me up on that issue too? But first, how can I let
Nutch know that the folders are stored in a different location? The settings
for the folders are loaded from a property file and the values are passed to
the Generator, Injector, Fetcher and Indexer, but it still has problems. I
am looking forward to a good tip on this.

 

Armel




RE: [jira] Commented: (NUTCH-185) XMLParser is configurable xml parser plugin.

2006-11-21 Thread Armel T. Nene
Rida,

There is something I would like to clarify, when using a namespace and xpath
to store content in the index, can this be seen as multi-fields. For example
if we are storing customer name and customer address which are been declared
in a xml configuration file, is that multi-field. Please explain, sorry I am
quite new to the Nutch architecture.

Armel 

-Original Message-
From: Rida Benjelloun (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: 20 November 2006 22:16
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-185) XMLParser is configurable xml parser
plugin.

[ http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12451452 ]

Rida Benjelloun commented on NUTCH-185:
---

Nutch doesn't support multi-field values, so I decided to merge the content
into the same field. If you want to search the field, you should index it as
Text instead of keyword.



 XMLParser is configurable xml parser plugin.
 

 Key: NUTCH-185
 URL: http://issues.apache.org/jira/browse/NUTCH-185
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, indexer
Affects Versions: 0.7.2, 0.8.1, 0.8
 Environment: OS Independent
Reporter: Rida Benjelloun
 Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip


 Xml parser is a configurable plugin. It uses XPath and namespaces to do the
 mapping between the XML elements and Lucene fields.
 Information:
 1- Copy xmlparser-conf.xml to the nutch/conf dir.
 2- To index your custom XML file, you have to modify xmlparser-conf.xml.
 This parser uses namespaces and XPATH to parse XML content.
 The config file does the mapping between the XML nodes (using XPATH) and
 Lucene fields.
 Example: <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
 3- The xmlIndexerProperties element encapsulates a set of fields associated
 with a namespace. If the namespace is found in the xml document, the fields
 represented by the namespace will be indexed.
 Example:
 <xmlIndexerProperties type="filePerDocument"
                       namespace="http://purl.org/dc/elements/1.1/">
   <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
   <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0"/>
 </xmlIndexerProperties>
 4- It is possible to define a default namespace that will be applied when
 the parser doesn't find any namespace in the document, or when the
 namespace found in the xml document doesn't match the namespace defined in
 the xmlIndexerProperties.
 Example:
 <xmlIndexerProperties type="filePerDocument" namespace="default">
   <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0"/>
 </xmlIndexerProperties>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






RE: What's the status of Nutch-GUI?

2006-11-21 Thread Armel T. Nene
Chris, Rida,

Here are the changes that I have made to XMLParseConfig.java, in the
populateConfig(Document doc) method:


if (elemNode.getAttribute("nodeXpath") != null) {
    String nodeXpath = elemNode.getAttributeValue("namespace");
    xip.setNodeXpath(nodeXpath);
}
List fieldList = XPath.selectNodes(elemNode, "field");

if (fieldList != null) { // modified 20062011 by Armel
    for (int j = 0; j < fieldList.size(); j++) {
        Element elem = (Element) fieldList.get(j);
        XMLField xf = populateXMLField(elem);
        fieldsColl.add(xf);
    }
}

/*
 * modified by Armel, 20062011:
 * if fieldList is null because the node doesn't contain a "field" element,
 * build the field from the node itself
 */
if (fieldList == null) {
    XMLField xf = populateXMLField(elemNode);
    fieldsColl.add(xf);
}

And the populateXMLField(Element el) method:

if (elem.getAttribute("name") != null)
    xf.setFieldName(elem.getAttributeValue("name"));

if (elem.getAttribute("name") == null) { // modified by Armel
    List att = elem.getAttributes();
    if (att != null) { // modified by Armel - loop and create fields accordingly
        for (int i = 0; i < att.size(); i++) {
            Attribute at = (Attribute) att.get(i);
            xf.setFieldName(elem.getAttributeValue(at.getName()));
        }
    }
}
if (elem.getAttribute("xpath") != null)
    xf.setFieldXPath(elem.getAttributeValue("xpath"));
This is supposed to implement the feature I want; please advise.

Armel

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: 20 November 2006 23:30
To: nutch-dev@lucene.apache.org
Subject: Re: What's the status of Nutch-GUI?

Hi Armel,

On 11/20/06 1:44 PM, Armel T. Nene [EMAIL PROTECTED] wrote:

 Hi Chris,
 
 I am trying to extend parse-xml to enable the creation of lucene fields
 straight from an xml file. For example, a database table that has been
 parsed as an XML file should be stored in the index with the relevant
 fields, i.e. customer name, address and so on. This file will not have a
 namespace associated with it and should not be stored as xmlcontent in the
 database. Currently, parse-xml looks for known fields in the document and
 stores the associated values with the field name. I have added an extra
 condition: if the known fields are not present in the current document, the
 elements or nodes in the document should become the new fields stored in
 the index, with their values.

I think that this is fine.
 
 Therefore, when parse-xml receives an xml document with no namespace
 available, it will parse the document and store each element name as a new
 field in the index, along with the element's associated value.
 
 Let me know if I am on the right track, because I know I don't have to
 write a separate plugin for this feature, just extend (or modify)
 parse-xml.

I think that parse-xml will support what you are talking about. In terms of
the check that you are doing to see if a field exists or not before adding
another value for it in the index, as I understand Lucene, I believe that
you could just omit this check and add the field regardless. If you add
multiple values for the same field in a Document, e.g.:

<snip>
Document doc = new Document();

doc.add(new Field("fieldname", "fieldvalue", ...));
doc.add(new Field("fieldname", "fieldvalue2", ...));
</snip>

Both the values "fieldvalue" and "fieldvalue2" will get stored in the index
for the key "fieldname". So, if I understand you correctly (which I may
not ;) ), then I think you can omit the check that you are talking about
above and just go with adding the same field name 2x.

HTH,
  Chris

 
 Cheers,
 
 Armel
 
 
 -Original Message-
 From: Chris Mattmann [mailto:[EMAIL PROTECTED]
 Sent: 20 November 2006 18:40
 To: nutch-dev@lucene.apache.org
 Subject: Re: What's the status of Nutch-GUI?
 
 Hi Sami and Scott,
 
  This is on my TO-DO list as one of the items that I will begin working on
 getting into the sources as a committer. Additionally, I plan on
integrating
 and testing the parse-xml plugin into the source tree. As soon as I get my
 Apache account and SVN access, I

File Protocol

2006-11-15 Thread Armel T. Nene
I want to make Nutch crawl a filesystem and, if the content of the
filesystem has changed since it was last crawled, fetch the system again. I
studied the code for the Adaptive Re-Fetch cycle, but the patch is out of
date, as Nutch has since implemented other features. Also, I don't want to
change anything in the core code, so that I can easily migrate to newer
versions. I want to develop the feature as a plugin, similar to the
Protocol-File plugin.

 

I have been digging in the source code of the Protocol-File plugin and
therefore have a few questions:

 

My Nutch Revision is: 475201 from the subversion server.

 

In the class File.java (Protocol-File plugin), the getProtocolOutput method
has a condition as follows:

 

Line 62:

else if ((code >= 300 && code < 400) && code != 304) { // handle redirect
    if (redirects == MAX_REDIRECTS)
        throw new FileException("Too many redirects: " + url);
    u = new URL(response.getHeader("Location"));
    redirects++;
    if (LOG.isTraceEnabled()) {
        LOG.trace("redirect to " + u);
    }
}

 

In my case, if the file has not been modified, the code will be 304 (NOT
MODIFIED). I want to know the effect of this line on the CrawlDB. The file
should not be removed or marked as GONE (CrawlDatum.STATUS_FETCH_GONE). If
that's already the case, it means I don't have to write a plugin to handle
checking for unmodified content. If not, tell me how the Protocol-File
plugin checks for unmodified content, as it says it mimics an http response.
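
The check I am effectively asking about would look something like this
(example logic only, not the actual plugin code):

// Sketch: compare the filesystem's last-modified stamp with what the
// crawldb recorded, and report 304 when nothing changed.
import java.io.File;

public class UnmodifiedCheck {
  /** Mimic an http response code for a local file (example logic only). */
  public static int responseCode(File f, long lastFetchModifiedTime) {
    if (!f.exists()) return 404;                  // gone
    if (f.lastModified() <= lastFetchModifiedTime) {
      return 304;                                 // NOT MODIFIED
    }
    return 200;                                   // fetch and reparse
  }
}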

 

Armel



RE: [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2006-11-12 Thread Armel T. Nene
Andrzej, the feature that I am after can be implemented by this patch if I
just adapt it right. I am not sure of this, but the patch seems a little bit
too old to be applied to the latest release, Nutch 0.8.1.

I want to implement a feature where the fetcher will fetch files but only
add them if they have been modified after the latest fetch time. Now, I want
to implement that on a filesystem first and extend it later for network
fetching. I would like to have a look at the full source code for your
patch, in a zip file if possible. Once the feature is implemented, I will
post it back here. I'd like to start working from your code first. You can
either make the source code available here or mail it to me at armel dot
nene @ idna-solutions dot com.


-Original Message-
From: Andrzej Bialecki (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: 12 November 2006 19:39
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting
umodified content

[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12449170 ]

Andrzej Bialecki  commented on NUTCH-61:


Unfortunately, this patch hasn't been applied yet, due to its complexity and
lack of testing.

But it will be, sooner or later, because this functionality is required for
any serious use.

I'm planning to bring this patch to the latest trunk, and then apply it
piece-wise over the next couple of weeks.

 Adaptive re-fetch interval. Detecting umodified content
 ---

 Key: NUTCH-61
 URL: http://issues.apache.org/jira/browse/NUTCH-61
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Andrzej Bialecki 
 Assigned To: Andrzej Bialecki 
 Attachments: 20050606.diff, 20051230.txt, 20060227.txt,
nutch-61-417287.patch


 Currently Nutch doesn't automatically adjust its re-fetch period, no matter
whether individual pages change seldom or frequently. The goal of these
changes is to extend the current codebase to support various possible
adjustments to re-fetch times and intervals, and specifically a re-fetch
schedule which tries to adapt the period between consecutive fetches to the
period of content changes.
 Also, these patches implement checking whether the content has changed
since the last fetch; protocol plugins are also changed to make use of this
information, so that if content is unmodified it doesn't have to be fetched
and processed.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira