Re: File system

2008-12-16 Thread Dennis Kubes
If you are talking about the Nutch Content objects that are stored in the
segments during fetching of pages, then you would need to write a MapReduce
job to read in the Content objects and do whatever processing you desire.


Dennis

oSilvio wrote:

Very useful information, thanks!
But in order to extract the data inside those files (like HTML pages), I can
find no algorithm provided by Nutch, nor a description of the process used to
store the data. Do you know if it is possible to extract it using Lucene?

 


Dennis Kubes-2 wrote:
The Nutch databases are in either SequenceFile or MapFile format, both of
which store key/value pairs.  Their keys and values are Writable
implementations, which translate an object into its byte equivalent and
vice versa.
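
A minimal sketch of the Writable idea, for illustration only. It uses just the
Java standard library: SimpleWritable is a hypothetical stand-in that mirrors
the write/readFields pair of Hadoop's Writable, and PageRecord is a made-up
record type, not a Nutch class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the Writable contract; not the Hadoop interface itself.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;      // object -> bytes
    void readFields(DataInput in) throws IOException;   // bytes -> object
}

// Made-up record type for the example: a URL plus a fetch count.
class PageRecord implements SimpleWritable {
    String url = "";
    int fetchCount;

    PageRecord() {}

    PageRecord(String url, int fetchCount) {
        this.url = url;
        this.fetchCount = fetchCount;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(fetchCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read back in the same order they were written.
        url = in.readUTF();
        fetchCount = in.readInt();
    }
}

class WritableSketch {
    // Serialize any SimpleWritable into a byte array.
    static byte[] toBytes(SimpleWritable w) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Rebuild a PageRecord from bytes produced by toBytes.
    static PageRecord fromBytes(byte[] bytes) {
        PageRecord r = new PageRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            r.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new PageRecord("http://example.com/", 3));
        PageRecord copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes -> " + copy.url + " " + copy.fetchCount);
    }
}
```

This is also why the files look unreadable in a text editor: the bytes on disk
are whatever each class's write method emitted, so they only make sense when
read back through the matching readFields.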


The data and index files together make up a MapFile.  data is a
SequenceFile holding the records in sorted key order; index is a smaller
file that the MapFile uses to seek to a specific key.
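
To make the data/index split concrete, here is a toy in-memory model. It is a
sketch under assumptions, not Hadoop's MapFile implementation: ToyMapFile, its
fields, and the interval parameter are all hypothetical names. The "index"
remembers every Nth key and its position, and a lookup searches the index
first and then scans "data" from that position.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a sorted data file plus a sparse index; not Hadoop's MapFile.
class ToyMapFile {
    // "data": records stored in sorted key order.
    private final List<String> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    // "index": every Nth key and the position of that record in "data".
    private final List<String> indexKeys = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>();
    private final int interval;

    ToyMapFile(int interval) {
        this.interval = interval;
    }

    // Keys must be appended in strictly increasing order.
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) <= 0) {
            throw new IllegalArgumentException("keys must be appended in sorted order: " + key);
        }
        if (keys.size() % interval == 0) {
            indexKeys.add(key);
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    // Returns the value stored for key, or null if the key is absent.
    String get(String key) {
        // Binary search the index for the last indexed key <= the wanted key.
        int lo = 0;
        int hi = indexKeys.size() - 1;
        int slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) {
            return null; // the key sorts before the first record
        }
        // Scan "data" forward from the indexed position.
        for (int i = indexPositions.get(slot); i < keys.size(); i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) {
                return values.get(i);
            }
            if (cmp > 0) {
                break; // passed the place where the key would have been
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyMapFile f = new ToyMapFile(2);
        f.append("http://a.example/", "page A");
        f.append("http://b.example/", "page B");
        f.append("http://c.example/", "page C");
        System.out.println(f.get("http://b.example/"));
        System.out.println(f.get("http://zzz.example/"));
    }
}
```

The point of the sparse index is that it stays small enough to hold in memory
while the bulk of the records stay in the data file.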


Please see the Hadoop wiki for more information about SequenceFile,
MapFile, and the Writable formats.


Dennis

oSilvio wrote:
Does anybody know how the file structure works, briefly?
It seems that the data is compressed or something; it's not possible to
understand what's recorded in the data or index files.
Thanks
Silvio






Re: File system

2008-12-16 Thread oSilvio

I've seen it now. Thanks for the attention.




-- 
View this message in context: 
http://www.nabble.com/File-system-tp21022587p2108.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: File system

2008-12-16 Thread oSilvio

Very useful information, thanks!
But in order to extract the data inside those files (like HTML pages), I can
find no algorithm provided by Nutch, nor a description of the process used to
store the data. Do you know if it is possible to extract it using Lucene?

 


-- 
View this message in context: 
http://www.nabble.com/File-system-tp21022587p21032357.html



Build failed in Hudson: Nutch-trunk #663

2008-12-16 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/663/changes

--
[...truncated 2223 lines...]
A src/plugin/protocol-http/src/test/org/apache/nutch
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http
A src/plugin/protocol-http/src/java
A src/plugin/protocol-http/src/java/org
A src/plugin/protocol-http/src/java/org/apache
A src/plugin/protocol-http/src/java/org/apache/nutch
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/package.html
AU src/plugin/protocol-http/plugin.xml
AU src/plugin/protocol-http/build.xml
A bin
AU bin/nutch
A docs
A docs/ms
A docs/ms/search.html
A docs/ms/help.html
A docs/ms/about.html
A docs/zh
A docs/zh/search.html
A docs/zh/help.html
A docs/zh/about.html
A docs/ca
A docs/ca/search.html
A docs/ca/help.html
A docs/ca/about.html
A docs/pt
A docs/pt/search.html
A docs/pt/help.html
A docs/pt/about.html
A docs/sr
AU docs/sr/search.html
AU docs/sr/help.html
AU docs/sr/about.html
A docs/sv
A docs/sv/search.html
A docs/sv/help.html
A docs/sv/about.html
A docs/de
A docs/de/search.html
A docs/de/help.html
A docs/de/about.html
A docs/fi
A docs/fi/search.html
A docs/fi/help.html
A docs/fi/about.html
A docs/en
A docs/en/search.html
A docs/en/help.html
A docs/en/about.html
A docs/es
A docs/es/search.html
A docs/es/help.html
A docs/es/about.html
A docs/fr
A docs/fr/search.html
AU docs/fr/help.html
A docs/fr/about.html
A docs/jp
A docs/jp/search.html
A docs/jp/help.html
A docs/jp/about.html
A docs/nl
A docs/nl/search.html
A docs/nl/help.html
A docs/nl/about.html
A docs/sh
AU docs/sh/search.html
AU docs/sh/help.html
AU docs/sh/about.html
A docs/th
A docs/th/search.html
A docs/th/help.html
A docs/th/about.html
A docs/pl
A docs/pl/search.html
A docs/pl/help.html
A docs/pl/about.html
A docs/it
AU docs/it/search.html
AU docs/it/help.html
AU docs/it/about.html
A docs/img
A docs/img/lang
AU docs/img/lang/romanian.png
AU docs/img/lang/bulgarian.png
AU docs/img/lang/spanish.png
AU docs/img/lang/danish.png
AU docs/img/lang/dutch.png
AU docs/img/lang/icelandic.png
AU docs/img/lang/hungarian.png
AU docs/img/lang/russian.png
AU docs/img/lang/japanese.png
AU docs/img/lang/turkish.png
AU docs/img/lang/suomi.png
AU docs/img/lang/lithuanian.png
AU docs/img/lang/czech.png
AU docs/img/lang/greek.png
AU docs/img/lang/galego.png
AU docs/img/lang/polish.png
AU docs/img/lang/latvian.png
AU docs/img/lang/croatian.png
AU docs/img/lang/portuguese.png
AU docs/img/lang/french.png
AU docs/img/lang/swedish.png
AU docs/img/lang/german.png
AU docs/img/lang/chinese.png
AU docs/img/lang/malaysian.png
AU docs/img/lang/korean.png
AU docs/img/lang/arabic.png
AU docs/img/lang/italian.png
AU docs/img/lang/brazil.png
AU docs/img/lang/catala.png
AU docs/img/lang/thai.png
AU docs/img/lang/indonesian.png
AU docs/img/lang/norwegian.png
AU docs/img/lang/english.png
AU docs/img/poweredbynutch_01.gif
AU docs/img/poweredbynutch_02.gif
A docs/img/reiter
AU docs/img/reiter/reiter_inactive_le.gif
AU docs/img/reiter/_spacer_cc.gif
AU docs/img/reiter/reiter_inactive_le1.gif
AU docs/img/reiter/bg_subnavi.gif
AU docs/img/reiter/002bg_fle.gif
AU docs/img/reiter/spacer_66.gif
AU docs/img/reiter/ul.gif
AU docs/img/reiter/_bg_reiter.gif
AU docs/img/reiter/logo_nutch.gif
AU docs/img/reiter/_bg_reiter_inactive.gif
AU docs/img/reiter/002bg_fre.gif
AU docs/img/reiter/reiter_inactive_ri.gif
AU docs/img/reiter/robots.gif
AU docs/img/reiter/spacer_ff9900.gif
AU docs/img/favicon.ico
A docs/hu
A docs/hu/search.html
A docs/hu/help.html
A docs/hu/about.html
A index.html

Issue with searching keywords

2008-12-16 Thread Rinesh1

Hi,
 I have used the keywords plugin to index the meta keywords of crawled
pages, and that works fine.
 The issue is with searching on a keyword.

 1. Say the keyword is RedHat (capital R and H, as crawled from the site).
 Searching for keywords:RedHat returns 0 results,
 but searching for keywords:redhat returns N results, which is the
desired output.

 2. If the keyword contains a space, like RedHat Linux, then searching
 for keywords:RedHat Linux or keywords:redhat linux returns 0 results in
both cases.

 What can be done to solve this issue?
 1. Should keywords not contain spaces?
 2. Are there any rules for searching on keywords?

 Please give your inputs.

Thanks in advance,
Rinesh.
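
For anyone trying to reproduce the symptoms above, here is a toy model of one
possible cause. It is an assumption, not something this thread confirms, and
it is not Nutch's code: ToyKeywordIndex and its methods are hypothetical. It
assumes the keyword is lowercased and stored as a single term at index time,
while the query term is looked up exactly as typed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of a single "keywords" field; an illustration only, not Nutch's code.
class ToyKeywordIndex {
    // term -> URLs of the documents carrying that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Assumed index-time behaviour: lowercase the whole keyword, keep it as one term.
    void add(String url, String keyword) {
        String term = keyword.toLowerCase(Locale.ROOT);
        postings.computeIfAbsent(term, k -> new HashSet<>()).add(url);
    }

    // Assumed query-time behaviour: the term is looked up exactly as typed.
    Set<String> searchVerbatim(String term) {
        Set<String> hits = postings.get(term);
        return hits == null ? Collections.<String>emptySet() : hits;
    }

    // Possible remedy: normalise the query term the same way the index did.
    Set<String> searchNormalized(String term) {
        return searchVerbatim(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ToyKeywordIndex index = new ToyKeywordIndex();
        index.add("doc1", "RedHat");
        index.add("doc2", "RedHat Linux");

        System.out.println(index.searchVerbatim("RedHat"));         // mixed case, as typed
        System.out.println(index.searchVerbatim("redhat"));         // lowercase, as typed
        System.out.println(index.searchNormalized("RedHat Linux")); // whole keyword, normalised
    }
}
```

In this model a mixed-case query misses because only the lowercased term was
stored, and a keyword with a space is only found when the whole keyword
reaches the lookup as one term rather than being split at the space. Whether
Nutch behaves this way depends on how the keywords plugin and the query
filters are configured.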
-- 
View this message in context: 
http://www.nabble.com/Issue-with-searching-keywords-tp21047128p21047128.html