Re: File system
If you are talking about the Nutch Content objects that are stored in the segments while pages are fetched, then you would need to write a MapReduce job that reads in the Content objects and does whatever processing you desire.

Dennis

oSilvio wrote:
Very useful information, thanks! But to extract the data inside those files (such as HTML pages), I can find no algorithm provided by Nutch, nor a description of the process used to store the data. Do you know if it is possible to extract it using Lucene?

Dennis Kubes-2 wrote:
The Nutch databases are in either SequenceFile or MapFile format, both of which store key/value pairs. Their keys and values are Writable implementations, which translate an object into its byte equivalent and vice versa. The data and index files together are the MapFile format: data is a SequenceFile, and index is the index that MapFile uses for seeking to a specific key. Please see the Hadoop wiki for more information about SequenceFile, MapFile, and Writable formats.

Dennis

oSilvio wrote:
Does anybody know, briefly, how the file structure works? It seems that the data is compressed or something; it's not possible to understand what's recorded in the data or index files.

Thanks
Silvio
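The layout Dennis describes (serialized key/value pairs in a data file, plus an index used to seek to a key) can be illustrated with a small toy model. This is only a sketch of the idea, not Hadoop's actual on-disk format or API: the length-prefixed encoding, the names write_text, read_text, write_toy_map_file, lookup and INDEX_INTERVAL, and the example URLs are all assumptions made for illustration. It may also help explain why the raw data file looks unreadable: the records are binary, not plain text.

```python
import io
import struct

# Hypothetical setting: record an index entry for every 2nd key.
INDEX_INTERVAL = 2


def write_text(out, s):
    """Serialize a string as a 4-byte big-endian length plus UTF-8 bytes."""
    data = s.encode("utf-8")
    out.write(struct.pack(">I", len(data)))
    out.write(data)


def read_text(inp):
    """Inverse of write_text; returns None at end of stream."""
    header = inp.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack(">I", header)
    return inp.read(length).decode("utf-8")


def write_toy_map_file(records):
    """Write (key, value) pairs in sorted key order.

    Returns (data_bytes, index), where index is a list of
    (key, byte offset) for every INDEX_INTERVAL-th record.
    """
    data = io.BytesIO()
    index = []
    for i, (key, value) in enumerate(sorted(records)):
        if i % INDEX_INTERVAL == 0:
            index.append((key, data.tell()))
        write_text(data, key)
        write_text(data, value)
    return data.getvalue(), index


def lookup(data_bytes, index, wanted):
    """Seek to the last indexed key <= wanted, then scan forward."""
    start = 0
    for key, offset in index:
        if key <= wanted:
            start = offset
        else:
            break
    inp = io.BytesIO(data_bytes)
    inp.seek(start)
    while True:
        key = read_text(inp)
        if key is None or key > wanted:
            return None  # ran past the place where the key would be
        value = read_text(inp)
        if key == wanted:
            return value


# Made-up records: URL keys mapped to page content.
pages = [
    ("http://example.org/b", "<html>B</html>"),
    ("http://example.org/a", "<html>A</html>"),
    ("http://example.org/c", "<html>C</html>"),
]
data_bytes, index = write_toy_map_file(pages)
found = lookup(data_bytes, index, "http://example.org/b")
missing = lookup(data_bytes, index, "http://example.org/zzz")
```

Because the index holds only some of the keys, a lookup seeks to the nearest earlier indexed key and scans forward from there, which is why the records have to be written in sorted key order. For real segment data you would use the Hadoop and Nutch classes Dennis mentions rather than a hand-rolled reader like this one.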
Re: File system
I've seen it now, thanks for the attention.

--
View this message in context: http://www.nabble.com/File-system-tp21022587p2108.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: File system
Very useful information, thanks! But to extract the data inside those files (such as HTML pages), I can find no algorithm provided by Nutch, nor a description of the process used to store the data. Do you know if it is possible to extract it using Lucene?

--
View this message in context: http://www.nabble.com/File-system-tp21022587p21032357.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Build failed in Hudson: Nutch-trunk #663
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/663/changes

--
[...truncated 2223 lines...]
A  src/plugin/protocol-http/src/test/org/apache/nutch
A  src/plugin/protocol-http/src/test/org/apache/nutch/protocol
A  src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http
A  src/plugin/protocol-http/src/java
A  src/plugin/protocol-http/src/java/org
A  src/plugin/protocol-http/src/java/org/apache
A  src/plugin/protocol-http/src/java/org/apache/nutch
A  src/plugin/protocol-http/src/java/org/apache/nutch/protocol
A  src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
A  src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
A  src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/package.html
AU src/plugin/protocol-http/plugin.xml
AU src/plugin/protocol-http/build.xml
A  bin
AU bin/nutch
A  docs
A  docs/ms
A  docs/ms/search.html
A  docs/ms/help.html
A  docs/ms/about.html
A  docs/zh
A  docs/zh/search.html
A  docs/zh/help.html
A  docs/zh/about.html
A  docs/ca
A  docs/ca/search.html
A  docs/ca/help.html
A  docs/ca/about.html
A  docs/pt
A  docs/pt/search.html
A  docs/pt/help.html
A  docs/pt/about.html
A  docs/sr
AU docs/sr/search.html
AU docs/sr/help.html
AU docs/sr/about.html
A  docs/sv
A  docs/sv/search.html
A  docs/sv/help.html
A  docs/sv/about.html
A  docs/de
A  docs/de/search.html
A  docs/de/help.html
A  docs/de/about.html
A  docs/fi
A  docs/fi/search.html
A  docs/fi/help.html
A  docs/fi/about.html
A  docs/en
A  docs/en/search.html
A  docs/en/help.html
A  docs/en/about.html
A  docs/es
A  docs/es/search.html
A  docs/es/help.html
A  docs/es/about.html
A  docs/fr
A  docs/fr/search.html
AU docs/fr/help.html
A  docs/fr/about.html
A  docs/jp
A  docs/jp/search.html
A  docs/jp/help.html
A  docs/jp/about.html
A  docs/nl
A  docs/nl/search.html
A  docs/nl/help.html
A  docs/nl/about.html
A  docs/sh
AU docs/sh/search.html
AU docs/sh/help.html
AU docs/sh/about.html
A  docs/th
A  docs/th/search.html
A  docs/th/help.html
A  docs/th/about.html
A  docs/pl
A  docs/pl/search.html
A  docs/pl/help.html
A  docs/pl/about.html
A  docs/it
AU docs/it/search.html
AU docs/it/help.html
AU docs/it/about.html
A  docs/img
A  docs/img/lang
AU docs/img/lang/romanian.png
AU docs/img/lang/bulgarian.png
AU docs/img/lang/spanish.png
AU docs/img/lang/danish.png
AU docs/img/lang/dutch.png
AU docs/img/lang/icelandic.png
AU docs/img/lang/hungarian.png
AU docs/img/lang/russian.png
AU docs/img/lang/japanese.png
AU docs/img/lang/turkish.png
AU docs/img/lang/suomi.png
AU docs/img/lang/lithuanian.png
AU docs/img/lang/czech.png
AU docs/img/lang/greek.png
AU docs/img/lang/galego.png
AU docs/img/lang/polish.png
AU docs/img/lang/latvian.png
AU docs/img/lang/croatian.png
AU docs/img/lang/portuguese.png
AU docs/img/lang/french.png
AU docs/img/lang/swedish.png
AU docs/img/lang/german.png
AU docs/img/lang/chinese.png
AU docs/img/lang/malaysian.png
AU docs/img/lang/korean.png
AU docs/img/lang/arabic.png
AU docs/img/lang/italian.png
AU docs/img/lang/brazil.png
AU docs/img/lang/catala.png
AU docs/img/lang/thai.png
AU docs/img/lang/indonesian.png
AU docs/img/lang/norwegian.png
AU docs/img/lang/english.png
AU docs/img/poweredbynutch_01.gif
AU docs/img/poweredbynutch_02.gif
A  docs/img/reiter
AU docs/img/reiter/reiter_inactive_le.gif
AU docs/img/reiter/_spacer_cc.gif
AU docs/img/reiter/reiter_inactive_le1.gif
AU docs/img/reiter/bg_subnavi.gif
AU docs/img/reiter/002bg_fle.gif
AU docs/img/reiter/spacer_66.gif
AU docs/img/reiter/ul.gif
AU docs/img/reiter/_bg_reiter.gif
AU docs/img/reiter/logo_nutch.gif
AU docs/img/reiter/_bg_reiter_inactive.gif
AU docs/img/reiter/002bg_fre.gif
AU docs/img/reiter/reiter_inactive_ri.gif
AU docs/img/reiter/robots.gif
AU docs/img/reiter/spacer_ff9900.gif
AU docs/img/favicon.ico
A  docs/hu
A  docs/hu/search.html
A  docs/hu/help.html
A  docs/hu/about.html
A  index.html
Issue with searching keywords
Hi,

I have used the keywords plugin to crawl the meta keyword details, and that works fine. The issue arises when searching on a keyword.

1. Say the keyword is RedHat (with R and H capitalized, as crawled from the site). Searching with keywords:RedHat returns 0 results, but searching with keywords:redhat returns N results, which is the desired output.
2. If the keyword contains a space, as in RedHat Linux, then searching with keywords:RedHat Linux or keywords:redhat linux returns 0 results in both cases.

What can be done to solve this issue?

1. Should keywords not contain spaces?
2. Are there any rules for searching the keywords?

Please give your inputs.

Thanks in advance,
Rinesh.

--
View this message in context: http://www.nabble.com/Issue-with-searching-keywords-tp21047128p21047128.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.