Hi,

==Update==

Checking hadoop.log, I found some interesting info: the parsing did not complete successfully.

...
2012-12-25 08:15:09,480 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
2012-12-25 08:15:09,480 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/vnd.oasis.opendocument.text, but they are not mapped to it in the parse-plugins.xml file
2012-12-25 08:15:09,517 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt of type application/vnd.oasis.opendocument.text
2012-12-25 08:15:09,520 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2012-12-25 08:15:09,521 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
2012-12-25 08:15:09,545 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf of type application/pdf
2012-12-25 08:15:09,551 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
2012-12-25 08:15:09,560 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt of type application/vnd.oasis.opendocument.text
2012-12-25 08:15:09,563 INFO parse.ParserJob - Parsing http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
2012-12-25 08:15:09,590 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf of type application/pdf
2012-12-25 08:15:09,597 INFO parse.ParserJob - Parsing http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2012-12-25 08:15:09,652 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf of type application/pdf
...

I checked the parse-plugins.xml file and found no plugins mapped to the types application/pdf and application/vnd.oasis.opendocument.text. I know that parse-tika handles PDF files, so why do those errors still occur?

Are there any documents/links that explain, in an easy way, how to install and activate the supported parsers mentioned at [1] for the Nutch parser?

[1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format

Thanks,

On Tue, Dec 25, 2012 at 7:16 AM, Bayu Widyasanyata <[email protected]> wrote:
> Hi All,
>
> I'm new to Nutch and Solr, with the following platform:
> - nutch 2.1
> - solr 4.0
> - jdk 1.7 on ubuntu 10.04
>
> I'm also a "member" of the legendary implementation of Nutch with
> MySQL at http://nlp.solutions.asia/?p=180 ;-)
> I have installed all of the above successfully, with some minor
> corrections to the table structure (i.e. renaming the "typ" column to
> "type" and changing its size to varchar(64)).
>
> I created an index.html (with simple text inside) at URL
> http://localhost/sapi/ and put it into urls/seed.txt as the source URL
> to crawl.
> For testing, I created 5 links in index.html pointing to 5 documents in
> 2 formats (pdf and odt) and 2 filename styles (with spaces and without
> spaces):
>
> 1. http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> 2. http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 3. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 4. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> 5. http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>
> *The %20 in the links above is actually a whitespace character.
> I only copied what my browser read/interpreted and converted into safe URLs.
> **The conversion rule above (for the space character) has also been
> applied in the regex-normalize.xml file.
>
> Here are some facts and doubts I have after playing around with Nutch and Solr:
>
> 1. All of those docs were parsed "successfully", since their status is "2".
> 2. I put "successfully" in quotes because some of the docs (#1 and #2
> above) have no value in the "text" column of the webpage MySQL
> table. It means Nutch failed to parse those docs. CMIIW.
> 3. The number of docs (numdocs) reported in Solr Admin is always 2
> docs! :( Only index.html and document 4,
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt,
> were successfully indexed by Solr, even when I repeat the crawl and
> reindex process many times.
>
> Below is the two-command bash script I use to crawl and index my page:
>
> #!/bin/bash
> ./runtime/local/bin/nutch crawl urls -depth 3 -topN 5
> ./runtime/local/bin/nutch solrindex http://localhost:8080/solr/ -reindex
>
> I appreciate any help.
>
> TIA
>
> --
> wassalam,
> [bayu]

--
wassalam,
[bayu]
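
P.S. The log check described in the update can be sketched as a small bash helper. It is only a rough sketch: it assumes the failure lines look exactly like the WARN lines quoted above ("Unable to successfully parse content <url> of type <mime>") and that the URLs contain no literal spaces. The function name summarize_parse_failures and the log path in the usage comment are my own assumptions, not part of Nutch.

```shell
#!/bin/bash
# Count parse failures per content type in a Nutch hadoop.log.
# Assumption: each failure line ends with "... of type <mime-type>",
# as in the WARN parse.ParseUtil lines quoted in this thread.
summarize_parse_failures() {
  local log_file="$1"
  # Keep only the failure lines, take the last field (the MIME type),
  # then count occurrences and list the most frequent type first.
  grep 'Unable to successfully parse content' "$log_file" \
    | awk '{ print $NF }' \
    | sort | uniq -c | sort -rn
}

# Usage (the path is an assumption; adjust to where your hadoop.log lives):
#   summarize_parse_failures runtime/local/logs/hadoop.log
```

This only shows which content types fail to parse and how often; it does not explain why the parser was not mapped to them.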

