Hi all,

I'm new to Nutch and Solr, with the following platform:
- Nutch 2.1
- Solr 4.0
- JDK 1.7 on Ubuntu 10.04

I'm also a "member" of the legendary Nutch with MySQL implementation at http://nlp.solutions.asia/?p=180 ;-) I have installed all of the above successfully, with some minor corrections to the table structure (i.e. renaming the "typ" column to "type" and changing its size to varchar(64)).

I created an index.html (with simple text inside) at http://localhost/sapi/ and put that URL into urls/seed.txt as the seed URL to crawl. For testing, I added 5 links to index.html pointing to 5 documents in 2 formats (PDF and ODT), with filenames both with and without spaces:

1. http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
2. http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
3. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
4. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
5. http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt

* The %20 in the links above is actually a whitespace character. I only copied what my browser read/interpreted and converted into safe URLs.
** The same conversion rule (for the space character) has also been applied in the regex-normalize.xml file.

Here are some facts and doubts I have after playing around with Nutch and Solr:

1. All of those docs were parsed "successfully", since their status is "2".
2. I put "successfully" in quotes because some of the docs (#1 and #2 above) have no value in the "text" column of the webpage MySQL table. I take that to mean Nutch failed to parse those docs. CMIIW.
3. The number of docs (numDocs) reported in Solr Admin is always 2! :( Only index.html and doc #4 (http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt) are successfully indexed by Solr, even when I repeat the crawl and reindex process many times.

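For reference, the space-to-%20 conversion mentioned in the note on the test links can be sketched in shell as below. This is only an illustration: the function name encode_spaces is made up here, it handles nothing but the space character, and it is not a claim about what regex-normalize.xml does inside Nutch.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nutch or Solr): percent-encode the
# spaces in a URL so it matches the form the browser shows.
encode_spaces() {
  # sed replaces every literal space with %20; other characters are untouched.
  printf '%s\n' "$1" | sed 's/ /%20/g'
}

# Example with one of the test documents (filename containing spaces).
encode_spaces "http://localhost/sapi/Akhirat Lebih Utama Daripada Dunia.odt"
```

URLs without spaces, such as doc #1 and doc #5, pass through unchanged.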
Below are the 2 commands, in a single bash script, that I use to crawl and index my page:

    #!/bin/bash
    ./runtime/local/bin/nutch crawl urls -depth 3 -topN 5
    ./runtime/local/bin/nutch solrindex http://localhost:8080/solr/ -reindex

I'd appreciate any help. TIA

--
wassalam,
[bayu]

