Re: Not all parsed docs is indexed & inconsistent parsed docs.

Bayu Widyasanyata Mon, 24 Dec 2012 17:34:44 -0800

Hi,

==Update==

Checking hadoop.log, I found some interesting entries showing that
parsing did not complete successfully.

...
2012-12-25 08:15:09,480 INFO  parse.ParserJob - Parsing
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
2012-12-25 08:15:09,480 INFO  parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/vnd.oasis.opendocument.text, but they are not mapped
to it  in the parse-plugins.xml file
2012-12-25 08:15:09,517 WARN  parse.ParseUtil - Unable to successfully
parse content 
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
of type application/vnd.oasis.opendocument.text
2012-12-25 08:15:09,520 INFO  parse.ParserJob - Parsing
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2012-12-25 08:15:09,521 INFO  parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it  in the
parse-plugins.xml file
2012-12-25 08:15:09,545 WARN  parse.ParseUtil - Unable to successfully
parse content 
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
of type application/pdf
2012-12-25 08:15:09,551 INFO  parse.ParserJob - Parsing
http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
2012-12-25 08:15:09,560 WARN  parse.ParseUtil - Unable to successfully
parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
of type application/vnd.oasis.opendocument.text
2012-12-25 08:15:09,563 INFO  parse.ParserJob - Parsing
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
2012-12-25 08:15:09,590 WARN  parse.ParseUtil - Unable to successfully
parse content 
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
of type application/pdf
2012-12-25 08:15:09,597 INFO  parse.ParserJob - Parsing
http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2012-12-25 08:15:09,652 WARN  parse.ParseUtil - Unable to successfully
parse content 
http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
of type application/pdf
...
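
A minimal sketch for tallying these warnings per content type. It
assumes each entry sits on a single line in the real hadoop.log (the
wrapping above comes from the mail client); the function name and the
log path in the usage comment are only illustrative.

```shell
#!/bin/bash
# Count "Unable to successfully parse content" warnings per content type.
# Assumption: every log entry is on one line and ends with "of type <mime>".
summarize_parse_failures() {
  local log_file="$1"
  grep 'Unable to successfully parse content' "$log_file" \
    | sed -E 's/.* of type ([^ ]+).*/\1/' \
    | sort | uniq -c | sort -rn
}

# Hypothetical usage; adjust the path to wherever your hadoop.log lives:
# summarize_parse_failures runtime/local/logs/hadoop.log
```

Each output line is a count followed by the content type, most
frequent first.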

I checked the parse-plugins.xml file and found no plugin mapped to the
types application/pdf or application/vnd.oasis.opendocument.text.
I know that parse-tika handles PDF files, so why do these errors still
occur?
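
The manual check of parse-plugins.xml can be sketched as a small
script. It assumes explicit mappings appear as
<mimeType name="..."> elements; the function names and the conf path in
the usage comment are hypothetical. It only reports whether a mapping
exists and says nothing about why the parse itself fails.

```shell
#!/bin/bash
# Succeed if parse-plugins.xml has an explicit <mimeType> entry for a type.
# Assumption: entries are written as <mimeType name="application/pdf">.
has_mime_mapping() {
  local config_file="$1" mime_type="$2"
  grep -qF -- "<mimeType name=\"${mime_type}\"" "$config_file"
}

# Print "mapped" or "not mapped" for each content type given after the file.
report_mappings() {
  local config_file="$1"
  shift
  local mime_type
  for mime_type in "$@"; do
    if has_mime_mapping "$config_file" "$mime_type"; then
      echo "$mime_type: mapped"
    else
      echo "$mime_type: not mapped"
    fi
  done
}

# Hypothetical usage; the conf path is an assumption:
# report_mappings runtime/local/conf/parse-plugins.xml \
#   application/pdf application/vnd.oasis.opendocument.text
```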

Are there any documents or links that explain, in a simple way, how to
install and activate the plugins for the supported formats listed at
[1] in the Nutch parser?

[1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format

Thanks,

On Tue, Dec 25, 2012 at 7:16 AM, Bayu Widyasanyata
<[email protected]> wrote:
> Hi All,
>
> I'm new to Nutch and Solr, with the following platform:
> - nutch 2.1
> - solr 4.0
> - jdk 1.7 on ubuntu 10.04
>
> I'm also a "member" of the legendary Nutch with MySQL implementation
> at http://nlp.solutions.asia/?p=180 ;-)
> I have installed all of the above successfully, with some minor
> corrections to the table structure (i.e. renaming the "typ" column to
> "type" and changing its size to varchar(64)).
>
> I created an index.html (with simple text inside) at the URL
> http://localhost/sapi/ and put it into urls/seed.txt as the source URL
> to crawl.
> For testing, I added 5 links to index.html pointing to 5 documents in
> 2 formats (pdf and odt) and 2 filename styles (with spaces and
> without spaces):
>
> 1. http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> 2. http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 3. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 4. http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> 5. http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>
> *The %20 sequences in the links above are actually whitespace
> characters. I copied the URLs as my browser interpreted them and
> converted them into safe URLs.
> **The same conversion rule (for the space character) has also been
> applied in the regex-normalize.xml file.
>
> Here are some facts and doubts I have after playing around with Nutch
> and Solr:
>
> 1. All of those docs were parsed "successfully", since their status
> is "2".
> 2. I put "successfully" in quotes because some of the docs (#1 and #2
> above) have no value in the "text" column of the webpage MySQL
> table. That means Nutch failed to parse those docs. CMIIW.
> 3. The number of docs (numdocs) reported by Solr Admin is always 2
> docs! :( Only index.html and doc #4,
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> are successfully indexed by Solr, even when I repeat the crawl and
> reindex process many times.
>
> Below are the 2 commands, in a single bash script, that I use to
> crawl and index my page:
>
> #!/bin/bash
> ./runtime/local/bin/nutch crawl urls -depth 3 -topN 5
> ./runtime/local/bin/nutch solrindex http://localhost:8080/solr/ -reindex
>
> I'd appreciate any help.
>
> TIA
>
> --
> wassalam,
> [bayu]

-- 
wassalam,
[bayu]
