Re: Not all parsed docs is indexed & inconsistent parsed docs.

Bayu Widyasanyata Sat, 12 Jan 2013 20:49:45 -0800

On Sun, Jan 13, 2013 at 12:02 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
>
> On Fri, Jan 11, 2013 at 3:12 PM, Bayu Widyasanyata
> <[email protected]>wrote:
>
> >
> > We can see that some of parse processes were not completed successfully.
> >
>
> Yes I see this. I also see that you have a http.proxy.port = 8080 but no
> proxy host and that the protocol-httpclient plugin is not activated.
>

That's the Tomcat port for Solr.
Should we activate the proxy setting?

> I also see some strange fetcher behaviour as it seems to fetch the server
> instance e.g. 2013-01-12 05:37:41,987 INFO  fetcher.FetcherJob - fetching
> http://localhost/, however I assume there is no document @ this location on
> the server...
>
>
There is an index.html at that URL.
Here is its content:

<html>
<head><title>Example document links</title></head>
<body>
<h3>testing documents</h3>
<p>
However, in reality we see many modern people behaving in just the opposite
way. And this is not shown only by ordinary people. Even some people who
claim to be Muslim display the opposite attitude. When it comes to
opportunities for success in this world, they become very serious. They
devote attention, time, energy and money without hesitation. But when it
comes to opportunities for success in the hereafter, they are half-hearted,
even playing around and joking. They are very focused on worldly success
but care very little about success in the hereafter. It is as if worldly
success were something real, while success in the hereafter were merely a
dream without proof. Why does this happen?
</p>
<ol>
<li><a href="sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf">this is
an example document without spaces</a></li>
<li><a href="sapi/spasi Akhirat Lebih Utama Daripada Dunia.pdf">example
with spaces</a></li>
<li><a href="sapi/Akhirat Lebih Utama Daripada Dunia.pdf">second example
with spaces</a></li>
<li><a href="sapi/Akhirat Lebih Utama Daripada Dunia.odt">odt file with
blank spaces</a></li>
<li><a href="sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt">odt file with
underscores</a></li>
</ol>
This is an additional document, <a href="sapi/Solr-install-v2.pdf">Solr Installation</a>,
that Bayu made :-).
</body>
</html>

This index.html was successfully parsed and indexed.
I can see the records in the MySQL database.

Only this index.html and the single .odt file I mentioned before were parsed,
and their contents exist in the database.
But the strange thing is that the status of every fetched document is 2.
If I'm not mistaken, status 5 means the document was indexed successfully. CMIIW.

> That being said, as we've established fetching does not seem to be the
> problem.
>
> Unless you wish to skip parsing for truncated documents then you will need
> to increase the http.content.limit to something over ~40K. This will then
> remove the following log output (meaning that the document should be fully
> parsed)
> 2013-01-12 05:38:27,508 WARN  parse.ParserJob -
> http://localhost/sapi/Solr-install-v2.pdf skipped. Content of size 395125
> was truncated to 65536
> You may also wish to consider the parser.skip.truncated property in
> nutch-site.xml
>
>
OK. I can increase it.
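
For reference, a minimal sketch of how that change might look in
nutch-site.xml. The value 524288 is only an assumption, chosen because it is
larger than the 395125 bytes reported in the log above; the entries go inside
the existing <configuration> element.

```xml
<!-- Sketch only: the limit below is an assumed value, not a recommendation. -->
<property>
  <name>http.content.limit</name>
  <!-- Maximum number of bytes kept per fetched document.
       A negative value such as -1 disables truncation. -->
  <value>524288</value>
</property>

<property>
  <name>parser.skip.truncated</name>
  <!-- When true, documents that were truncated are skipped by the parser. -->
  <value>true</value>
</property>
```

With the limit raised above the size of the largest PDF, the "was truncated
to 65536" warning should no longer apply to that file.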

> I don't suppose these PDF's are password protected or something like that?
>
>
Nope.
I just created the .odt files and saved them as PDF files.

> I would also explicitly map the content type
> application/vnd.oasis.opendocument.text to parse-tika in parse-plugins.xml.
>
> 2013-01-12 05:39:07,594 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/vnd.oasis.opendocument.text, but they are not mapped to it  in
> the parse-plugins.xml file
>

Yupe. I will do it.
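
A minimal sketch of that mapping in conf/parse-plugins.xml, assuming the
parse-tika alias that ships in the default file is still present. The element
goes inside the existing <parse-plugins> root, next to the other mimeType
entries.

```xml
<!-- Sketch only: maps the OpenDocument text type to the Tika parser. -->
<mimeType name="application/vnd.oasis.opendocument.text">
  <plugin id="parse-tika" />
</mimeType>

<!-- Assumed to exist already in the aliases section of the same file:
<alias name="parse-tika"
       extension-id="org.apache.nutch.parse.tika.TikaParser" />
-->
```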

So, why could the PDF parser not completely parse all of the PDF documents?

-- 
wassalam,
[bayu]
