Re: Nutch vs Lucidworks Fusion

2014-10-02 Thread Bayu Widyasanyata
I haven't used Fusion yet, but I have already played with Lucidworks 2.8. The native embedded crawler for Lucidworks is Aperture [0]. IMHO Nutch is better than Aperture in terms of stability, speed and features. [0] http://sourceforge.net/projects/aperture/ On Wed, Oct 1, 2014 at 9:19 AM, Jorge Luis

Re: Incremental crawling with nutch

2014-06-07 Thread Bayu Widyasanyata
to use for testing recrawl? Maybe I did some steps wrong. Regards. On Fri, Jun 6, 2014 at 7:01 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Just curious, I will go back to the lab and prove it --- wassalam, [bayu] /sent from Android phone/ On Jun 6, 2014 5:37 PM, Ali

Re: Crawling local file system - file not parse

2014-06-06 Thread Bayu Widyasanyata
helps, Sebastian On 06/05/2014 06:38 AM, Bayu Widyasanyata wrote: Hi, I'm sure this is an old topic, but I still have no luck crawling with it. It's a little bit harder than crawling the web / http protocol :( The following are some important files I configured: (1) urls/seed.txt file

Re: Incremental crawling with nutch

2014-06-06 Thread Bayu Widyasanyata
Hi Ali, This blog [0] may help. [0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ On Thu, Jun 5, 2014 at 12:32 AM, Ali Nazemian alinazem...@gmail.com wrote: Thank you very much. But it is just a parameter for specifying the interval between re-crawls. The problem is
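For reference, a minimal sketch of one re-crawl round using the Nutch 1.x command line; the crawl directory, topN and Solr URL below are only examples, not the poster's actual setup:

  # one crawl/re-crawl round (paths and Solr URL are examples)
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=$(ls -d crawl/segments/2* | tail -1)   # newest segment directory
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb $SEGMENT

Pages become eligible for re-fetch again once their fetch interval has elapsed, so repeating this round picks up changed pages.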

Re: Incremental crawling with nutch

2014-06-06 Thread Bayu Widyasanyata
mentioned. Regards. On Fri, Jun 6, 2014 at 2:14 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi Ali, This blog [0] may help. [0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ On Thu, Jun 5, 2014 at 12:32 AM, Ali Nazemian alinazem...@gmail.com

Crawling web and intranet files into single crawldb

2014-06-04 Thread Bayu Widyasanyata
Hi, I am successfully running Nutch 1.8 and Solr 4.8.1 to fetch and index web sources (http protocol). Now I want to add file share data sources (file protocol) into the current crawldb. What is the strategy or common practice to handle this situation? Thank you.- -- wassalam, [bayu]

Re: Crawling web and intranet files into single crawldb

2014-06-04 Thread Bayu Widyasanyata
Hi Markus, These are the files I should configure: = prefix-urlfilter.txt: put file://, which is already configured. = regex-urlfilter.txt: update the following line from -^(file|ftp|mailto): to -^(ftp|mailto): = urls/seed.txt: add the new URL/file path. ...and start crawling. Is that enough? CMIIW Thanks-
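For illustration, the corresponding lines could look like the following; the local path in seed.txt is only an example:

  # conf/prefix-urlfilter.txt -- allow the file scheme
  file://

  # conf/regex-urlfilter.txt -- stop skipping file: URLs
  -^(ftp|mailto):

  # urls/seed.txt -- an example local path (note the three slashes)
  file:///opt/data/docs/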

Re: Crawling web and intranet files into single crawldb

2014-06-04 Thread Bayu Widyasanyata
Hi Markus, Did you mean I should remove the file:// line from prefix-urlfilter.txt? When I checked with the command: bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined urls/seed.txt, it returned: Checking combination of all URLFilters available -http://www.myurl.com

Re: Crawling web and intranet files into single crawldb

2014-06-04 Thread Bayu Widyasanyata
OK, thanks! :) On Wed, Jun 4, 2014 at 8:28 PM, Markus Jelsma markus.jel...@openindex.io wrote: ah yes. i am wrong, do not remove it :) -Original message- From:Bayu Widyasanyata bwidyasany...@gmail.com Sent:Wed 04-06-2014 15:25 Subject:Re: Crawling web and intranet files into

Crawling local file system - file not parse

2014-06-04 Thread Bayu Widyasanyata
Hi, I'm sure this is an old topic, but I still have no luck crawling with it. It's a little bit harder than crawling the web / http protocol :( The following are some important files I configured: (1) urls/seed.txt file://opt/searchengine/test/ which contains one file: -rw-r--r-- 1 bayu bayu 3272 Jun 5
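For local file crawling, the protocol-file plugin also has to be enabled in nutch-site.xml if it is not already; the sketch below is based on the 1.x default plugin list with protocol-http swapped for protocol-file, and the unlimited file.content.limit is only an example:

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>  <!-- -1 = do not truncate files before parsing -->
  </property>

Local paths are usually written with three slashes (file:///path), since the part after file:// is otherwise treated as a hostname.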

Re: Pull in data from database (RDBMS)

2014-05-29 Thread Bayu Widyasanyata
at http://manifoldcf.apache.org? Might be a better fit for what you are describing. Not sure it does parsing though. On 23 May 2014 11:08, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi, Could anyone point me to documentation on how to pull in (fetch) data from a database (e.g

Re: Nutch fetch local files with arbitrary mapped URLs

2014-05-25 Thread Bayu Widyasanyata
Hi Martin, Just put the files inside a common web server's docroot and serve them from there. If their URIs are fixed URLs, then you can create a local hostname with local DNS support (not provided by Internet DNS). Hope it helps. --- wassalam, [bayu] /sent from Android phone/ On May 24, 2014 7:16 PM, Martin

Pull in data from database (RDBMS)

2014-05-23 Thread Bayu Widyasanyata
Hi, Could anyone point me to documentation on how to pull in (fetch) data from a database (e.g. a common RDBMS such as MySQL, etc.) with Nutch? The rest of the process would be the usual Nutch steps: parse and index them. Thanks in advance. -- wassalam, [bayu]

Re: Nutch survey

2014-05-21 Thread Bayu Widyasanyata
Done! Great Julien! On Wed, May 21, 2014 at 10:58 PM, Markus Jelsma markus.jel...@openindex.io wrote: Great! Done! :-) Julien Nioche lists.digitalpeb...@gmail.com schreef: Hi everyone! I had written a survey about Nutch and its uses and would be very grateful if you could take a couple of

Re: nutch dedup on 1.8

2014-05-19 Thread Bayu Widyasanyata
. Thanks! Julien On 15 May 2014 05:29, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi All, I want to run deduplication on Nutch 1.8 using the command: nutch dedup solr_URL, since the nutch solrdedup command is not supported anymore in 1.8. But this command raised an error: 2014-05-15

Re: Minor typo on Apache Nutch News - Tika 1.5

2014-05-19 Thread Bayu Widyasanyata
You're welcome! Great! On Sat, May 10, 2014 at 1:56 AM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Bayu, it's fixed now. Thanks! Sebastian On 05/06/2014 12:28 AM, Bayu Widyasanyata wrote: Hi, I think there is a minor typo on this page [0] regarding the latest Tika included

Re: Problem with regex url filter

2014-05-19 Thread Bayu Widyasanyata
itself? Thanks Paul On 5 May 2014 18:57, Bayu Widyasanyata bwidyasany...@gmail.com wrote: On Tue, May 6, 2014 at 6:05 AM, Paul Rogers paul.roge...@gmail.com wrote: By that do you mean using file:// as opposed to http:// crawling? Yupe. https://wiki.apache.org/nutch/FAQ

Re: Problem with regex url filter

2014-05-19 Thread Bayu Widyasanyata
that excludes directories (and their listings) but includes any files in them. Thanks P On 19 May 2014 09:31, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi Paul, Apologies for the late reply; I had other tasks that needed to be finished. The common practice, if your website is a common

nutch dedup on 1.8

2014-05-15 Thread Bayu Widyasanyata
Hi All, I want to run deduplication on Nutch 1.8 using the command: nutch dedup solr_URL, since the nutch solrdedup command is not supported anymore in 1.8. But this command raised an error: 2014-05-15 11:19:59,334 INFO crawl.DeduplicationJob - DeduplicationJob: starting at 2014-05-15 11:19:59
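For what it's worth, in 1.8 the DeduplicationJob works on the crawldb rather than on a Solr URL, so the invocation would look roughly like this (the crawldb path is an example):

  # mark duplicate entries in the crawldb (Nutch 1.8 DeduplicationJob)
  bin/nutch dedup crawl/crawldb

Entries marked as duplicates should then be removed from the index by the subsequent indexing/cleaning jobs.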

Minor typo on Apache Nutch News - Tika 1.5

2014-05-05 Thread Bayu Widyasanyata
Hi, I think there is a minor typo on this page [0] regarding the latest Tika included in Nutch 1.8. It says it includes a library upgrade to Apache Tika 1.4, while in the detailed changes it was actually Tika 1.5 [1] or this [2]. Thanks.- [0]

Re: Nutch 1.8 CrawlDb update error

2014-05-05 Thread Bayu Widyasanyata
I also experienced the same thing [checksum error] :( I couldn't avoid deleting the segment and refetching again... Deleting .crc files, or other files inside the segments, didn't help much. Thanks.- On Tue, May 6, 2014 at 2:55 AM, Sebastian Nagel wastl.na...@googlemail.com wrote: Caused by:

Re: Problem with regex url filter

2014-05-05 Thread Bayu Widyasanyata
On Mon, May 5, 2014 at 10:34 PM, Paul Rogers paul.roge...@gmail.com wrote: My question is how do I get Nutch to crawl all the files on a web site, not just the root URL? Hi, Nutch acts as a crawler, much the same as when we use any Internet browser. Nutch (or we) can't browse or crawl the pages that

Re: Problem with regex url filter

2014-05-05 Thread Bayu Widyasanyata
On Tue, May 6, 2014 at 6:05 AM, Paul Rogers paul.roge...@gmail.com wrote: By that do you mean using file:// as opposed to http:// crawling? Yupe. https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol -- wassalam, [bayu]

Re: One site only index.

2014-04-03 Thread Bayu Widyasanyata
Hi Shane, The regex-urlfilter.txt will exclude someurl.com when you run one or more cycles of the inject, generate, fetch, parse, update, solrupdate process. The regex-urlfilter.txt also affects the updatedb and solrindex steps when the -filter parameter is applied. Regards, On Thu, Apr 3, 2014 at
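As an illustration, an exclusion rule for that host in regex-urlfilter.txt could look like the following (the exact pattern is an assumption):

  # skip every URL on someurl.com and its subdomains
  -^https?://([a-zA-Z0-9-]+\.)*someurl\.com/

Rules are evaluated in order, so the exclusion has to appear before the final catch-all +. line.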

Re: Re: Please help - Nutch fetch command not fetching data

2014-02-22 Thread Bayu Widyasanyata
Hi, Have you checked the hadoop.log?

How to check URL that have been indexed by Solr?

2014-02-17 Thread Bayu Widyasanyata
Hi, Sometimes we accidentally crawl unneeded URL formats and even push them through the last solrindex step. As we know, we can drop or delete those URLs by adding a regex to regex-urlfilter.txt and running nutch updatedb. Those URLs will then be dropped/deleted from the crawldb database. But how do we ensure URLs that
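One way to check is to query Solr directly for the url field; a sketch, where the core name and field follow the stock Nutch Solr schema and are assumptions about the actual setup:

  # list the first 100 indexed URLs
  curl 'http://localhost:8983/solr/collection1/select?q=*:*&fl=url&rows=100&wt=json'

Note that updatedb only cleans the crawldb; documents already in Solr stay there until they are removed with a delete-by-query or by a cleaning job.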

Re: Nutch didn't (fail) to create new segment dir

2014-02-15 Thread Bayu Widyasanyata
I just fixed the pattern with the following: -^http://.*ccm_paging_p.*$ and put it earlier in the file. Case closed. Thank you Tejas! On Sat, Feb 15, 2014 at 8:53 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi Tejas, You're right! It was my mistake: regex-urlfilter.txt problems. It started when I

Nutch didn't (fail) to create new segment dir

2014-02-14 Thread Bayu Widyasanyata
Hi, From what I know, nutch generate creates a new segment directory every round Nutch runs. I have a problem (it never happened before) where Nutch won't create a new segment. It only fetches and parses the latest segment. - from the logs: 2014-02-15 07:20:02,036 INFO

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

2014-02-02 Thread Bayu Widyasanyata
Yupe, thanks! --- wassalam, [bayu] /sent from Android phone/ On Feb 2, 2014 10:51 PM, Tejas Patil tejas.patil...@gmail.com wrote: On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi Tejas, It works, and it's great! :) After reconfiguring and many times

Re: Strange: Nutch didn't crawl level 2 (depth 2) pages

2014-01-26 Thread Bayu Widyasanyata
this is verified and everything looks good from the crawling side, run solrindex and check if you get the query results. If not, then there was a problem while indexing the stuff. Thanks, Tejas On Sun, Jan 26, 2014 at 9:09 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi, I just

Re: How to set JVM heap size on crawl script?

2013-11-03 Thread Bayu Widyasanyata
=... ... That's not a system property because the argument -D... comes after the class to be run. Most (if not all) Nutch tools/commands use ToolRunner.run() which supports generic options (among them -Dproperty=value). Sebastian On 11/01/2013 12:54 AM, Bayu Widyasanyata wrote: Hi, One more

Re: How to set JVM heap size on crawl script?

2013-10-31 Thread Bayu Widyasanyata
(see comments in bin/nutch): NUTCH_HEAPSIZE (in MB) and NUTCH_OPTS (extra Java runtime options). export NUTCH_HEAPSIZE=2048 should work, but so does export NUTCH_OPTS=-Xmx2048m. The latter allows adding more Java options separated by spaces. Sebastian 2013/10/30 Bayu
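In other words, a small sketch of the environment setup before running the crawl script; the values and the bin/crawl arguments (seed dir, crawl dir, Solr URL, number of rounds, per the 1.x script) are just placeholders:

  export NUTCH_HEAPSIZE=2048            # heap size in MB, read by bin/nutch
  export NUTCH_OPTS="-Xmx2048m"         # extra JVM options, space-separated
  bin/crawl urls crawl http://localhost:8983/solr/ 2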

Re: How to set JVM heap size on crawl script?

2013-10-31 Thread Bayu Widyasanyata
On Thu, Oct 31, 2013 at 8:43 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi Sebastian, Thanks for the hint. --- wassalam, [bayu] /sent from Android phone/ On Oct 30, 2013 7:54 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, the script bin/crawl executes bin/nutch

Re: Delete specific host DB index on Solr database

2013-10-02 Thread Bayu Widyasanyata
documents from Solr. On Wed, Oct 2, 2013 at 7:24 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi, One of my seed URLs was changed to a new CMS, which affected its URI format. How can I delete the old CMS URL format from the Solr database, so that I can recrawl and reindex again
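For the record, one way to drop the old documents is a delete-by-query against Solr before recrawling; the core name, the host field (from the stock Nutch Solr schema) and the hostname below are all assumptions:

  curl 'http://localhost:8983/solr/collection1/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>host:old.example.com</query></delete>'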

How to protect Solr 4.1 Admin page?

2013-02-07 Thread Bayu Widyasanyata
Hi, I'm sure it's an old question.. I just want to protect the Admin page (/solr) with Basic Authentication, but I can't find a good answer out there yet. I use Solr 4.1 with Apache Tomcat/7.0.35. Could anyone give me quick hints or links? Thanks in advance! -- wassalam, [bayu]
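For the curious, the usual Tomcat-level approach is a security constraint in the Solr webapp's WEB-INF/web.xml plus a user in conf/tomcat-users.xml; the role name and credentials below are only examples:

  <!-- WEB-INF/web.xml -->
  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr admin</web-resource-name>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>solr-admin</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Solr</realm-name>
  </login-config>

  <!-- conf/tomcat-users.xml -->
  <role rolename="solr-admin"/>
  <user username="admin" password="changeme" roles="solr-admin"/>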

Re: How to protect Solr 4.1 Admin page?

2013-02-07 Thread Bayu Widyasanyata
Ooops.. apologies for the wrong posting here! :( It should go to the solr-user group. On Fri, Feb 8, 2013 at 2:18 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi, I'm sure it's an old question.. I just want to protect the Admin page (/solr) with Basic Authentication, but I can't find a good answer

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-15 Thread Bayu Widyasanyata
On Tue, Jan 15, 2013 at 11:28 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Did you check the http.accept property in nutch-site.xml? I copied it from nutch-default.xml, then added application/pdf: <property> <name>http.accept</name>
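For reference, the full property could look like this; the value follows the usual nutch-default.xml default with application/pdf appended:

  <property>
    <name>http.accept</name>
    <value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf,*/*;q=0.8</value>
  </property>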

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-15 Thread Bayu Widyasanyata
%20Utama%20Daripada%20Dunia.pdf - Url --- http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf - Metadata - xmp:CreatorTool : Writer meta:author : Bayu Widyasanyata xmpTPg:NPages : 1 dc:creator : Bayu Widyasanyata Content-Type

Re: How segments is created?

2013-01-12 Thread Bayu Widyasanyata
On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil tejas.patil...@gmail.com wrote: Well, if you know that the front page is updated frequently, set db.fetch.interval.default to a lower value so that URLs will be eligible for re-fetch sooner. By default, if a URL is fetched successfully, it becomes
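As a sketch, lowering the interval in nutch-site.xml could look like this (one day instead of the 30-day default; the value is only an example):

  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>  <!-- seconds -->
  </property>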

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-10 Thread Bayu Widyasanyata
know what the correct filename of the jar file should be: mysql.jar, or should it be named mysql-connector-java.jar? Which one will Nutch call/refer to? On Tue, Jan 8, 2013 at 2:47 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi Lewis, Thanks for the link! On Tue, Jan 8, 2013 at 6:11 AM, Lewis John

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-10 Thread Bayu Widyasanyata
Yes, I forgot those things even though I had already put them in my notes from a previous installation. I'm quite new to Nutch and also to Java development :) Thanks! On Fri, Jan 11, 2013 at 7:01 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, java.io.IOException: java.lang.ClassNotFoundException:

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-10 Thread Bayu Widyasanyata
, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Yes, I forgot those things even though I had already put them in my notes from a previous installation. I'm quite new to Nutch and also to Java development :) Thanks! On Fri, Jan 11, 2013 at 7:01 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-10 Thread Bayu Widyasanyata
For clarity, the log below is about 4 of my 5 PDF docs that can't be parsed by Nutch. On Fri, Jan 11, 2013 at 8:29 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Nutch parsing still has problems with PDF files. Only 1 PDF can be parsed successfully. 2013-01-11 08:11:23,679 WARN

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-07 Thread Bayu Widyasanyata
Hi Lewis, Thanks for the link! On Tue, Jan 8, 2013 at 6:11 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Bayu, On Sat, Jan 5, 2013 at 7:43 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Can anyone give me a hint? In parallel I switched to the Nutch 1.6 binary

Re: Not all parsed docs is indexed inconsistent parsed docs.

2013-01-05 Thread Bayu Widyasanyata
? In parallel I switched to the Nutch 1.6 binary and it works well. But I'm curious to use the latest Nutch 2.1. Thanks in advance! On Sun, Dec 30, 2012 at 1:46 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi, Thank you for the suggestions. And I tried to upgrade Tika to 1.2 as mentioned

Re: generate.max.count was not affected

2013-01-05 Thread Bayu Widyasanyata
Problem fixed :) Many thanks! On Sun, Jan 6, 2013 at 9:15 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: I think that was the problem; in my nutch-site.xml I had <property> <name>generate.max.per.host</name> <value>100</value> </property> even though it's deprecated. OK, I
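For completeness, the non-deprecated property is generate.max.count; depending on the Nutch version, generate.count.mode (host or domain) controls how the count is grouped. A sketch with the same limit of 100 per host:

  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>  <!-- host or domain -->
  </property>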

Re: Not all parsed docs is indexed inconsistent parsed docs.

2012-12-29 Thread Bayu Widyasanyata
, so unless there is something peculiar with all your files or setup, have you tried: - checking the size of the files to see if they are over the configured limits - using the nutch parsechecker command to test individual files Cheers, Dave On 25 Dec 2012, at 01:34, Bayu Widyasanyata bwidyasany

Not all parsed docs is indexed inconsistent parsed docs.

2012-12-24 Thread Bayu Widyasanyata
Hi All, I'm new to Nutch and Solr, with the following platforms: - Nutch 2.1 - Solr 4.0 - JDK 1.7 on Ubuntu 10.04. I'm also one of the members of the legendary Nutch-with-MySQL implementation at http://nlp.solutions.asia/?p=180 ;-) I have installed all of the above successfully with some minor corrections

Re: Not all parsed docs is indexed inconsistent parsed docs.

2012-12-24 Thread Bayu Widyasanyata
#Portable_Document_Format Thanks, On Tue, Dec 25, 2012 at 7:16 AM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Hi All, I'm new to Nutch and Solr, with the following platforms: - Nutch 2.1 - Solr 4.0 - JDK 1.7 on Ubuntu 10.04. I'm also one of the members of the legendary Nutch-with-MySQL implementation