I haven't used Fusion yet, but I have already played with Lucidworks 2.8.
The native embedded crawler for Lucidworks is Aperture [0].
IMHO Nutch is better than Aperture in terms of stability, speed and
features.
[0] http://sourceforge.net/projects/aperture/
On Wed, Oct 1, 2014 at 9:19 AM, Jorge Luis
to use for
testing recrawl? Maybe I did some steps wrong.
Regards.
On Fri, Jun 6, 2014 at 7:01 PM, Bayu Widyasanyata bwidyasany...@gmail.com
wrote:
Just curious; I will go back to the lab and prove it.
---
wassalam,
[bayu]
/sent from Android phone/
On Jun 6, 2014 5:37 PM, Ali
helps,
Sebastian
On 06/05/2014 06:38 AM, Bayu Widyasanyata wrote:
Hi,
I'm sure this is an old topic, but I still have no luck crawling with it.
It's a little bit harder than crawling over the web / http protocol :(
Following are some important files I configured:
(1) urls/seed.txt
file
Hi Ali,
This blog [0] may help.
[0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
On Thu, Jun 5, 2014 at 12:32 AM, Ali Nazemian alinazem...@gmail.com wrote:
Thank you very much. But it is just a parameter for specifying the interval
between re-crawls. The problem is
mentioned.
Regards.
On Fri, Jun 6, 2014 at 2:14 PM, Bayu Widyasanyata bwidyasany...@gmail.com
wrote:
Hi Ali,
This blog [0] may help.
[0] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
On Thu, Jun 5, 2014 at 12:32 AM, Ali Nazemian alinazem...@gmail.com
Hi,
I am successfully running Nutch 1.8 and Solr 4.8.1 to fetch and index web
sources (http protocol).
Now I want to add file share data sources (file protocol) into the current
crawldb.
What is the strategy or common practice to handle this situation?
Thank you.-
--
wassalam,
[bayu]
Hi Markus,
The following files are the ones I should configure:
= prefix-urlfilter.txt: put file:// (which is already configured).
= regex-urlfilter.txt: update the following line -^(file|ftp|mailto) to
-^(ftp|mailto):
= urls/seed.txt: add the new URL/file path.
...and start crawling.
Is that enough? CMIIW
Thanks-
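Roughly, that setup looks like the sketch below (hedged: the paths and Solr URL are hypothetical, and the bin/crawl arguments are from the 1.8 script as I remember them):

  # conf/prefix-urlfilter.txt -- prefixes allowed through the prefix filter
  http://
  https://
  file://

  # conf/regex-urlfilter.txt -- stop rejecting the file scheme
  -^(ftp|mailto):

  # urls/seed.txt -- note the three slashes for an absolute local path
  file:///opt/searchengine/test/

  # then run a crawl cycle as usual
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2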
Hi Markus,
Did you mean I should remove the file:// line from prefix-urlfilter.txt?
When I checked with command: bin/nutch
org.apache.nutch.net.URLFilterChecker -allCombined urls/seed.txt, it
returns:
Checking combination of all URLFilters available
-http://www.myurl.com
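(A side note, not from the thread: as far as I know URLFilterChecker prints each URL prefixed with + when all filters accept it and - when some filter rejects it, so the line above means that URL is currently being filtered out. In the versions I have looked at it reads URLs from stdin, so a quick check looks roughly like this, with a hypothetical seed URL:)

  # see whether the combined filters accept a single URL
  echo "file:///opt/searchengine/test/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
  # expected: +file:///opt/searchengine/test/  (accepted)
  #       or: -file:///opt/searchengine/test/  (rejected)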
OK, thanks! :)
On Wed, Jun 4, 2014 at 8:28 PM, Markus Jelsma markus.jel...@openindex.io
wrote:
Ah yes, I am wrong, do not remove it :)
-Original message-
From: Bayu Widyasanyata bwidyasany...@gmail.com
Sent: Wed 04-06-2014 15:25
Subject: Re: Crawling web and intranet files into
Hi,
I'm sure this is an old topic, but I still have no luck crawling with it.
It's a little bit harder than crawling over the web / http protocol :(
Following are some important files I configured:
(1) urls/seed.txt
file://opt/searchengine/test/
which contains one file:
-rw-r--r-- 1 bayu bayu 3272 Jun 5
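(Another thing worth checking, my suggestion rather than something stated in the excerpt: the file protocol plugin has to be enabled in nutch-site.xml, otherwise file: seeds are never fetched. A hedged sketch, assuming an otherwise default 1.8 plugin list:)

  <!-- conf/nutch-site.xml -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>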
at http://manifoldcf.apache.org? Might be a better fit for
what you are describing. Not sure it does parsing though.
On 23 May 2014 11:08, Bayu Widyasanyata bwidyasany...@gmail.com wrote:
Hi,
Could anyone point me to documentation on how to pull in (fetch) data
from a database (e.g
Hi Martin,
Just put them in and serve them as common web server files inside the docroot.
If their URIs are fixed URLs, then you can create a local hostname with local
DNS support (not provided by Internet DNS).
Hope it helps.
---
wassalam,
[bayu]
/sent from Android phone/
On May 24, 2014 7:16 PM, Martin
Hi,
Could anyone point me to documentation on how to pull in (fetch) data
from a database (e.g. a common RDBMS such as MySQL, etc.) with Nutch?
The rest of the process would be the usual Nutch steps: parse and index them.
Thanks in advance.
--
wassalam,
[bayu]
Done! Great Julien!
On Wed, May 21, 2014 at 10:58 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Great! Done! :-)
Julien Nioche lists.digitalpeb...@gmail.com wrote:
Hi everyone!
I have written a survey about Nutch and its uses and would be very grateful
if you could take a couple of
.
Thanks!
Julien
On 15 May 2014 05:29, Bayu Widyasanyata bwidyasany...@gmail.com wrote:
Hi All,
I want to run deduplication on Nutch 1.8 using the command: nutch dedup
solr_URL, since the nutch solrdedup command is not supported anymore on
1.8.
But this command raised an error:
2014-05-15
You're welcome! Great!
On Sat, May 10, 2014 at 1:56 AM, Sebastian Nagel wastl.na...@googlemail.com
wrote:
Hi Bayu,
it's fixed now. Thanks!
Sebastian
On 05/06/2014 12:28 AM, Bayu Widyasanyata wrote:
Hi,
I think there is a minor typo on this page [0] regarding the latest Tika included
itself?
Thanks
Paul
On 5 May 2014 18:57, Bayu Widyasanyata bwidyasany...@gmail.com wrote:
On Tue, May 6, 2014 at 6:05 AM, Paul Rogers paul.roge...@gmail.com
wrote:
By that do you mean using file:// as opposed to http:// crawling?
Yupe.
https://wiki.apache.org/nutch/FAQ
that excludes directories (and their
listings) but includes any files in them.
Thanks
P
On 19 May 2014 09:31, Bayu Widyasanyata bwidyasany...@gmail.com wrote:
Hi Paul,
Apologies for the late reply; I had other tasks that needed to be finished.
The common practice, if your website is a common
Hi All,
I want to run deduplication on Nutch 1.8 using the command: nutch dedup
solr_URL, since the nutch solrdedup command is not supported anymore on 1.8.
But this command raised an error:
2014-05-15 11:19:59,334 INFO crawl.DeduplicationJob - DeduplicationJob:
starting at 2014-05-15 11:19:59
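(A guess based on that DeduplicationJob line, not something confirmed in the excerpt: in 1.8 the dedup command is backed by crawl.DeduplicationJob, which takes the crawldb path rather than a Solr URL, so the invocation would look roughly like this, with hypothetical paths:)

  # mark duplicate documents in the crawldb ...
  bin/nutch dedup crawl/crawldb
  # ... then, I believe, the clean job removes documents marked as
  # duplicate/gone from the index configured via solr.server.url
  bin/nutch clean crawl/crawldb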
Hi,
I think there is a minor typo on this page [0] regarding the latest Tika included on
nutch 1.8.
It was written as includes library upgrade to Apache Tika 1.4, while in the detailed
changes it was actually Tika 1.5 [1] or this [2].
Thanks.-
[0]
I also experienced the same thing [checksum error] :(
I couldn't avoid deleting the segment and doing the fetch again...
Deleting .crc files, or other files inside the segments, didn't help much.
Thanks.-
On Tue, May 6, 2014 at 2:55 AM, Sebastian Nagel
wastl.na...@googlemail.com wrote:
Caused by:
On Mon, May 5, 2014 at 10:34 PM, Paul Rogers paul.roge...@gmail.com wrote:
My question is: how do I get Nutch to crawl all the files on a web site, not
just the root URL?
Hi,
Nutch acts as a crawler, much the same as when we use any Internet browser.
Nutch (or we) can't browse or crawl the pages that
On Tue, May 6, 2014 at 6:05 AM, Paul Rogers paul.roge...@gmail.com wrote:
By that do you mean using file:// as opposed to http:// crawling?
Yupe.
https://wiki.apache.org/nutch/FAQ#Nutch_crawling_parent_directories_for_file_protocol
--
wassalam,
[bayu]
Hi Shane,
The regex-urlfilter.txt will exclude someurl.com when you do one or multiple
cycles of the inject, generate, fetch, parse, update, solrupdate process.
The regex-urlfilter.txt will also affect the updatedb and solrindex
steps when the -filter parameter is applied.
Regards,
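(A hedged sketch of what that looks like in practice; the host, segment name and paths are hypothetical, and it is worth checking bin/nutch updatedb for the exact flags in your version:)

  # conf/regex-urlfilter.txt -- reject the unwanted host before the catch-all rule
  -^https?://([a-z0-9-]+\.)*someurl\.com/

  # re-apply the URL filters against the crawldb while updating it
  bin/nutch updatedb crawl/crawldb crawl/segments/20140403000000 -filter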
On Thu, Apr 3, 2014 at
Hi,
Have you checked the hadoop.log?
Hi,
Sometimes we accidentally crawl unneeded URL formats and push them all the
way into the last solrindex step.
As we know, we can drop or delete those URLs by adding a regex to
regex-urlfilter.txt and running nutch updatedb. Those URLs will then be
dropped/deleted from the crawldb database.
But how do we ensure URLs that
I just fixed the pattern with the following:
-^http://.*ccm_paging_p.*$
And put it before.
Case closed.
Thank you Tejas!
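(An aside, not from the thread: the regex URL filter evaluates its rules top to bottom and the first match wins, which is why the placement matters. A hedged sketch of the ordering, with the site's other rules omitted:)

  # conf/regex-urlfilter.txt
  # reject the paging URLs first ...
  -^http://.*ccm_paging_p.*$
  # ... and only then fall through to the usual catch-all accept rule
  +.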
On Sat, Feb 15, 2014 at 8:53 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi Tejas,
You're right!
It's my mistake! A regex-urlfilter.txt problem.
It starts when I
Hi,
From what I know, nutch generate will create a new segment directory
on every round Nutch is run.
I have a problem (it never happened before) where Nutch won't create a new
segment.
It always only fetches and parses the latest segment.
- from the logs:
2014-02-15 07:20:02,036 INFO
Yupe, thanks!
---
wassalam,
[bayu]
/sent from Android phone/
On Feb 2, 2014 10:51 PM, Tejas Patil tejas.patil...@gmail.com wrote:
On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi Tejas,
It works and it's great! :)
After reconfiguring and many times
this is verified and everything looks good from the crawling
side, run solrindex and check if you get the query results. If not, then
there was a problem while indexing the stuff.
Thanks,
Tejas
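(For reference, a hedged sketch of that solrindex step in Nutch 1.x; the Solr URL, linkdb and segment name are hypothetical, and the accepted flags vary a little between 1.x releases:)

  # push the parsed segment into Solr
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20140126000000

  # then query Solr directly to see whether documents arrived
  curl "http://localhost:8983/solr/select?q=*:*&rows=1&wt=json"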
On Sun, Jan 26, 2014 at 9:09 AM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi,
I just
=... ...
That's not a system property, because the argument -D... comes after
the class to be run. Most (if not all) Nutch tools/commands use
ToolRunner.run()
which supports generic options (among them -Dproperty=value).
Sebastian
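(A hedged illustration of the distinction, with a hypothetical JVM property and segment name; fetcher.threads.fetch is a normal Nutch configuration property:)

  # a real JVM system property has to reach the JVM itself, e.g. via NUTCH_OPTS
  export NUTCH_OPTS="-Dsome.jvm.property=foo"

  # a Hadoop/Nutch configuration property is passed after the command and is
  # picked up by ToolRunner's generic option parsing
  bin/nutch fetch -Dfetcher.threads.fetch=20 crawl/segments/20131101000000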
On 11/01/2013 12:54 AM, Bayu Widyasanyata wrote:
Hi,
One more
(see comments in bin/nutch):
NUTCH_HEAPSIZE (in MB)
NUTCH_OPTS Extra Java runtime options
export NUTCH_HEAPSIZE=2048
should work, but also
export NUTCH_OPTS=-Xmx2048m
The latter would allow adding more Java options, separated by spaces.
Sebastian
2013/10/30 Bayu
On Thu, Oct 31, 2013 at 8:43 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi Sebastian,
Thanks for the hint.
---
wassalam,
[bayu]
/sent from Android phone/
On Oct 30, 2013 7:54 PM, Sebastian Nagel wastl.na...@googlemail.com
wrote:
Hi,
the script bin/crawl executes bin/nutch
documents from solr.
On Wed, Oct 2, 2013 at 7:24 AM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi,
One of my seed URLs was changed to a new CMS, which affects its URI
presentation format.
How could I delete the old CMS format from the Solr database, so that I could
recrawl and reindex again
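(Not from the thread, just a hedged sketch of one common way to do that against Solr 4.x over its update handler; the field name and query pattern are hypothetical and depend on your schema:)

  # delete every document whose url matches the old CMS pattern, then commit
  curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>url:*old-cms-format*</query></delete>"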
Hi,
I'm sure it's an old question..
I just want to protect the Admin page (/solr) with Basic Authentication.
But I can't find a good answer out there yet.
I use Solr 4.1 with Apache Tomcat/7.0.35.
Could anyone give me some quick hints or links?
Thanks in advance!
--
wassalam,
[bayu]
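(A hedged sketch of the classic Tomcat answer, in case it is useful: a security constraint in the Solr webapp's web.xml plus a matching user in tomcat-users.xml. The role and user names are hypothetical, and the url-pattern may need narrowing for your layout:)

  <!-- solr webapp WEB-INF/web.xml -->
  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr admin</web-resource-name>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>solr-admin</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Solr</realm-name>
  </login-config>

  <!-- $CATALINA_BASE/conf/tomcat-users.xml -->
  <role rolename="solr-admin"/>
  <user username="admin" password="changeme" roles="solr-admin"/>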
Ooops.. apologies for the wrong posting here! :(
It should have gone to the solr-user group.
On Fri, Feb 8, 2013 at 2:18 AM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi,
I'm sure it's an old question..
I just want to protect the Admin page (/solr) with Basic Authentication.
But I can't find a good answer
On Tue, Jan 15, 2013 at 11:28 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Did you check the http.accept property in nutch-site.xml?
I copied it from nutch-default.xml, then added application/pdf:
<property>
<name>http.accept</name>
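(The excerpt cuts off there; for completeness, a hedged sketch of what the full property can look like with application/pdf added. The rest of the value is from memory of nutch-default.xml and may differ between versions:)

  <!-- conf/nutch-site.xml -->
  <property>
    <name>http.accept</name>
    <value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf,*/*;q=0.8</value>
  </property>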
%20Utama%20Daripada%20Dunia.pdf
-
Url
---
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
-
Metadata
-
xmp:CreatorTool : Writer
meta:author : Bayu Widyasanyata
xmpTPg:NPages : 1
dc:creator : Bayu Widyasanyata
Content-Type
On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil tejas.patil...@gmail.com wrote:
Well, if you know that the front page is updated frequently, set
db.fetch.interval.default to a lower value so that URLs will be eligible
for re-fetch sooner. By default, if a URL is fetched successfully, it
becomes
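(A hedged sketch of that setting; the example value of one day is hypothetical, and the 30-day default is from memory and worth verifying against nutch-default.xml:)

  <!-- conf/nutch-site.xml -->
  <property>
    <name>db.fetch.interval.default</name>
    <!-- seconds between re-fetches of the same URL; default 2592000 (30 days) -->
    <value>86400</value>
  </property>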
know what the correct filename of
the jar file should be.
mysql.jar, or should it be named mysql-connector-java.jar??
Which one will Nutch call/refer to?
On Tue, Jan 8, 2013 at 2:47 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi Lewis,
Thanks for the link!
On Tue, Jan 8, 2013 at 6:11 AM, Lewis John
Yes, I forgot those things even though I had already put them in my notes from
the previous installation.
I'm quite new to Nutch and also to Java development :)
Thanks!
On Fri, Jan 11, 2013 at 7:01 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
java.io.IOException: java.lang.ClassNotFoundException:
, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Yes, I forgot those things even though I had already put them in my notes from
the previous installation.
I'm quite new to Nutch and also to Java development :)
Thanks!
On Fri, Jan 11, 2013 at 7:01 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote
For clarity, the log below covers about 4 of my 5 PDF docs that can't be
parsed by Nutch.
On Fri, Jan 11, 2013 at 8:29 AM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Nutch parsing is still a problem for PDF files.
Only 1 PDF can be parsed successfully.
2013-01-11 08:11:23,679 WARN
Hi Lewis,
Thanks for the link!
On Tue, Jan 8, 2013 at 6:11 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Bayu,
On Sat, Jan 5, 2013 at 7:43 AM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Can anyone give me a hint?
In parallel I changed to use nutch 1.6 binary
?
In parallel I changed to the Nutch 1.6 binary and it works well.
But I'm curious to use the latest Nutch 2.1.
Thanks in advance!
On Sun, Dec 30, 2012 at 1:46 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi,
Thank you for suggestions.
And I tried to upgrade Tika to 1.2 as mentioned
Problem fixed :)
Many thanks!
On Sun, Jan 6, 2013 at 9:15 AM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
I think this was the problem, in my nutch-site.xml:
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
even though it's deprecated.
OK, I
, so unless there is something peculiar
with all your files or setup, have you tried:
- checking the size of the files to see if they are over the configured limits
- using the nutch parsechecker command to test individual files
Cheers,
Dave
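(A hedged sketch of both checks; the URL is hypothetical, and which content limit applies depends on the protocol used for fetching:)

  # test parsing of a single document and dump the extracted text
  bin/nutch parsechecker -dumpText http://localhost/sapi/some-document.pdf

  # for the size check, http.content.limit in nutch-site.xml is the usual
  # suspect (default 65536 bytes; -1 disables the limit)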
On 25 Dec 2012, at 01:34, Bayu Widyasanyata bwidyasany
Hi All,
I'm new to Nutch and Solr, with the following platforms:
- nutch 2.1
- solr 4.0
- jdk 1.7 on ubuntu 10.04
I'm also one of the users of the legendary Nutch implementation with
MySQL at http://nlp.solutions.asia/?p=180 ;-)
I have installed all of the above successfully with some minor
corrections
#Portable_Document_Format
Thanks,
On Tue, Dec 25, 2012 at 7:16 AM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
Hi All,
I'm new to Nutch and Solr, with the following platforms:
- nutch 2.1
- solr 4.0
- jdk 1.7 on ubuntu 10.04
I'm also one of the users of the legendary Nutch implementation with
MySQL