Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki

MilleBii wrote:

HLPPP !!!

Stuck for 3 days on not able to start any nutch job.

hdfs works fine, i.e. I can put & look at files.
When i start nutch crawl, I get the following error

Job initialization failed:
java.lang.IllegalArgumentException: Pathname
/d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls

It is looking for the file at a wrong location. Indeed, in my case the
correct location is /d:/Bii/nutch/logs/history, so why is
*history/user/_logs* added and how can I fix that?

2009/6/21 MilleBii mille...@gmail.com


Looks like I just needed to transfer from the local filesystem to hdfs:
Is it safe to transfer a crawl directory (and subs) from the local file
system to hdfs and start crawling again ?

1. hadoop fs -put crawl crawl
2. nutch generate crawl/crawldb crawl/segments -topN 500 (where now it
should use the hdfs)

-MilleBii-

2009/6/21 MilleBii mille...@gmail.com

 I have newly installed Hadoop in a single-node distributed configuration.

When I run nutch commands it is looking for files in my user home directory
and not in the nutch directory. How can I change this?
How can I change this ?


I suspect your hadoop-site.xml uses a relative path somewhere, and not an 
absolute path (with a leading slash). Also, /d: looks suspiciously like a 
Windows pathname, in which case you should either use a full URI 
(file:///d:/) or just the disk name d:/ without the leading slash. 
Please also note that if you are running this on Windows under Cygwin, 
then in your config files you MUST NOT use Cygwin paths (like 
/cygdrive/d/...) because Java can't see them.
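
For example, an entry in hadoop-site.xml could look roughly like this (the 
property name and path below are only an illustration - check your own 
config for the offending relative path):

<property>
  <name>hadoop.tmp.dir</name>
  <value>d:/Bii/hadoop/tmp</value>
</property>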



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki

MilleBii wrote:

What I have also discovered:
+ the hadoop script works with Unix-like paths and works fine on Windows
+ the nutch script works with Windows paths


bin/nutch works with Windows paths? I think this could happen only by 
accident - both scripts work with Cygwin paths. On the other hand, 
arguments passed to the JVM must be regular Windows paths.




Could it be that there is some incompatibility because one works with
Unix-like paths and the other doesn't?


Both scripts work fine for me on Windows XP + Cygwin, without any 
special settings - I suspect there is something strange in your 
environment or config...


Please note that Hadoop and Nutch scripts are regular shell scripts, so 
they are aware of Cygwin path conventions, in fact they don't accept 
un-escaped Windows paths as arguments (i.e. you need to use forward 
slashes, or you need to put double quotes around a Windows path).
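
For example (the paths here are just illustrations):

bin/nutch readdb d:/Bii/nutch/crawl/crawldb -stats
bin/nutch readdb "d:\Bii\nutch\crawl\crawldb" -stats

The first form uses forward slashes, the second puts double quotes around a 
backslash path so the shell doesn't mangle it.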



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[ANN] Luke + Hadoop, alpha version

2009-07-10 Thread Andrzej Bialecki

Hi all,

I prepared a special edition of Luke, the Lucene Index Toolbox, that
works with Lucene indexes located on any filesystem supported by Hadoop
0.19.1.

At the moment I'm looking for feedback on how to best integrate this
functionality with the various bits and pieces of Luke. You can download the
jar file from a direct link:

http://www.getopt.org/luke/lukeall-0.9.3.jar

This JAR contains all dependencies needed to connect to HDFS, KFS or
S3/S3n filesystems, although I tested it only with HDFS so far.

Note: this version of Luke still uses Lucene 2.4.1; I haven't started
integrating 2.9-dev yet.

Quick info for the impatient: yes, you can browse the content, view
terms and documents, perform searching, explaining, etc. See below for
more details.

The initial Open dialog is not integrated with this functionality yet.
After you start Luke, you need to dismiss this dialog, go to Plugins /
Hadoop Plugin, enter the full URI of the index in the text field, and
then press the Open button. There is no filesystem browsing for now -
you need to know the full URI in advance.
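
As an illustration, such a URI might look like
hdfs://namenode.example.com:9000/user/crawler/indexes - the host, port and
path are placeholders, use whatever matches your fs.default.name and your
index location.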

Current functionality is as follows:

- you can open a single index or partial (sharded) indexes located in
part-N/ subdirectories (this is a typical layout resulting from
using common map-reduce output formats). In the latter case you will get
a single view of partial indexes, thanks to MultiReader.

- access is read-only - most FileSystem-s don't support file updates, so
it was easiest to disable write access altogether for now.

- most of Luke functionality works properly, thanks to the excellent
design of IndexReader API. Some operations are disabled due to read-only
access, some other information (like top terms) is not populated by
default due to a high IO cost, but can be requested explicitly.

- the plugin keeps track of the amount of IO reads - I found this very
comforting when opening large indexes over a slow VPN line ... There is
a Clear button on the plugin's tab that resets the counters - this is
useful to see how much IO is needed to complete a specific operation.

- a lot of code has been reworked to avoid UI stalls when doing slow IO,
which means that you can see the amount of IO being done, but the UI is
blocked with a modal dialog. It's a bit unwieldy, but other solutions
would require too much refactoring.

Any feedback is welcome - please keep in mind that this is an early
preview. Also, various UI glitches are probably related to the Thinlet
toolkit - again, one day I may re-write Luke using something else, but
for now I don't have the strength to do it.  :)




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2009-07-12 Thread Andrzej Bialecki

lei wang wrote:

anyone help? so disappointed.

On Fri, Jul 10, 2009 at 4:29 PM, lei wang nutchmaill...@gmail.com wrote:


Yes, I am also running into this problem. Can anyone help?


On Sun, Jul 5, 2009 at 11:33 PM, xiao yang yangxiao9...@gmail.com wrote:


I often get this error message while crawling the intranet.
Is it a network problem? What can I do about it?

$bin/nutch crawl urls -dir crawl -depth 3 -topN 4

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 4
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090705212324
Generator: filtering: true
Generator: topN: 4
Generator: Partitioning selected urls by host, for politeness.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
   at org.apache.nutch.crawl.Generator.generate(Generator.java:524)
   at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:116)







If you are running a large crawl on a single machine, you could be 
running out of file descriptors - please check ulimit -n, the value 
should be much much larger than 1024.
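
For example:

ulimit -n          # show the current limit, often 1024
ulimit -n 65536    # raise it for the current shell; edit /etc/security/limits.conf to make it permanent

(the value 65536 is just an example.)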


Also, please check the hadoop.log for clues why shuffle fetching failed 
- this could be something as trivial as a blocked port, a routing problem, 
a DNS resolution problem, or the problem I mentioned above.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Why cant I inject a google link to the database?

2009-07-17 Thread Andrzej Bialecki

Brian Ulicny wrote:

1. Save the results page.
2. Grep the links out of it.
3. Put the results in a doc in your urls directory
4. Do: bin/nutch crawl urls 


Please note, we are not saying that it is impossible to do this with Nutch 
(e.g. by setting the agent string to mimic a browser), but we do insist 
that it's RUDE to do this.


Anyway, Google monitors such attempts, and after you issue too many 
requests your IP will be blocked for a while - so whether you go 
the polite or the impolite way, you won't be able to do this.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch -threads in hadoop

2009-07-23 Thread Andrzej Bialecki

Brian Tingle wrote:

Hey,

 

I'm playing around with Nutch on Hadoop; when I run 


hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl -threads ... is
that threads per node or total threads for all nodes?


Threads per map task - if you run multiple map tasks per node then you 
will get numThreads * numMapTasks per node.


So be careful to set it to a number that doesn't overwhelm your network ;)

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch -threads in hadoop

2009-07-24 Thread Andrzej Bialecki

Brian Tingle wrote:

Thanks, I eventually found where the job trackers were in the :50030 web
page of the cloudera thing, and I saw it said 10 threads for each
crawler in the little status update box where it was telling me how far
along each crawl was.  I have to say, this whole thing (nutch/hadoop) is
pretty flipping awesome.  Great work.

I'm running on aws EC2 us-east and spidering sites that should be hosted
on the CENIC network in California, do you have any suggestions on what
a good number of threads to try per crawler might be in that situation
(I'm guessing it might be hard to saturate the bandwidth)?  I'm thinking
I'll bump it up to at least 25.


You need to be careful when running large crawls on someone else's 
infrastructure. While the raw bandwidth may be enough, the DNS infrastructure 
may be insufficient - both on the side of the target domains and on the 
local resolver. I strongly recommend setting up a local caching DNS.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Gracefull stop in the middle of a fetch phase ?

2009-07-25 Thread Andrzej Bialecki

Alex McLintock wrote:

I am not sure if it solves your problem, but you might do something
like disconnecting your machines from the internet - preferably by making
your DNS server return "don't know that domain".

This will relatively quickly cause the remaining part of the fetch to fail.

Just a suggestion...


I solved this once by implementing a check in Fetcher.run() for a marker 
file on HDFS. If the presence of this file was detected, the 
FetcherThreads would be stopped one by one (again, by setting a flag in 
their run() methods to terminate the loop).


It's a hack but it works well.
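
A minimal sketch of the idea, assuming Nutch 1.0-era Hadoop APIs (the marker 
path and the helper method name are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// polled periodically from the fetcher loop; returns true once the marker file appears
boolean stopRequested(Configuration conf) throws IOException {
  FileSystem fs = FileSystem.get(conf);
  Path marker = new Path("/user/crawler/STOP_FETCH");  // hypothetical marker location
  return fs.exists(marker);
}

When it returns true, each FetcherThread sets its own stop flag and exits its 
run() loop after finishing the current URL.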

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Host specific parsing

2009-07-28 Thread Andrzej Bialecki

Koch Martina wrote:

Hi,

has anyone built a parsing plugin which decides on a per host basis how the 
content of the document should be parsed?

For example, if the title of a document is in the first h1-tag of a page for host1,
but the title for a document of host2 is in the third h2-tag, the plugin would
extract the title differently depending on the host.

In my opinion something like a dispatcher plugin would be needed:

-  Identify host of a document

-  Read and cache instructions on how to get the information for that 
host (database or config file)

-  Execute host-specific plugin

Do you have any suggestions on how to implement such a scenario efficiently? 
Has anyone implemented something similar and can point out possible 
performance issues or other critical issues to be considered?


Yes, and yes. With the current plugin system you can create a new 
dispatcher plugin, and then add other necessary plugins as import 
elements. This way they will be accessible from the same classloader, so 
that you can instantiate them directly in your dispatcher plugin.
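
For example, the dispatcher's plugin.xml might declare (the plugin ids below 
are hypothetical):

<requires>
   <import plugin="nutch-extensionpoints"/>
   <import plugin="parse-host1"/>
   <import plugin="parse-host2"/>
</requires>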


As for the lookup ... many solutions are possible. DB connections from 
map tasks may be problematic, both because of latency and the cost of 
setting up so many DB connections. OTOH, if you add local caching (using 
JCS or Ehcache) the hit/miss ratio should be decent enough. If the 
mapping of host names to plugins can be expressed by rules then maybe a 
simple rule set would be enough.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Meaning of ProtocolStatus.ACCESS_DENIED

2009-08-03 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

I don't know of an elegant way, but if you want to hack the Nutch
sources, you could set its refetch time to some point in time
very far in the future, for example. Or introduce an additional
status.


This won't work, because the pages will be checked again after a 
maximum.fetch.interval.


Pages that return ACCESS_DENIED may do so only for some time, so Nutch 
needs to check their status periodically. In a sense, no page is ever 
truly GONE, if only for the reason that we somehow need to represent 
nonexistent targets of stale links - if we removed these URLs from the 
db they would soon be rediscovered and added again.


The gory details of maximum.fetch.interval follow... Nutch periodically 
checks the status of all pages in CrawlDb, no matter what their state, 
including GONE, ACCESS_DENIED, ROBOTS_DENIED, etc. If you use some 
adaptive re-fetch strategy (AdaptiveFetchSchedule) then the re-fetch 
interval will reach the maximum value within a few cycles, so the checking 
won't occur too often. You may be tempted to set this to infinity, i.e. 
to never check these URLs again. However, the purpose of having a 
specific value for the maximum refetch interval is to be able to phase out 
old segments, so that you can be sure you can delete old segments 
after N days, because all their pages are guaranteed to have been scheduled 
for refetching and will be found in a newer segment.
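
If memory serves, the relevant settings in Nutch 1.0 are db.fetch.schedule.class 
and db.fetch.interval.max - please verify the exact names and defaults in 
nutch-default.xml. A sketch for nutch-site.xml:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value> <!-- 90 days, in seconds; just an example -->
</property>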


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch updatedb Crash

2009-08-16 Thread Andrzej Bialecki

MoD wrote:

Julien,

I did try with 2048M per task child;
no luck, I still have two reduces that don't go through.

Is it somehow related to the number of reduces?
On this cluster I have 4 servers:
- dual Xeon dual core (8 cores)
- 8 GB RAM
- 4 disks

I did set mapred.reduce.tasks and mapred.map.tasks to 16,
because: 4 servers with 4 disks each (what do you think?).

Maybe this job is too big for my cluster - would adding reduce tasks
subdivide the problem into smaller reduces?
Actually I think not, because I guess the input keys are per domain?

So are my two last reduce tasks the biggest domains of my DB?


This is likely caused by a large number of inlinks for certain urls - 
the updatedb reduce collects this list in memory, and this sometimes 
leads to memory exhaustion. Please try limiting the max. number of 
inlinks per url (see nutch-default.xml for details).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch.SIGNATURE_KEY

2009-08-22 Thread Andrzej Bialecki

Paul Tomblin wrote:

On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler kkrugler_li...@transpac.com wrote:

Another question: is Nutch smart enough to use that signature to
determine that, say, http://xcski.com/ and http://xcski.com/index.html
are the same page?

I believe the hashes would be the same for either raw MD5 or text signature,
yes. So on the search side these would get collapsed. Don't know about what
else you mean as far as same page - e.g. one entry in the CrawlDB? If so,
then somebody else with more up-to-date knowledge of Nutch would need to
chime in here. Older versions of Nutch would still have these as separate
entries, FWIR.


Actually, I just checked some of my own pages, and http://xcski.com/
and http://xcski.com/index.html have different signatures, in spite of
them being the same page.  So I guess the answer to that is no, even
if there were logic to make them the same page in CrawlDB, it wouldn't
work.


There is nothing magic about the process of calculating a signature - 
eg. MD5Signature just takes Content.getContent() (array of bytes) and 
runs it through MD5. So if you get different MD5 values, then your 
content was indeed different (even if it was only an advertisement link 
somewhere on the page).


You could use a urlnormalizer to collapse www.example.com/ and 
www.example.com/index.html into a single entry; in fact there is a 
commented-out rule like that in the urlnormalizer config file. But as you 
observed above, there may be cases when these two are not really the 
same page, so you need to be careful ...
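
For the record, a rule in conf/regex-normalize.xml would look roughly like 
this (the pattern is illustrative, not the exact commented-out rule):

<regex>
  <pattern>/index\.html$</pattern>
  <substitution>/</substitution>
</regex>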



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Andrzej Bialecki

BELLINI ADAM wrote:


Me again,

I forgot to tell you the easiest way...

Once the crawl is finished you can dump the whole db (it contains all the links 
to your html pages) into a text file:

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and then you can perform the wget on this db and archive the files.


I'd argue with this advice. The goal here is to obtain the HTML pages. 
If you have crawled them, then why do it again? You already have their 
content locally.


However, page content is NOT stored in crawldb, it's stored in segments. 
So you need to dump the content from segments, and not the content of 
crawldb.


The command 'bin/nutch readseg -dump <segmentName> <output>' should do 
the trick.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: how to upgrade a java application with nutch?

2009-10-01 Thread Andrzej Bialecki

Jaime Martín wrote:

Hi!
I have a Java application that I would like to upgrade with Nutch. What jars
should I add to my application's lib to make it possible to use Nutch
features from some of my app pages and business logic classes?
I've tried with the nutch-1.0.jar generated by the war target, without success.
I wonder what is the proper Nutch build.xml target I should execute for this,
and which of the generated jars are to be included in my app. Apart from
nutch-1.0.jar, are all the nutch-1.0\lib jars compulsory, or just a few of
them?
thanks in advance!



Nutch is not designed for embedding in other applications, so you may 
face numerous problems. I did such an integration once, and it was far 
from obvious. A lot also depends on whether you want to run it on a 
distributed cluster or in a single JVM (local mode).


Take a look at build/nutch*.job, it's a jar file that contains all 
dependencies needed to run Nutch except for Hadoop libraries (which are 
also required).
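
If I remember the build targets correctly, 'ant job' produces that file:

ant job    # creates build/nutch-*.job with the bundled dependencies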


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch randomly skipping locations during crawl

2009-10-01 Thread Andrzej Bialecki

tsmori wrote:

This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each user has a location for
information. The URLs take the form of:

http://mydomain.edu/staff/userid

I've added the staff URL to the urls seed file. But even with a crawl set to
depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems
to only fetch about 50% of the locations in this area of the site. 


What should I look for to find out why this is happening?




* Check that the pages there are not forbidden by robot rules (which may 
be embedded inside HTML meta tags of index.html, or the top-level 
robots.txt).


* check that your crawldb actually contains entries for these pages - 
perhaps they are being filtered out.


* check your segments whether these URLs were scheduled for fetching, 
and if so, then what was the status of fetching.
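
For example (the URL and the segment name below are placeholders):

bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/userid    # status of one entry
bin/nutch readseg -list crawl/segments/20091001120102                   # per-segment statistics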



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: R: Using Nutch for only retriving HTML

2009-10-01 Thread Andrzej Bialecki

BELLINI ADAM wrote:

Hi,
but how do I dump the content? I tried this command:



./bin/nutch readseg -dump crawl/segments/20090903121951/content/  toto

and it said :

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
file:/usr/local/nutch-1.0/crawl/segments/20091001120102/crawl_generate
  


but the crawl_generate is in this path :

/usr/local/nutch-1.0/crawl/segments/20091001120102

and not in this one :

/usr/local/nutch-1.0/crawl/segments/20091001120102/content

Can you please just give me the correct command?


This command will dump just the content part:

./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch 
-nogenerate -noparse -noparsedata -noparsetext


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch randomly skipping locations during crawl

2009-10-01 Thread Andrzej Bialecki

tsmori wrote:

Both good ideas. Unfortunately, the content for each user is the same. It's a
static php file that simply calls information out of our LDAP.

It's very strange because I cannot see any difference between the user
files/directories that are fetched and those that aren't. In checking both
the crawl log and the hadoop log, the missing users are not even fetched. 


Check the segment's crawl_generate and crawl_fetch, and also check your 
crawldb for status. Logs don't always contain this information.
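
For example, to dump just those two parts of a segment (the segment name is 
a placeholder):

bin/nutch readseg -dump crawl/segments/20091001120102 check_out -nocontent -noparse -noparsedata -noparsetext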



The issue seems to be that they're not fetched and there's no indication in
the logs why they aren't.


See above.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links for Crawling

2009-10-05 Thread Andrzej Bialecki

Eric wrote:
Does anyone know if it is possible to target only certain links for 
crawling dynamically during a crawl? My goal would be to write a plugin 
for this functionality, but I don't know where to start.


URLFilter plugins may be what you want.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental Whole Web Crawling

2009-10-05 Thread Andrzej Bialecki

Eric wrote:
My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can 
crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's 
then crawl the links generated from the TLD's in increments of 100K?


Yes. Make sure that you have the generate.update.db property set to 
true, and then generate 16 segments, each having 100k urls. After you 
finish generating them, you can start fetching.
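
A sketch of this (the property name is as given above - double-check it 
against your nutch-default.xml - and the topN value is just the 100k from 
your plan). In nutch-site.xml:

<property>
  <name>generate.update.db</name>
  <value>true</value>
</property>

and then, in a shell:

for i in `seq 1 16`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
done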


Similarly, you can do the same for the next level, only you will have to 
generate more segments.


This could be done much more simply with a modified Generator that outputs 
multiple segments from one job, but that's not implemented yet.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental Whole Web Crawling

2009-10-05 Thread Andrzej Bialecki

Eric wrote:

Andrzej,

Just to make sure I have this straight, set the generate.update.db 
property to true then


bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16 times?


Yes. When this property is set to true, each fetchlist will be 
different, because the records for those pages that are already on 
another fetchlist will be temporarily locked. Please note that this lock 
holds only for 1 week, so you need to fetch all segments within one week 
of generating them.


You can fetch and updatedb in arbitrary order, so once you have fetched some 
segments you can run the parsing and updatedb just on these segments, 
without waiting for all 16 segments to be processed.
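
So a per-segment cycle might look roughly like this (the segment name is a 
placeholder; skip the parse step if your fetcher is configured to parse):

s=crawl/segments/20091005123456
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s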



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-06 Thread Andrzej Bialecki

Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per page 
and at that point choose which links I want to include / exclude? 
That would be the ideal remedy for my problem.


Yes, look at ParseOutputFormat, you can make this decision there. There 
are two standard extension points where you can hook up - URLFilters and 
ScoringFilters.


Please note that if you use URLFilters to filter out URL-s too early 
then they will be rediscovered again and again. A better method to 
handle this, but also more complicated, is to still include such links 
but give them a special flag (in metadata) that prevents fetching. This 
requires that you implement a custom scoring plugin.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-07 Thread Andrzej Bialecki

Eric Osgood wrote:

Andrzej,

How would I check for a flag during fetch?


You would check for a flag during generation - please check 
ScoringFilter.generatorSortValue(), that's where you can check for a 
flag and set the sort value to Float.MIN_VALUE - this way the link will 
never be selected for fetching.


And you would put the flag in CrawlDatum metadata when ParseOutputFormat 
calls ScoringFilter.distributeScoreToOutlinks().




Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page while 
still needing a total of X links per page: if I find the links I want, I 
add them to the list up until X; if I don't reach X, I add other links 
until X is reached. This way, I don't waste crawl time on non-relevant 
links.


You can modify the collection of target links passed to 
distributeScoreToOutlinks() - this way you can affect both which links 
are stored and what kind of metadata each of them gets.


As I said, you can also use just plain URLFilters to filter out unwanted 
links, but that API gives you much less control because it's a simple 
yes/no decision that considers just the URL string. The advantage is that 
it's much easier to implement than a ScoringFilter.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: indexing just certain content

2009-10-09 Thread Andrzej Bialecki

BELLINI ADAM wrote:

Hi,

thanks for your detailed answer... you saved me a lot of time, I was thinking 
of starting to create an HTML tag filter class.
Maybe I can create my own HTML parser, as I do for parsing and indexing 
DublinCore metadata... it sounds possible, don't you think?

I just have to create, or find, a class which could filter an HTML page 
and delete certain tags from it.


Guys, please take a look at how HtmlParseFilters are implemented - for 
example the creativecommons plugin. I believe that's exactly the 
functionality that you are looking for.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: indexing just certain content

2009-10-10 Thread Andrzej Bialecki

MilleBii wrote:

Andrzej,

The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.

I have a different use case: I want to keep everything as standard indexing
_AND_ also extract a part to be indexed in a dedicated field (which will
be boosted at search time). In a document, certain parts have more importance
than others in my case.

So I would like either
1. to access the html representation at indexing time... not possible, or I
did not find how
2. to create a dual representation of the document: plain/standard and filtered

I think option 2 is much better because it better fits the model and allows
for a lot of different other use cases.


Actually, creativecommons provides hints on how to do this... but to be 
more explicit:


* in your HtmlParseFilter you need to extract from the DOM tree the parts 
that you want, and put them inside ParseData.metadata. This way you will 
preserve both the original text and your special parts that you extracted.


* in your IndexingFilter you will retrieve the parts from 
ParseData.metadata and add them as additional index fields (don't forget 
to specify indexing backend options).


* in your QueryFilter plugin.xml you declare that QueryParser should 
pass your special fields without treating them as terms, and in the 
implementation you create a BooleanClause to be added to the translated 
query.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki

winz wrote:


Venkateshprasanna wrote:

Hi,

You can very well think of doing that if you know that you would crawl and
index only a selected set of web pages, which follow the same design.
Otherwise, it would turn out to be a never ending process - i.e., finding
out the sections, frames, divs, spans, css classes and the likes - from
each of the web pages. Scalability would obviously be an issue.



Hi,
Could I please know how we can ignore template items like header, footer and
menu/navigations while crawling and indexing pages which follow the same
design??
I'm using a content management system called Infoglue to develop my website.
A standard template is applied for all the pages on the website.

The search results from Nutch shows content from menu/navigation bar
multiple times.
I need to get rid of menu/navigation content from the search result.


If all you index is this particular site, then you know the positions of 
navigation items, right? Then you can remove these elements in your 
HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these 
elements.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki

BELLINI ADAM wrote:

Hi guys, this is just what I'm talking about in my post 'indexing
just certain content'... you can read it, maybe it could help you... I
was asking how to get rid of the garbage sections in a document and
to parse only the important data... so I guess you will create your
own parser and indexer... but the problem is how we could delete those
garbage sections from the HTML... try to read my post... maybe we can
merge our two posts... I don't know if we can merge posts on this
mailing list... to keep tracking only one post...


What is garbage? Can you define it in terms of a regex pattern or an XPath 
expression that points to specific elements in the DOM tree? If you crawl a 
single site (or a few) with well-defined templates then you can hardcode 
some rules for removing unwanted parts of the page.


If you can't do this, then there are some heuristic methods to solve 
this. There are two groups of methods:


* page at a time (local): this group of methods considers only the 
current page that you analyze. The quality of filtering is usually limited.


* groups of pages (e.g. per site): these methods consider many pages at 
a time, and try to find a recurring theme among them. Since you first need 
to accumulate some pages it can't be done on the fly, i.e. it requires 
a separate post-processing step.


The easiest to implement in Nutch is the first approach (page at a 
time). There are many possible implementations - e.g. based on text 
patterns, on visual position of elements, on DOM tree patterns, on 
block of content characteristics, etc.


Here's for example a simple method:

* collect the text from the page in blocks, where each block fits within 
structural tags (div and table tags). Collect also the number of 
links in each block.


* remove a percentage of the smallest blocks where the link count is high 
- these are likely navigational elements.


* reconstruct the whole page from the remaining blocks.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental Whole Web Crawling

2009-10-13 Thread Andrzej Bialecki

Eric Osgood wrote:
Ok, I think I am on the right track now, but just to be sure: the code I 
want is the branch section of svn under nutchbase at 
http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct?


No, you need the trunk from here:

http://svn.apache.org/repos/asf/lucene/nutch/trunk


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental Whole Web Crawling

2009-10-13 Thread Andrzej Bialecki

Eric Osgood wrote:

So the trunk contains the most recent nightly update?


It's the other way around - a nightly build is created from a snapshot of 
the trunk. The trunk is always the most recent.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: http keep alive

2009-10-14 Thread Andrzej Bialecki

Marko Bauhardt wrote:

Hi,
is there a way to use HTTP keep-alive with Nutch?
Does protocol-http or protocol-httpclient support keep-alive?

I can't find any use of http-keep-alive inside the code or in the 
configuration files.


protocol-httpclient can support keep-alive. However, I think that it 
won't help you much. Please consider that the Fetcher needs to wait some 
time between requests, and in the meantime it will issue requests to 
other sites. This means that if you want to use keep-alive connections 
then the number of open connections will climb up quickly, depending on 
the number of unique sites on your fetchlist, until you run out of 
available sockets. On the other hand, if the number of unique sites is 
small, then most of the time the Fetcher will wait anyway, so the 
benefit from keep-alives (for you as a client) will be small - though 
there will still be some benefit for the server side.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Enterprise

2009-10-17 Thread Andrzej Bialecki

Dennis Kubes wrote:
Depending on what you are wanting to do, Solr may be a better choice as 
an Enterprise search server. If you need crawling you can use 
Nutch or attach a different crawler to Solr. If you want to do 
more full-web type search, then Nutch is a better option. What are your 
requirements?


Dennis

fredericoagent wrote:
Does anybody have any information on using Nutch as Enterprise search?
And what would I need?
Is it just a case of using the current nutch package, or do you need other 
add-ons?


And how does that compare against Google Enterprise ?

thanks


I agree with Dennis - use Nutch if you need to do a larger-scale 
discovery such as when you crawl the web, but if you already know all 
target pages in advance then Solr will be a much better (and much easier 
to handle) platform.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException

2009-10-17 Thread Andrzej Bialecki

Jesse Hires wrote:

Does anyone have any insight into the following error I am seeing in the
hadoop logs? Is this something I should be concerned with, or is it expected
that this shows up in the logs from time to time? If it is not expected,
where can I look for more information on what is going on?

2009-10-16 17:02:43,061 ERROR datanode.DataNode -
DatanodeRegistration(192.168.1.7:50010,
storageID=DS-1226842861-192.168.1.7-50010-1254609174303,
infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
Block blk_90983736382565_3277 is valid, and cannot be written to.


Are you sure you are running a single datanode process per machine?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to run a complete crawl?

2009-10-17 Thread Andrzej Bialecki

Vincent155 wrote:

I have a virtual machine running (VMware 1.0.7). Both host and guest run
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they are websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
crawl urls -dir crawl.test -depth 3 -topN 2500, or leave away the
topN statement, there are still only some 50 to 75 files indexed.


Check in your nutch-site.xml what the value of 
db.max.outlinks.per.page is; the default is 100. When crawling filesystems 
each file in a directory is treated as an outlink, and this limit is 
then applied.
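
For example, to raise the limit in nutch-site.xml (the value is just an 
illustration):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>5000</value>
</property>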


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread Andrzej Bialecki

malcolm smith wrote:

I am looking to create a parser for a groupware product that would read
pages from a message board type web site (think phpBB). But rather than
creating a single Content item which is parsed and indexed to a single Lucene
document, I am planning to have the parser create a master document (for the
original post) and an additional document for each reply item.

I've reviewed the code for protocol plugins, parser plugins and indexing
plugins but each interface allows for a single document or content object to
be passed around.

Am I missing something simple?

My best bet at the moment is to implement some kind of new fake protocol
for the reply items: then I would use the http client plugin for the first
request to the page and generate outlinks like
fakereplyto://originalurl/reply1, fakereplyto://originalurl/reply2 to go
back through and fetch the sub-page content. But this seems roundabout and
would probably generate an http request for each reply on the original
page. But perhaps there is a way to look up the original page in the segment
db before requesting it again.

Needless to say it would seem more straightforward to tackle this in some
kind of parser plugin that could break the original page into pieces that
are treated as standalone pages for indexing purposes.

Last but not least conceptually a plugin for the indexer might be able to
take a set of custom meta data for a replies collection and index it as
separate lucene documents - but I can't find a way to do this given the
interfaces in the indexer plugins.

Thanks in advance
Malcolm Smith


What version of Nutch are you using? This should already be possible 
using the 1.0 release or a nightly build. ParseResult (which is what 
parsers produce) can hold multiple Parse objects, each with its own URL.


The common approach to handle whole-part relationships (like zip/tar 
archives, RSS, and other compound docs) is to split them in the parser 
and parse each part, then give each sub-document its own URL (e.g 
file.tar!myfile.txt) and add the original URL in the metadata, to keep 
track of the parent URL. The rest should be handled automatically, 
although there are some other complications that need to be handled as 
well (e.g. don't recrawl sub-documents).




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Andrzej Bialecki

Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than 400K 
files at a time using a 4 node hadoop cluster running nutch 1.0.


org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
create file 
/user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index 
for DFSClient_attempt_200910131302_0011_r_15_2 on client 
192.168.1.201 because current leaseholder is trying to recreate file.


Please see this issue:

https://issues.apache.org/jira/browse/NUTCH-692

Apply the patch that is attached there, rebuild Nutch, and tell me if 
this fixes your problem.


(the patch will be applied to trunk anyway, since others confirmed that 
it fixes this issue).




Can anybody shed some light on this issue? I was under the impression 
that 400K was small potatoes for a nutch hadoop combo?


It is. This problem is rare - I think I crawled cumulatively ~500mln 
pages in various configs and it didn't occur to me personally. It 
requires a few things to go wrong (see the issue comments).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Accessing an Index from a shared location

2009-10-21 Thread Andrzej Bialecki

JusteAvantToi wrote:

Hi all,

I am new to using Nutch and I found that Nutch is really good. I have a
problem and hope somebody can shed some light. 


I have built an index and a web application that makes use of that index. I
plan to have two web application servers running the application. Since I do
not want to replicate the application and the index  on each web application
server, I put the application and the index on a shared location and
configure nutch-site.xml as follow:

<property>
  <name>searcher.dir</name>
  <value>\\111.111.111.111\folder\index</value>
  <description>Path to root of crawl</description>
</property>

<property>
  <name>plugin.folders</name>
  <value>\\111.111.111.111\folder\plugins</value>
  <description></description>
</property>

However it seems that my application can not find the index. I have checked
that the web application server have access to the shared location. 


Is there something that I missed here? Does Nutch allow us to put the index
on a network location?


UNC paths are not supported in Java - you need to mount this location as 
a local volume.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-23 Thread Andrzej Bialecki

Eric Osgood wrote:

Andrzej,

Based on what you suggested below, I have begun to write my own scoring 
plugin:


Great!



in distributeScoreToOutlinks(), if the link contains the string I'm 
looking for, I set its score to kept_score and add a flag to the 
metaData in parseData (KEEP, true). How do I check for this flag in 
generatorSortValue()? I only see a way to check the score, not a flag.


The flag should have been automagically added to the target CrawlDatum 
metadata after you have updated your crawldb (see the details in 
CrawlDbReducer). Then in generatorSortValue() you can check for the 
presence of this flag by using the datum.getMetaData().


BTW - you are right, the Generator doesn't treat Float.MIN_VALUE in any 
special way ... I thought it did. It's easy to add this, though - in 
Generator.java:161 just add this:


if (sort == Float.MIN_VALUE) {
  return;
}
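
Putting it together, a rough sketch of the generator-side check (the "KEEP" 
key is whatever name you used when setting the flag; the signature is the 
1.0 ScoringFilter one, so verify it against your sources):

public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
    throws ScoringFilterException {
  if (datum.getMetaData().containsKey(new Text("KEEP"))) {
    return initSort;        // flagged links keep their normal sort value
  }
  return Float.MIN_VALUE;   // everything else is skipped (with the Generator change above)
}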


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Deleting stale URLs from Nutch/Solr

2009-10-26 Thread Andrzej Bialecki

Gora Mohanty wrote:

Hi,

  We are using Nutch to crawl an internal site, and index content
to Solr. The issue is that the site is run through a CMS, and
occasionally pages are deleted, so that the corresponding URLs
become invalid. Is there any way that Nutch can discover stale
URLs during recrawls, or is the only solution a completely fresh
crawl? Also, is it possible to have Nutch automatically remove
such stale content from Solr?

  I am stumped by this problem, and would appreciate any pointers,
or even thoughts on this.


Hi,

Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are 
kept in Nutch crawldb to prevent their re-discovery (through stale links 
pointing to these URL-s from other pages). If you really want to remove 
them from CrawlDb you can filter them out (using CrawlDbMerger with just 
 one input db, and setting your URLFilters appropriately).


Now when it comes to removing them from Solr ... The simplest (no 
coding) way would be to dump the CrawlDb, use some scripting tools to 
collect just the URL-s with the status GONE, and send them as a delete 
command to Solr. A slightly more involved solution would be to implement 
a tool that reads such URLs directly from CrawlDb (using e.g. 
CrawlDbReader API) and then uses SolrJ API to send the same delete 
requests + commit.
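
A rough sketch of the no-coding route (the record layout of the dump may 
differ between versions, so adjust the scripting accordingly):

bin/nutch readdb crawl/crawldb -dump dbdump    # plain-text dump of all CrawlDb records
# grep the dump for db_gone records, extract their URLs, then send them to
# Solr as <delete><id>...</id></delete> requests followed by a <commit/>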




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Andrzej Bialecki

Gora Mohanty wrote:

On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]

Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
They are kept in Nutch crawldb to prevent their re-discovery
(through stale links pointing to these URL-s from other pages).
If you really want to remove them from CrawlDb you can filter
them out (using CrawlDbMerger with just one input db, and setting
your URLFilters appropriately).

[...]

Thank you for your help. Your suggestions look promising, but I
think that I did not make myself adequately clear. Once we have
completed a site crawl with Nutch, ideally I would like to be
able to find stale links without doing a complete recrawl, i.e.,
only through restarting the crawl from where it last left off. Is
that possible?

I tried a simple test on a local webserver with five pages in a
three-level hierarchy. The crawl completes, and discovers all
five URLs as expected. Now, I remove a tertiary page. Ideally,
I would like to be able run a recrawl, and have Nutch dicover
the now-missing URL. However, when I try that, it finds no new
links, and exits.


I assume you mean that the generate step produces no new URL-s to 
fetch? That's expected, because they become eligible for re-fetching 
only after Nutch considers them expired, i.e. after the fetchTime + 
fetchInterval, and the default fetchInterval is 30 days.


You can pretend that time has moved on using the -adddays parameter. 
Then Nutch will generate a new fetchlist, and when it discovers that the 
page is missing it will mark it as gone - actually, you could then take 
that information directly from the Nutch segment and instead of 
processing the CrawlDb you could process the segment to collect a 
partial list of gone pages.
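
For example (the number of days is arbitrary, just larger than your fetch 
interval):

bin/nutch generate crawl/crawldb crawl/segments -adddays 31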


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to index files only with specific type

2009-10-27 Thread Andrzej Bialecki

Dmitriy Fundak wrote:

If I disable the html parser (remove parse-html from the plugin.includes
property), html files don't get parsed,
so I don't get outlinks to kml files from html,
so I can't parse and index kml files.
I might not be right, but I have a feeling that it's not possible
without modifying source code.


It's possible to do this with a custom indexing filter - see other 
indexing filters to get a feeling of what's involved. Or you could do 
this with a scoring filter, too, although the scoring API looks more 
complicated.


Either way, when you execute the Indexer, these filters are run in a 
chain, and if one of them returns null then that document is discarded, 
i.e. it's not added to the output index. So, it's easy to examine in 
your indexing filter the content type (or just a URL of the document) 
and either pass the document on or reject it by returning null.
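
A bare-bones sketch of such a filter method (the .kml check is just an 
example; the signature is the 1.0 IndexingFilter one, so verify it against 
your sources):

public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  // keep only .kml documents; returning null discards everything else from the index
  if (!url.toString().toLowerCase().endsWith(".kml")) {
    return null;
  }
  return doc;
}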



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread Andrzej Bialecki

caezar wrote:

Some more information. Debugging the reduce method, I've noticed that before the code
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData == null) {
  return; // only have inlinks
}
my page has fetchDatum, parseText and parseData not null, but dbDatum is
null. That's why it's skipped :) 
Any ideas about the reason?


Yes - you should run updatedb with this segment, and also run 
invertlinks with this segment, _before_ trying to index. Otherwise the 
db status won't be updated properly.
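
For example, the sequence would look roughly like this (the segment name is 
a placeholder):

bin/nutch updatedb crawl/crawldb crawl/segments/20091028123456
bin/nutch invertlinks crawl/linkdb crawl/segments/20091028123456
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/20091028123456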



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: unbalanced fetching

2009-10-29 Thread Andrzej Bialecki

Jesse Hires wrote:

I have a two datanode and one namenode setup. One of my datanodes is slower
than the other, causing the fetch to run significantly longer on it. Is
there a way to balance this out?


Most likely the number of URLs/host is unbalanced, meaning that the 
tasktracker that takes the longest is assigned a lot of URLs from a 
single host.


A workaround for this is to limit the max number of URLs per host (in 
nutch-site.xml) to a more reasonable number, e.g. 100 or 1000, whatever 
works best for you.
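
If I recall correctly, the relevant property is generate.max.per.host - 
please verify the name in nutch-default.xml. A sketch for nutch-site.xml:

<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
</property>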


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: updatedb is talking long long time

2009-11-02 Thread Andrzej Bialecki

Kalaimathan Mahenthiran wrote:

I forgot to add the detail...

The segment i'm trying to do updatedb on has 1.3 millions urls fetched
and 1.08 million urls parsed..

Any help related to this would be appreciated...


On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
matha...@gmail.com wrote:

hi everyone

I'm using nutch 1.0. I have fetched successfully and currently on the
updatedb process. I'm doing updatedb and its taking so long. I don't
know why its taking this long. I have a new machine with quad core
processor and 8 gb of ram.

I believe this system is really good in terms of processing power. I
don't think processing power is the problem here. I noticed that all
the RAM is getting used up, close to 7.7 GB, by the updatedb process.
The computer is becoming really slow.

The updatedb process has been running for the last 19 days continuously
with the message "merging segment data into db". Does anyone know why
it's taking so long? Is there any configuration setting I can use to
increase the speed of the updatedb process?


First, this process normally takes just a few minutes, depending on the 
hardware, and not several days - so something is wrong.


* do you run this in local or pseudo-distributed mode (i.e. running a 
real jobtracker and tasktracker?) Try the pseudo-distributed mode, 
because then you can monitor the progress in the web UI.


* how many reduce tasks do you have? with large updates it helps if you 
run > 1 reducer, to split the final sorting.


* if the task appears to be completely stuck, please generate a thread 
dump (kill -SIGQUIT) and see where it's stuck. This could be related to 
urlfilter-regex or urlnormalizer-regex - you can identify if these are 
problematic by removing them from the config and re-running the operation.


* minor issue - when specifying the path names of segments and crawldb, 
do NOT append the trailing slash - it's not harmful in this particular 
case, but you could have a nasty surprise when doing e.g. copy / mv 
operations ...
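
For example, to get a thread dump of a stuck task child (the pid is a 
placeholder):

jps -l                 # find the pid of the task child (or the local job runner)
kill -QUIT <pid>       # the thread dump goes to that process's stdout log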


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: including code between plugins

2009-11-02 Thread Andrzej Bialecki

Eran Zinman wrote:

Hi,

I've written my own plugin that's doing some custom parsing.

I needed language parsing in that plugin, and the language-identifier
plugin is working great for my needs.

However, I can't use the language identifier plugin as it is, since I want
to parse only a small portion of the webpage.

I've used the language identifier functions and it worked great in eclipse,
but when I try to compile my plugin I'm unable to compile it since it
depends on the language-identifier source code.

My question is - how can I include the language identifier code in my plugin
code without actually using the language-identifier plugin?


You need to add the language-identifier plugin to the requires section 
in your plugin.xml, like this:


   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="language-identifier"/>
   </requires>


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: could you unsubscribe me from this mailing list pls. tks

2009-11-02 Thread Andrzej Bialecki

Nico Sabbi wrote:

Il giorno lun, 02/11/2009 alle 10.04 +0100, Heiko Dietze ha scritto:

Hello,

there is no Administrator, but you can do the unsubscribe yourself. On 
the Nutch mailing-list information site


http://lucene.apache.org/nutch/mailing_lists.html

you can find the following E-Mail address:

nutch-user-unsubscr...@lucene.apache.org

Then your unsubscribe requests should work.

regards,

Heiko Dietze


doesn't work, as reported by me and others last week.
Thanks,


Did you get the message with the subject of confirm unsubscribe from 
nutch-user@lucene.apache.org and did you respond to it from the same 
email account that you were subscribed from?




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Unsubscribe step-by-step (Re: could you unsubscribe me from this mailing list pls. tks)

2009-11-02 Thread Andrzej Bialecki

Andrzej Bialecki wrote:


doesn't work, as reported by me and others last week.
Thanks,


Did you get the message with the subject of confirm unsubscribe from 
nutch-user@lucene.apache.org and did you respond to it from the same 
email account that you were subscribed from?


.. I just verified that this process works correctly - I subscribed and 
unsubscribed successfully. Please make sure that you complete the 
unsubscription process as listed below:


1. make sure you are sending requests from the same email address that 
you were subscribed from!

2. send email to nutch-user-subscr...@lucene.apache.org .
3. you will get a confirm unsubscribe message - make sure your 
anti-spam filters don't block this message, and make sure you are still 
using the correct email account when responding.

4. you need to reply to the confirm unsubscribe message (duh...)
5. you will get a GOODBYE message.

Now, let me understand this clearly: did you go through all 5 steps 
listed above, and you are still getting messages from this list?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Direct Access to Cached Data

2009-11-05 Thread Andrzej Bialecki

Hugo Pinto wrote:

Hello,

I am using Nutch for mirroring, rather than crawling and indexing.
I need to access directly the cached data in my Nutch index, but I am
unable to find an easy way to do so.
I browsed the documentation(wiki, javadocs, and skimmed the code), but
found no straightforward way to do it.
Would anyone suggest a place to look for more information, or perhaps
have done this before and could share a few tips?


Most likely what you need is not the Lucene index, but the segments 
(shards), right? There's a utility called SegmentReader (available from 
cmd-line as readseg), and you can use its API to retrieve either all or 
individual records from a segment (using URL as key).
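
For example, from the command line (the segment path and URL below are placeholders), something like this prints the stored records for a single URL:

bin/nutch readseg -get crawl/segments/20091105123456 \
  http://www.example.com/page.html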



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Nutch near future - strategic directions

2009-11-09 Thread Andrzej Bialecki
. We should make Nutch an attractive platform 
for such users, and we should discuss what this entails. Also, if we 
refactor Nutch in the way I described above, it will be easier for such 
users to contribute back to Nutch and other related projects.


3. Provide a platform for solving the really interesting issues
---
Nutch has many bits and pieces that implement really smart algorithms 
and heuristics to solve difficult issues that occur in crawling. The 
problem is that they are often well hidden and poorly documented, and 
their interaction with the rest of the system is far from obvious. 
Sometimes this is related to premature performance optimizations, in 
other cases this is just a poorly abstracted design. Examples would 
include the OPIC scoring, meta-tags & metadata handling, deduplication, 
redirection handling, etc.


Even though these components are usually implemented as plugins, this 
lack of transparency and poor design makes it difficult to experiment 
with Nutch. I believe that improving this area will result in many more 
users contributing back to the project, both from business and from 
academia.


And there are quite a few interesting challenges to solve:

* crawl scheduling, i.e. determining the order and composition of 
fetchlists to maximize the crawling speed.


* spam & junk detection (I won't go into details on this, there are tons 
of literature on the subject)


* crawler trap handling (e.g. the classic calendar page that generates an 
infinite number of pages).


* enterprise-specific ranking and scoring. This includes users' feedback 
(explicit and implicit, e.g. click-throughs)


* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)

* near-duplicate detection, and closely related issue of extraction of 
the main content from a templated page.


* URL aliasing (e.g. www.a.com == a.com == a.com/index.html == 
a.com/default.asp), and what happens with inlinks to such aliased pages. 
Also related to this is the problem of temporary/permanent redirects and 
complete mirrors.


Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an 
attractive platform to develop and experiment with such components.


-
Briefly ;) that's what comes to my mind when I think about the future of 
Nutch. I invite you all to share your thoughts and suggestions!


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: changing/addding field in existing index

2009-11-09 Thread Andrzej Bialecki

fa...@butterflycluster.net wrote:

hi all,

I have an existing index - we have a custom field that needs to be added
or changed in every currently indexed document;

What's the best way to go about this without recreating the index?


There are ways to do it directly on the index, but this is complicated 
and involves hacking the low-level Lucene format. Alternatively, you 
could build a parallel index with just these fields, but synchronized 
internal docId-s, open both indexes with ParallelReader, and then create 
a new index using IndexWriter.addIndexes().


I suggest recreating the index.
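
If you do want to try the parallel-index route, here is a rough sketch against the Lucene API bundled with Nutch 1.0 (directory paths and the analyzer are placeholders; both source indexes must contain the same documents in the same internal order):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.store.FSDirectory;

public class MergeParallelIndex {
  public static void main(String[] args) throws IOException {
    // both indexes must contain the same docs in the same internal order
    ParallelReader pr = new ParallelReader();
    pr.add(IndexReader.open(FSDirectory.getDirectory("crawl/index")));
    pr.add(IndexReader.open(FSDirectory.getDirectory("crawl/index-extra")));

    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("crawl/index-new"),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.addIndexes(new IndexReader[] { pr });  // rewrites into one index
    writer.optimize();
    writer.close();
    pr.close();
  }
}

In practice, recreating the index as suggested above is usually simpler and safer.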

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Problems with Hadoop source

2009-11-11 Thread Andrzej Bialecki

Pablo Aragón wrote:

Hej,

I am developing a project based on Nutch. It works great (in Eclipse) but
due to new requirements I have to change the library hadoop-0.12.2-core.jar
to the original source code.

I download succesfully that code in:
http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz. 


After adding it to the project in Eclipse everything seems correct but the
execution shows:

Exception in thread main java.io.IOException: No FileSystem for scheme:
file
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:103)

Any idea?


Yes - when you worked with a pre-built jar it contained an embedded 
hadoop-default.xml that defines the implementation of the file:// 
schema FileSystem. Now you probably forgot to put hadoop-default.xml on 
your classpath. Go to Build Path and add this file to your classpath, 
and all should be ok.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Hadoop question

2009-11-13 Thread Andrzej Bialecki

TuxRacer69 wrote:

Hi Eran,

mapreduce has to store its data on HDFS file system.


More specifically, it needs read/write access to a shared filesystem. If 
you are brave enough you can use NFS, too, or any other type of 
filesystem that can be mounted locally on each node (e.g. a NetApp).


But if you want to separate the two groups of servers, you could build 
two separate HDFS filesystems. To separate the two setups, you will need 
to make sure there is no cross communication between the two parts,


You can run two separate clusters even on the same set of machines, just 
 configure them to use different ports AND different local paths.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Synonym Filter with Nutch

2009-11-13 Thread Andrzej Bialecki

Dharan Althuru wrote:

Hi,


We are trying to incorporate synonym filter during indexing using Nutch. As
per my understanding Nutch doesn’t have synonym indexing plug-in by default.
Can we extend IndexFilter in Nutch to incorporate the synonym filter plug-in
available in Lucene using WordNet or custom synonym plug-in without any
negative impacts to existing Nutch indexing (i.e., considering bigram etc).


Synonym expansion should be done when the text is analyzed (using 
Analyzers), so you can reuse the Lucene's synonym filter.


Unfortunately, this happens at different stages depending on whether you 
use the built-in Lucene indexer, or the Solr indexer.


If you use the Lucene indexer, this happens in LuceneWriter, and the 
only way to affect it is to implement an analysis plugin, so that it's 
returned from AnalyzerFactory, and use your analysis plugin instead of 
the default one. See e.g. analysis-fr for an example of how to implement 
such plugin.


However, when you index to Solr you need to configure the Solr's 
analysis chain, i.e. in your schema.xml you need to define for your 
fieldType that it has the synonym filter in its indexing analysis chain.
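
For example, a schema.xml sketch (the field type name and synonyms file name are placeholders; the factories are standard Solr 1.x classes):

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>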


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch near future - strategic directions

2009-11-16 Thread Andrzej Bialecki

Subhojit Roy wrote:

Hi,

Would it be possible to include in Nutch, the ability to crawl & download a
page only if the page has been updated since the last crawl? I had read
sometime back that there were plans to include such a feature. It would be a
very useful feature to have IMO. This of course depends on the last
modified timestamp being present on the webpage that is being crawled,
which I believe is not mandatory. Still those who do set it would benefit.


This is already implemented - see the Signature / MD5Signature / 
TextProfileSignature.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: decoding nutch readseg -dump 's output

2009-11-16 Thread Andrzej Bialecki

Yves Petinot wrote:

Hi,

I'm trying to build a small perl (could be any scripting language) 
utility that takes nutch readseg -dump 's output as its input, decodes 
the content field to utf-8 (independent of what encoding the raw page 
was in) and outputs that decoded content. After a little bit of 
experimentation, i find myself unable to decode the content field, even 
when i try using the various charset hints that are available either in 
the content metadata, or in the raw content itself.


I was wondering if someone on the list has already succeeded in building 
this type of functionality, or is the content returned by readseg using 
a specific encoding that i don't know of ?


The dump functionality is not intended to provide a bit-by-bit copy of 
the segment, it's mostly for debugging purposes. It uses System.out, 
which in turn uses the default platform encoding - any characters 
outside this encoding will be replaced by question marks.


If you want to get an exact copy of the raw binary content then please 
use the SegmentReader API.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Scalability for one site

2009-11-16 Thread Andrzej Bialecki

Mark Kerzner wrote:

Hi,

I want to politely crawl a site with 1-2 million pages. With the speed of
about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
and can I coordinate the crawlers so as not to cause a DOS attack?


Your Hadoop cluster does not increase the scalability of the target 
server and that's the crux of the matter - whether you use Hadoop or 
not, multiple threads or a single thread, if you want to be polite you 
will be able to do just 1 req/sec and that's it.


You can prioritize certain pages for fetching so that you get the most 
interesting pages first (whatever interesting means).



I know that URLs from one domain as assigned to one fetch segment, and
polite crawling is enforced. Should I use lower-level parts of Nutch?


The built-in limits are there to avoid causing pain for inexperienced 
search engine operators (and webmasters who are their victims). The 
source code is there, if you choose you can modify it to bypass these 
restrictions, just be aware of the consequences (and don't use Nutch 
as your user agent ;) ).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki

John Martyniak wrote:
Does anybody know of any concrete plans to update Nutch to Hadoop 0.20,  
0.21?


Something like a Nutch 1.1 release, get in some bug fixes and get 
current on Hadoop?


I think that should be one of the goals.

My 2 cents.


I'm planning to do this upgrade soon (~a week) - and I agree that we 
should have a 1.1 release in the near future.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch near future - strategic directions

2009-11-20 Thread Andrzej Bialecki

Sami Siren wrote:

Lots of good thoughts and ideas, easy to agree with.

Something for the ease of use category:
-allow running on top of plain vanilla hadoop


What does 'plain vanilla' mean here? Do you mean the current DB 
implementation? That's the idea, we should aim for an abstract layer 
that can accommodate both HBase and plain MapFile-s.



-split into reusable components with nice and clean public api
-publish mvn artifacts so developers can directly use mvn, ivy etc to 
pull required dependencies for their specific crawler


+1, with slight preference towards ivy.



My biggest concern is in execution of this (or any other) plan.
Some of the changes or improvements that have been proposed are quite 
heavy in nature and would require large changes. I am just wondering 
whether it would be better to take a fresh start instead of trying to 
do this incrementally on top of the existing code base.


Well ... that's (almost) what Dogacan did with the HBase port. I agree 
that we should not feel too constrained by the existing code base, but 
it would be silly to throw everything away and start from scratch - we 
need to find a middle ground. The crawler-commons and Tika projects 
should help us to get rid of the ballast and significantly reduce the 
size of our code.


In the history of Nutch this approach is not something new (remember map 
reduce?) and in my opinion it worked nicely then. Perhaps it is 
different this time since the changes we are discussing now have many 
abstract things hanging in the air, even fundamental ones.


Nutch 0.7 to 0.8 reused a lot of the existing code.



Of course the rewrite approach means that it will take some time before 
we actually get into the point where we can start adding real substance 
(meaning new features etc).


So to summarize, I would go ahead and put together a branch nutch N.0 
that would consist of (a.k.a my wish list, hope I am not being too 
aggressive here):


-runs on top of plain hadoop


See above - what do you mean by that?

-use osgi (or some other more optimal extension mechanism that fits and 
is easy to use)
-basic http/https crawling functionality (with db abstraction or hbase 
directly and smart data structures that allow flexible and efficient 
usage of the data)

-basic solr integration for indexing/search
-basic parsing with tika

After the basics are ok we would start adding and promoting any of the 
hidden gems we might have, or some solutions for the interesting 
challenges.


I believe that's more or less where Dogacan's port is right now, except 
it's not merged with the OSGI port.


ps. many of the interesting challenges in your proposal seem to fall in 
the category of data analysis and manipulation that are mostly, used 
after the data has been crawled or between the fetch cycles so many of 
those could be implemented into current code base also, somehow I just 
feel that things could be made more efficient and understandable if the 
foundation (eg. data structures, extendability for example) was in 
better shape. Also if written nicely other projects could use them too!


Definitely agree with this. Example: the PageRank package - it works 
quite well with the current code, but its design is obscured by the 
ScoringFilter api and the need to maintain its own extended DB-s.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki

Dennis Kubes wrote:
I would like to get a couple things in this release as well.  Let me 
know if you want help with the upgrade.


You mean you want to do the Hadoop upgrade? I won't stand in your way :)

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch upgrade to Hadoop

2009-11-21 Thread Andrzej Bialecki

Dennis Kubes wrote:
I have created NUTCH-768.  I am in the middle of testing a few thousand 
page crawl for the most recent released version of Hadoop 0.20.1. 
Everything passes unit tests fine and there are no interface breaks. 
Looks like it will be an easy upgrade so far :)


Great, thanks!

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: AbstractFetchSchedule

2009-11-22 Thread Andrzej Bialecki

reinhard schwab wrote:

there is some piece of code I don't understand:

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    // pages are never truly GONE - we have to check them from time to time.
    // pages with too long fetchInterval are adjusted so that they fit within
    // maximum fetchInterval (segment retention period).
    if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
      datum.setFetchInterval(maxInterval * 0.9f);
      datum.setFetchTime(curTime);
    }
    if (datum.getFetchTime() > curTime) {
      return false;   // not time yet
    }
    return true;
  }


First, concerning the segment retention - we want to enforce that pages 
that were not refreshed longer than maxInterval should be retried, no 
matter what is their status - because we want to obtain a copy of the 
page in a newer segment in order to be able to delete the old segment.




why is the fetch time set here to curTime?


Because we want to fetch it now - see the next line where this condition 
is checked.



and why is the fetch interval set to maxInterval * 0.9f without
checking the current value of fetchInterval?


Hm, indeed this looks like a bug - we should instead do like this:

if (datum.getFetchInterval() > maxInterval) {
  datum.setFetchInterval(maxInterval * 0.9f);
}



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki

BELLINI ADAM wrote:


hi,

dedup doesn't work for me.
I have read that duplicates have either the same contents (via MD5 hash) or
the same URL. In my case I don't have the same URLs, but I still have the same
contents for those URLs.
To give an example, I have three urls that have the same content:

1- www.domaine/folder/
2- www.domaine/folder/index.html
3- www.domaine/folder/index.html?lang=fr

but I find all of them in my index :(
I was expecting that dedup would delete 1 and 2.

The dedup doesn't work correctly!


Please check the value of the Signature field for all the above urls in 
your crawldb.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki

BELLINI ADAM wrote:
Yes, I checked the signatures and they are not the same! It's really weird.

The url www.domaine/folder/index.html?lang=fr is just the same page as
www.domaine/folder/index.html


Apparently it isn't a bit-exact replica of the page, so its MD5 hash is 
different. You need to use a more relaxed Signature implementation, e.g. 
TextProfileSignature.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: dedup dont delete duplicates !

2009-11-25 Thread Andrzej Bialecki

BELLINI ADAM wrote:

hi,

my two urls point to the same page!


Please, no need to shout ...

If the MD5 signatures are different, then the binary content of these 
pages is different, period.


Use readseg -dump utility to retrieve the page content from the segment, 
extract just the two pages from the dump, and run a unix diff utility.



Can you tell me more about TextProfileSignature, please? How should I
use it?


Configure this type of signature in your nutch-site.xml - please see the 
nutch-default.xml for instructions. Please note that you will have to 
re-parse segments and update the db in order to update the signatures.
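
Roughly, the nutch-site.xml entry would look like this (property and class names as in Nutch 1.0's nutch-default.xml; the description text is just a summary):

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Signature implementation used for detecting duplicate content;
  TextProfileSignature tolerates small differences in the raw content.</description>
</property>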




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch config IOException

2009-11-25 Thread Andrzej Bialecki

Mischa Tuffield wrote:
Hello Again, 

Following my previous post below, I have noticed that I get the following IOException every time I attempt to use nutch. 


!--
2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:176)
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:164)
at org.apache.hadoop.hdfs.protocol.FSConstants.<clinit>(FSConstants.java:51)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2757)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

--

Any pointers would be great. I wonder whether there is a way for me to validate my conf 
options before I deploy nutch?


This exception is innocuous - it helps to debug at which points in the 
code the Configuration instances are being created. And you wouldn't 
have seen this if you didn't turn on the DEBUG logging. ;)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: 100 fetches per second?

2009-11-25 Thread Andrzej Bialecki

MilleBii wrote:

I have to say that I'm still puzzled. Here is the latest. I just restarted a
run and then guess what :

got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only get
3Mbit/s max before (nota bits and not bytes as I said before).
A few samples show that I was running at 50 Fetches/sec ... not bad. But why
this high-speed on this run I haven't got the faintest idea.


Then it drops and I get this kind of logs:

2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes, but
never mind - it is likely the end of the run, since hadoop shows 99.99%
complete for the 2 maps it generated.

Would that be explained by a better URL mix 


I suspect that you have a bunch of hosts that slowly trickle the 
content, i.e. requests don't time out, crawl-delay is low, but the 
download speed is very very low due to the limits at their end (either 
physical or artificial).


The solution in that case would be to track a minimum avg. speed per 
FetchQueue, and lock-out the queue if this number crosses the threshold 
(similarly to what we do when we discover a crawl-delay that is too high).


In the meantime, you could add the number of FetchQueue-s to that 
diagnostic output, to see how many unique hosts are in the current 
working set.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Broken segments ?

2009-11-26 Thread Andrzej Bialecki

Mischa Tuffield wrote:

Hello All,


http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply 
to an existing message, instead start a fresh email.  Even if you change 
the subject line of your email, other mail headers still track which 
thread you replied to and your question is hidden in that thread and 
gets less attention.   It makes following discussions in the mailing 
list archives particularly difficult.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Encoding the content got from Fetcher

2009-11-27 Thread Andrzej Bialecki

Santiago Pérez wrote:

Yes, I tried setting in that configuration file the latin encoding
Windows-1250, but the value of this property does not affect the encoding
of the content (I also tried with a nonexistent encoding and the result is the
same...)

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish people, surely...)


The appearance of characters that you quoted in your other email 
indicates that the problem may be the opposite - your pages seem to use 
UTF-8, and you are trying to convert them using Windows-1250 ... Try 
putting UTF-8 in this property, and see what happens.


Generally speaking, pages should declare their encoding, either in HTTP 
headers or in meta tags, but often this declaration is either missing 
or completely wrong. Nutch uses ICU4J CharsetDetector plus its own 
heuristic (in util.EncodingDetector and in HtmlParser) that tries to 
detect character encoding if it's missing or even if it's wrong - but 
this is a tricky issue and sometimes results are unpredictable.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki

MilleBii wrote:

Interesting updates on the current run of 450K urls :
+ 30minutes @ 3Mbits/s
+ drop to 1Mbit/s (1/X shape)
+ gradual improvement to 1.5 Mbit/s and steady for 7 hours
+ sudden drop to 0.9 Mbits/s and steady for 4 hours
+ up to 1.7 Mbits for 1hour
+ staircasing down to 0.5 Mbit/s by steps of 1 hour

I don't know what to take as a conclusion, but it is quite strange to have
those sudden variation of bandwidth and overall very slow.
I can post the graph if people are interested.


This most likely comes from the allocation of urls to map tasks, and the 
maximum number of map tasks that you can run on your cluster. when tasks 
finish their run, you see a sudden drop in speed, until the next task 
starts running. Initially, I suspect that you have more tasks available 
than the capacity of your cluster, so it's easy to fill the slots and 
max the speed. Later on, slow map tasks tend to hang around, but still 
some of them finish and make space for new tasks. As time goes on, the 
majority of your tasks become slow tasks, so the overall speed 
continues to drop.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki

MilleBii wrote:

You mean map/reduce tasks ???


Yes.


Being in pseudo-distributed / single node I only have two maps during the
fetch phase... so it would be back to the URLs distribution.


Well, yes, but my explanation is still valid. Which unfortunately 
doesn't change the situation.


Next week I will be working on integrating the patches from Julien, and 
if time permits I could perhaps start working on a speed monitoring to 
lock out slow servers.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki

Paul Tomblin wrote:

My nutch crawl just stopped.  The process is still there, and doesn't
respond to a kill -TERM or a kill -HUP, but it hasn't written
anything to the log file in the last 40 minutes.  The last thing it
logged was some calls to my custom url filter.  Nothing has been
written in the hadoop directory or the crawldir/crawldb or the
segments dir in that time.

How can I tell what's going on and why it's stopped?


If you run in distributed / pseudo-distributed mode, you can check the 
status in the JobTracker UI. If you are running in local mode, then 
it's likely that the process is in a (single) reduce phase sorting the 
data - with larger jobs in local mode the sorting phase may take very 
long time, due to a heavy disk IO (and in disk-wait state it may be 
uninterruptible).


Try to generate a thread dump to see what code is being executed.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki

Paul Tomblin wrote:

On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki a...@getopt.org wrote:

Paul Tomblin wrote:



-bash-3.2$ jstack -F 32507
Attaching to process ID 32507, please wait...

Hm, I can't see anything obviously wrong with that thread dump. What's the
CPU and swap usage, and loadavg?


The process is using a lot of CPU.  loadavg is up over 5.

top - 15:12:19 up 22 days,  4:06,  2 users,  load average: 5.01, 5.00, 4.93
Tasks:  48 total,   2 running,  45 sleeping,   0 stopped,   1 zombie
Cpu(s):  1.0% us, 99.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   3170584k total,  2231700k used,   938884k free,0k buffers
Swap:0k total,0k used,0k free,0k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
32507 discover  16   0 1163m 974m 8604 S 394.7 31.5 719:40.71 java

Actually, the memory is a real annoyance - the hosting company doesn't
give me any swap, so when hadoop does a fork/exec just to do a
whoami, I have to leave as much memory free as the crawl reserves
with -Xmx for itself.


Hm, the curious thing here is that the java process is sleeping, and 99% 
of cpu is in system time ... usually this would indicate swapping, but 
since there is no swap in your setup I'm stumped. Still, this may be 
related to the weird memory/swap setup on that machine - try decreasing 
the heap size and see what happens.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: odd warnings

2009-12-01 Thread Andrzej Bialecki

Jesse Hires wrote:

What is segments.gen and segments_2 ?
The warning I am getting happens when I dedup two indexes.

I create index1 and index2 through generate/fetch/index/...etc
index1 is an index of 1/2 the segments. index2 is an index of the other 1/2

The warning is happening on both datanodes.

The command I am running is bin/nutch dedup crawl/index1 crawl/index2

If segments.gen and segments_2 are supposed to be directories, then why are
they created as files?

They are created as files from the start
bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
crawl/segments/YYY

I don't see any errors or warnings about creating the index.


The command that you quote above produces multiple partial indexes, 
located in crawl/index1/part-N and only in these subdirectories the 
Lucene indexes can be found. However, the deduplication process doesn't 
accept partial indexes, so you need to specify each /part- dir as an 
input to dedup.
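
For example (the part-XXXXX names below are placeholders - they depend on how many reduce tasks produced each index):

bin/nutch dedup crawl/index1/part-00000 crawl/index1/part-00001 \
                crawl/index2/part-00000 crawl/index2/part-00001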



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: org.apache.hadoop.util.DiskChecker$DiskErrorExceptio

2009-12-02 Thread Andrzej Bialecki

BELLINI ADAM wrote:

hi,
i have this error when crawling

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid 
local directory for 
taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out


Most likely you ran out of tmp disk space.
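
If that's the case, one option is to point Hadoop's temporary space at a bigger partition, e.g. in hadoop-site.xml (the path is just an example):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/bigdisk/hadoop-tmp</value>
</property>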


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How does generate work ?

2009-12-03 Thread Andrzej Bialecki

MilleBii wrote:

Oops continuing previous mail.

So I wonder if there would be a better 'generate' algorithm which
would maintain a constant rate of hosts per 100 urls ... Below a certain
threshold it stops, or better, starts including URLs of lower scores.


That's exactly how the max.urls.per.host limit works.



Using scores is de-optimizing the fetching process... Having said that
I should first read the code and try to understand it.


That wouldn't hurt in any case ;)

There is also a method in ScoringFilter-s (e.g. the default 
scoring-opic), where it determines the priority of URL during 
generation. See ScoringFilter.generatorSortValue(..), you can modify 
this method in scoring-opic (or in your own scoring filter) to 
prioritize certain urls over others.
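
As an illustration only, a rough sketch of that method in a custom scoring filter (the signature follows the Nutch 1.0 ScoringFilter interface; the host-based boost is invented for the example):

// Sketch: boost URLs from a preferred host so they are generated first.
public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
    throws ScoringFilterException {
  float sort = datum.getScore() * initSort;  // same as scoring-opic
  if (url.toString().contains("www.example.com")) {
    sort *= 10.0f;                           // made-up priority boost
  }
  return sort;
}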


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 1.0 wml plugin

2009-12-07 Thread Andrzej Bialecki

yangfeng wrote:

I have completed a plugin for parsing WML (Wireless Markup Language). I
hope to add it to Lucene - what should I do?



The best long-term option would be to submit this work to the Tika 
project - see http://lucene.apache.org/tika/. If you already implemented 
this as a Nutch plugin, please create a JIRA issue in Nutch, and attach 
the patch.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Andrzej Bialecki

Eran Zinman wrote:

Hi,

Sorry to bother you guys again, but it seems that no matter what I do I
can't run the new version of Nutch with Hadoop 0.20.

I am getting the following exceptions in my logs when I execute
bin/start-all.sh


Do you use the scripts in place, i.e. without deploying the nutch*.job 
to a separate Hadoop cluster? Could you please try it with a standalone 
Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)?



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Andrzej Bialecki

On 2009-12-10 20:33, Kirby Bohling wrote:

On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAMmbel...@msn.com  wrote:


hi,

I have a page with <meta name="robots" content="noindex,nofollow" />. Now, I know 
that nutch obeys this tag because I don't find the content and the title in my index, but I was 
expecting that this document would not be present in the index at all. Why does it keep the document in my index with 
no title and no content?

I'm using the index-basic and index-more plugins, and I want to understand why 
nutch still fills in the url, date, boost, etc. since it didn't do it for the title and 
content.

I was thinking that if nutch obeys nofollow and noindex it would skip 
the whole document!

Or maybe I misunderstood something - can you please explain this behavior to me?

best regards.



My guess is that the page is recorded to note that the page shouldn't
be fetched, I'm guessing the status is one of the magic values.  It
probably re-fetches the page periodically to ensure it has the list.
So the URL and the date make sense to me as to why they populate them.
  I don't know why it is computing the boost, other than the fact that
it might be part of the OPIC scoring algorithm.  If the scoring
algorithm ever uses the scores/boost of the pages that you point at as
a contributing factor, it would make total sense.  So even though it
doesn't index http://example/foo/bar, knowing which pages point
there, and what their scores are could contribute scores of pages that
you do index, that contain an outlink to that page.


Very good explanation, that's exactly the reasons why Nutch never 
discards such pages. If you really want to ignore certain pages, then 
use URLFilters and/or ScoringFilters.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: domain vs www.domain?

2009-12-10 Thread Andrzej Bialecki

On 2009-12-10 19:59, Jesse Hires wrote:

I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting; if
not, what would you recommend doing to prevent this?


This is a surprisingly difficult problem to solve in general case, 
because it's not always true that 'www.domain' equals 'domain'. If you 
do know this is true in your particular case, you can add a rule to 
regex-urlnormalizer that changes the matching urls to e.g. always lose 
the 'www.' part.
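
A hedged example of such a rule in regex-normalize.xml (domain.com stands for your own domain):

<regex>
  <pattern>^http://www\.domain\.com/</pattern>
  <substitution>http://domain.com/</substitution>
</regex>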



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Luke reading index in hdfs

2009-12-11 Thread Andrzej Bialecki

On 2009-12-11 22:21, MilleBii wrote:

Guys is there a way you can get Luke to read the index from hdfs:// ???
Or you have to copy it out to the local filesystem?



Luke 0.9.9 can open indexes directly from HDFS hosted on Hadoop 0.19.x.
Luke 0.9.9.1 can do the same, but uses Hadoop 0.20.1.

Start Luke, dismiss the open dialog, and then go to Plugins / Hadoop, 
and enter the full URL of the index directory (including the hdfs:// 
part). You can also open multiple parts of the index (e.g. if you follow 
the Nutch naming convention, you can directly open the indexes/ 
directory that contains part-N partial indexes).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: OR support

2009-12-14 Thread Andrzej Bialecki

On 2009-12-14 16:05, BrunoWL wrote:


Nobody?
Please, any answer would be good.


Please check this issue:

https://issues.apache.org/jira/browse/NUTCH-479

That's the current status, i.e. this functionality is available only as 
a patch.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Hadoop 0.20 - AlreadyBeingCreatedException

2009-12-17 Thread Andrzej Bialecki

On 2009-12-17 10:13, Eran Zinman wrote:

Hi,

I'm getting Nutch/Hadoop exception: AlreadyBeingCreatedException on some of
Nutch parser reduce tasks.

I know this is a known issue with Nutch (
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717058#action_12717058
)

And as far as I can see that patch wasn't committed yet because we wanted to
examine it on the new Hadoop 0.20 version. I am using latest Nutch with
Hadoop 0.20 and I can confirm this exception still occurs (rarely - but it
does) - maybe we should commit the change?


Thanks for reporting this - could you perhaps try to apply that patch 
and see if it helps? I hesitated to commit it because it's really a 
workaround and not a solution ... but if it works for you then it's 
better than nothing.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Large files - nutch failing to fetch

2009-12-21 Thread Andrzej Bialecki

On 2009-12-21 17:15, Sundara Kaku wrote:

Hi,

Nutch is throwing errors while fetching large files (files with a size of more
than 100mb). I have a website with pages that point to large files (file
size varies from 10mb to 500mb) and there are several large files on that
website. I want to fetch all the files using Nutch, but nutch is throwing an
outofmemory exception for large files (I have set the heap size to 2500m); with
heap memory of 2500m, files of 250mb are retrieved but larger than that
are failing,
and nutch takes a lot of time after printing
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0

if there are three files with a size of 100mb each then it is failing (at the
same depth, with heap size 2500m) to fetch the files.

i have set http.content.limit to -1

is there a way to fetch several large files using nutch?

I am using nutch as a web crawler; I am not using indexing. I want to download
web resources and scan them for viruses using ClamAV.


Probably Nutch is not the right tool for you - you should probably use 
wget. Nutch was designed to fetch many pages of limited size - as a 
temporary step it caches the downloaded content in memory, before 
flushing it out to disk.


(I had to solve this limitation once for a specific case - the solution 
was to implement a variant of the protocol and Content that stored data 
into separate HDFS files without buffering in memory - but it was a 
brittle hack that only worked for that particular scenario).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki

On 2009-12-22 13:16, Claudio Martella wrote:

Yes, I'am aware of that. The problem is that i have some fields of the
SolrDocument that i want to compute by text analysis (basically i want
to do some smart keywords extraction) so i have to get in the middle
between crawling and indexing! My actual solution is to dump the content
in a file through the segreader, parse it and then use SolrJ to send the
documents. Probably the best solution is to set my own analyzer for the
field on solr side, and do keywords extraction there.

Thanks for the script, you'll use it!


Likely the solution that you are looking for is an IndexingFilter - this 
receives a copy of the document with all fields collected just before 
it's sent to the indexing backend - and you can freely modify the 
content of NutchDocument, e.g. do additional analysis, add/remove/modify 
fields, etc.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki

On 2009-12-22 16:07, Claudio Martella wrote:

Andrzej Bialecki wrote:

On 2009-12-22 13:16, Claudio Martella wrote:

Yes, I'am aware of that. The problem is that i have some fields of the
SolrDocument that i want to compute by text analysis (basically i want
to do some smart keywords extraction) so i have to get in the middle
between crawling and indexing! My actual solution is to dump the content
in a file through the segreader, parse it and then use SolrJ to send the
documents. Probably the best solution is to set my own analyzer for the
field on solr side, and do keywords extraction there.

Thanks for the script, you'll use it!


Likely the solution that you are looking for is an IndexingFilter -
this receives a copy of the document with all fields collected just
before it's sent to the indexing backend - and you can freely modify
the content of NutchDocument, e.g. do additional analysis,
add/remove/modify fields, etc.


This sounds very interesting. So the idea is to take the NutchDocument
as it comes out of the crawling and modify it (inside of an
IndexingFilter) before it's sent to indexing (inside of nutch),  right?


Correct - IndexingFilter-s work no matter whether you use Nutch or Solr 
indexing.



So how does it relate to nutch schema and solr schema? Can you give me
some pointers?



Please take a look at how e.g. the index-more filter is implemented - 
basically you need to copy this filter and make whatever modifications 
you need ;)


Keep in mind that any fields that you create in NutchDocument need to be 
properly declared in schema.xml when using Solr indexing.
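
As a starting point, here is a minimal sketch of such a filter, assuming the Nutch 1.0 API (the class name and the "keywords" field are made up, and the keyword extraction is a placeholder; depending on your exact Nutch version the IndexingFilter interface may require additional methods, so compare with the bundled index-more plugin):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class KeywordIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String text = parse.getText();               // parsed plain text of the page
    doc.add("keywords", extractKeywords(text));  // "keywords" must exist in schema.xml
    return doc;
  }

  private String extractKeywords(String text) {
    // placeholder - put the real keyword-extraction logic here
    return text == null ? "" : text.trim();
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}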


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Dedup remove all duplicates

2010-01-06 Thread Andrzej Bialecki

On 2010-01-06 18:56, Pascal Dimassimo wrote:


Hi,

After I run the index command, my index contains 2 documents with the same
boost, digest, segment and title, but with different tstamp and url. When I
run the dedup command on that index, both documents are removed.

Should the document with the latest tstamp be kept?


It should; out of multiple documents with the same URL (url duplicates) 
only the most recent is retained - unless it was removed because there 
was another document in the index with the same content (a content 
duplicate).


Could you please verify this on a minimal index (2 documents), and if 
the problem persist please report this in JIRA.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Purging from Nutch after indexing with Solr

2010-01-08 Thread Andrzej Bialecki

On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote:

I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came with some
questions regarding data redundancy with this setup.

Considering the following sample segment:

2.0Gcontent
196Kcrawl_fetch
152Kcrawl_generate
376Kcrawl_parse
392Kparse_data
441Mparse_text

1. From what I have found through searches, content holds the raw fetched
content. Is there any problem if I remove it, i.e. does nutch need it to
apply any sort of logic when re-crawling that content/url?


No, they are no longer needed, unless you want to provide a cached 
view of the content.




2. Previous question applies to parse_data and parse_text after i've called
nutch solrindex on that segment.


Depends how you set up your search. If you search using NutchBean (i.e. 
the Nutch web application) then you need them. If you search using Solr, 
then you don't need them.




3. Using samples scritps and tutorials I'm always seeing invertlinks being
called over all segments, but its output mentions merging, when I
fetch/parse new segments can I call invertlinks only over them?


Yes, invertlinks will incrementally merge the existing linkdb with new 
links from a new segment.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Purging from Nutch after indexing with Solr

2010-01-09 Thread Andrzej Bialecki

On 2010-01-09 10:18, MilleBii wrote:

@Andrzej,

To be more specific, if one uses cached content (which I do), what is the
minimal stuff to keep? I guess:
+ crawl_fetch
+ parse_data
+ parse_text

the rest is not used ... I guess, before I start testing could you confirm ?


crawl_fetch you can ignore - it's just the status of fetching, which 
should be by that time already integrated into crawldb (if you ran 
updatedb).


It's the content/ that you need to display cached view.



@Ulysse,

The other reason to keep all data is if you will need to reindex all
segments, which does happen in development  test phases, less in
production  though.


Right. Also, a common practice is to keep the raw data for a while just 
to make sure that the parsing and indexing went smoothly (in case you 
need to re-parse the raw content).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Adding additional metadata

2010-01-11 Thread Andrzej Bialecki

On 2010-01-11 13:18, Erlend Garåsen wrote:


First of all: I didn't know about the list archive, so sorry for not
searching that resource before I sent a new post.

MilleBii wrote:

For lastModified just enable the index|query-more plugins it will do
the job for you.


Unfortunately not. Our pages include Dublin core metadata which has a
Norwegian name.


For other meta, search the mailing list - it's explained many times how to
do it


I found several posts concerning metadata, but for me, one question is
still unanswered: Do I really have to create a lot of new classes/xml
files in order to store the content of just two metadata fields? I have not
managed to parse the content of the lastModified metadata after I tried
to rewrite the HtmlParser class. So I tried to add hard-coded metadata
values in HtmlParser like this instead:
entry.getValue().getData().getParseMeta().set("dato.endret", "01.01.2008");

My modified MoreIndexingFilter managed to pick up the hard coded values,
and the dates were successfully stored into my Solr Index after running
the solrindex option.

This means that it is not necessary to write a new MoreIndexingFilter
class, but I'm still unsure about the HtmlParser class since I haven't
managed to parse the content of the metadata.


You can of course hack your way through HtmlParser and add/remove/modify 
as you see fit - it's straightforward and likely you will get the result 
that you want.


However, as MilleBii suggests, the preferred way to do this would be to 
write a plugin. The reason is the cost of a long-term maintenance - if 
you ever want to sync up your local modified version of Nutch with the 
newer public release, your hacked copy of HtmlParser won't merge nicely, 
whereas if you put your code in a separate plugin then it might. Another 
reason is configurability - if you put this code in a separate plugin, 
you can easily turn it on/off, but if it sits in HtmlParser this would 
be more difficult to do.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Andrzej Bialecki

On 2010-01-11 18:40, Godmar Back wrote:

On Mon, Jan 11, 2010 at 12:30 PM, Fuad Efendif...@efendi.ca  wrote:

Googling reveals
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507 so you
could try increasing the Java stack size in bin/nutch (-Xss), or use
an alternate regexp if you can.

Just out of curiosity, why does a performance critical program such as
Nutch use Sun's backtracking-based regexp implementation rather than
an efficient Thompson-based one?  Do you need the additional
expressiveness provided by PCRE?



Very interesting point... we should use it for BIXO too.


BTW, SUN has memory leaks with LinkedBlockingQueue,
http://bugs.sun.com/view_bug.do?bug_id=6806875
http://tech.groups.yahoo.com/group/bixo-dev/message/329



I don't think we use this class in Nutch.



And, of course, URL is synchronized; Apache Tomcat uses a simplified version
of the URL class.
And, RegexUrlNormalizer is synchronized in Nutch...
And, in order to retrieve plain text from HTML we are creating a fat DOM
object (instead of using, for instance, filters in NekoHtml)


We are creating a DOM tree because it's much easier to write filtering 
plugins that work with a DOM tree than to implement Neko filters. Besides, 
we provide an option to use TagSoup for HTML parsing, which is not only 
more resilient to HTML errors but also more efficient.


Besides, Nutch is built around plugins. Deactivate parse-html and write 
your own HTML plugin that avoids these inefficiencies, and we'll be 
happy to include it in the distribution.



And more...



I'm no expert, but the reason I brought this up for discussion was
that I recently encountered a paper that pointed out that regular
expression matching accounts for a significant fraction of total
runtime in search engine indexers [1] and thus it's something that's
usually optimized.

  - Godmar

[1] http://portal.acm.org/citation.cfm?id=1542275.1542284


This StackOverflowError probably came from urlfilter-regex, which indeed 
uses the Java regex package, definitely one of the worst implementations. 
The reason it's used by default in Nutch is that it's standard in the JDK, 
FWIW.


For high-performance crawlers I usually do the following:

* avoid regex filtering completely, if possible, instead using a 
combination of prefix/suffix/domain/custom filters


* use urlfilter-automaton, which is slightly less expressive but much, 
much faster (see the config sketch below).
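
A minimal sketch of the plugin swap in conf/nutch-site.xml, assuming you 
otherwise stick to the defaults (trim the value to the plugins you actually 
use; urlfilter-automaton is backed by the dk.brics.automaton library and 
reads its own rules file in the same +/- format, just with the simpler 
automaton regex syntax):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(prefix|suffix|automaton)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>

The prefix/suffix filters then handle the cheap checks, and the automaton 
rules only need to cover whatever patterns remain.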


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Post Injecting ?

2010-01-15 Thread Andrzej Bialecki

On 2010-01-15 20:09, MilleBii wrote:

Inject is meant to seed the database at the start.

But I would like to inject new urls into a production crawldb. I think it
works, but I was wondering if somebody could confirm that.



Yes. New urls are merged with the old ones.
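
For reference, it is just the normal inject command pointed at the live db 
(paths here are examples):

  bin/nutch inject crawl/crawldb new_urls

where new_urls contains plain text files with one URL per line. As far as I 
remember, entries that already exist in the crawldb keep their current 
status and score; only genuinely new URLs get added.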


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: merge not working anymore

2010-01-18 Thread Andrzej Bialecki

On 2010-01-18 21:56, MilleBii wrote:

Help !!!

My production environment is blocked by this error.
I deleted the segment altogether and restarted crawl/fetch/parse... and I'm
still stuck, so I cannot add segments anymore.
Looks like an HDFS problem ???

2010-01-18 19:53:00,785 WARN  hdfs.DFSClient - DFS Read:
java.io.IOException: Could not obtain block: blk_-6931814167688802826_9735
file=/user/root/crawl/indexed-segments/20100117235244/part-0/_1lr.prx


This error is commonly caused by running out of disk space on a datanode.
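
A quick way to check (standard Hadoop commands, run from the namenode; the 
path is an example):

  bin/hadoop dfsadmin -report
  bin/hadoop fsck /user/root/crawl -files -blocks -locations

dfsadmin -report shows the remaining DFS capacity per datanode, and fsck 
reports missing or under-replicated blocks under the given path. If a 
datanode is (nearly) full, free some space or add nodes and re-run the job.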


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: About HBase Integration

2010-02-09 Thread Andrzej Bialecki

On 2010-02-09 03:08, Hua Su wrote:

Thanks. But Heritrix is another project, right?



Please see this Git repository; it contains the latest work in progress 
on Nutch+HBase:


git://github.com/dogacan/nutchbase.git
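
To grab a local copy, a plain clone is enough (and, as far as I remember, 
the build then works like a regular Nutch checkout, i.e. with ant):

  git clone git://github.com/dogacan/nutchbase.git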

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SegmentFilter

2010-02-20 Thread Andrzej Bialecki

On 2010-02-20 22:45, reinhard schwab wrote:

the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from

Recno:: 383
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519


Duplicate content is usually related to the fact that indeed the same 
content appears under different urls. This is common enough, so I don't 
see this necessarily as a bug in Nutch - we won't know that the content 
is identical until we actually fetch it...


Urls may differ in certain systematic ways (e.g. by a set of URL params, 
such as sessionId, print=yes, etc.) or in completely unrelated ways (human 
errors, peculiarities of the content management system, or mirrors). In 
your case it seems that the same page is available under different 
values of g2_highlightId.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SegmentFilter

2010-02-21 Thread Andrzej Bialecki

On 2010-02-20 23:32, reinhard schwab wrote:

Andrzej Bialecki wrote:

On 2010-02-20 22:45, reinhard schwab wrote:

the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from

Recno:: 383
URL::
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519


Duplicate content is usually related to the fact that indeed the same
content appears under different urls. This is common enough, so I
don't see this necessarily as a bug in Nutch - we won't know that the
content is identical until we actually fetch it...

Urls may differ in certain systematic ways (e.g. by a set of URL
params, such as sessionId, print=yes, etc.) or in completely unrelated ways
(human errors, peculiarities of the content management system, or
mirrors). In your case it seems that the same page is available under
different values of g2_highlightId.



I know. I have implemented several url filters to filter out duplicate
content. The difference here is that in this case the same content is
stored under the same url several times.
It is stored under
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
and not under
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519

The content for the latter url is empty.
Content:


Ok, then the answer can be found in the protocol status or parse status. 
You can get protocol status by doing a segment dump of only the 
crawl_fetch part (disable all other parts, then the output is less 
confusing). Similarly, parse status can be found in crawl_parse.
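
For example (option names from the 1.0 SegmentReader; run bin/nutch readseg 
with no arguments to see the usage for your version, and the segment path 
below is just a placeholder):

  bin/nutch readseg -dump crawl/segments/20100220XXXXXX dump_fetch \
      -nocontent -nogenerate -noparse -noparsedata -noparsetext

This leaves only the crawl_fetch records in the dump_fetch output, so the 
ProtocolStatus of each URL is easy to read. Swapping the flags around the 
same way (keep only crawl_parse) gives you the parse status instead.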





--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch v0.4

2010-02-25 Thread Andrzej Bialecki

On 2010-02-24 17:34, Pedro Bezunartea López wrote:

Hi Ashley,

Hi,

I'm looking to reproduce program analysis results based on Nutch v0.4. I
realize this is a very old release, but is it possible to obtain the source
from somewhere? I see some of the classes I'm looking for in v0.7, but I
need the older version to confirm it.
Thanks,
Ashley



You can get version 0.6 and higher from apache's archive:

http://archive.apache.org/dist/lucene/nutch/

... but I haven't found anything older,


AFAIK older releases of Nutch were archived only on that old SF site, 
and apparently that site no longer exists. Sorry :( However, you can 
still check out that code from the CVS repository at nutch.sf.net .



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Update on ignoring menu divs

2010-02-28 Thread Andrzej Bialecki

On 2010-02-28 18:42, Ian M. Evans wrote:

Using Nutch as a crawler for solr.

I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.

Is there a way to define a div by id that Nutch will strip out before
tossing the content into Solr?


There is no such functionality out of the box. One direction that is 
worth pursuing would be to create an HtmlParseFilter plugin that wraps 
the Boilerpipe library http://code.google.com/p/boilerpipe/ .
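
The core of such a plugin would be little more than a call into Boilerpipe; 
a rough sketch of just that step (the HtmlParseFilter plumbing around it is 
omitted, the class and method names below are from the Boilerpipe API, and 
the hard-coded UTF-8 is an assumption - real code should use the encoding 
detected for the Content):

  import de.l3s.boilerpipe.BoilerpipeProcessingException;
  import de.l3s.boilerpipe.extractors.ArticleExtractor;

  public class BoilerpipeHelper {
    // Strip navigation, headers, footers etc. and keep only the main text.
    public static String mainText(byte[] rawHtml)
        throws BoilerpipeProcessingException {
      String html = new String(rawHtml, java.nio.charset.Charset.forName("UTF-8"));
      return ArticleExtractor.INSTANCE.getText(html);
    }
  }

The result would then replace the parse text that gets sent to Solr. 
Whether ArticleExtractor or one of the other extractors works best depends 
on the site, so it's worth experimenting.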


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: New version of nutch?

2010-03-03 Thread Andrzej Bialecki

On 2010-03-03 20:12, John Martyniak wrote:

Does anybody have an idea of when a new version of Nutch will be
available, specifically one supporting the latest version of Hadoop? And
possibly HBase?

Thank you for any information.


We should roll out a 1.1 soon (a few weeks); the Nutch+HBase work is IMHO 
still a few months away.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Content of redirected urls empty

2010-03-08 Thread Andrzej Bialecki

On 2010-03-08 14:55, BELLINI ADAM wrote:



Any ideas, guys?



From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: Content of redirected urls empty
Date: Fri, 5 Mar 2010 22:01:05 +



hi,
the content of my redirected urls is empty... but they still have the other 
metadata.
I have an http url that is redirected to https.
In my index I find the http URL but with empty content...
Could you explain it please?


There are two ways to redirect - one is with protocol, and the other is 
with content (either meta refresh, or javascript).


When you dump the segment, is there really no content for the redirected 
url?



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: form-based authentication? Any progress

2010-03-10 Thread Andrzej Bialecki

On 2010-03-10 19:26, conficio wrote:



Susam Pal wrote:


Hi,

Indeed the answer is negative and also, many people have often asked
this in this list. Martin has very nicely explained the problems and
possible solutions. I'll just add what I have thought of. I have often
wondered what it would take to create a nice configurable cookie-based
authentication feature. The following file would be needed:
...
http://wiki.apache.org/nutch/HttpPostAuthentication


I was wondering if there has been any work done in this direction?

I guess the answer is still no.

Would the problem become easier if one targeted particular types of sites
first, such as popular wikis, bug trackers, blogs, CMSes, forums, or
document management systems?



I was involved in a project to implement this (as a proprietary plugin). 
In short, it requires a lot of effort, and there are no generic 
solutions. If it works with one site, it breaks with another, and 
eventually you end up with a nasty heap of hacks upon hacks. In that 
project we gave up after discovering that a large number of sites use 
Javascript to create and name the input controls, and they used a 
challenge-response with client-side scripts generating the response ... 
it was a total mess.


So, if you target 10 sites, you can make it work. If you target 10,000 
sites all using slightly different methods, then forget it.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Where are new linked entries added

2010-03-11 Thread Andrzej Bialecki

On 2010-03-11 15:53, nikinch wrote:


Hi everyone

I've been using Nutch for a while now and I've come up on a snag.

I'm trying to find where newly linked pages are added to the segment as a
specific entry.
To make myself clear, I've been through the fetch class and the CrawlDbFilter
and reducer.
But I'm looking for the initial point where, for a given page, the links are
transformed into segment entries. My objective here is to pass down the
initial inject url to all its linked pages, so when I create an entry for
the linked urls of a webpage I'll add metadata to their definition giving
them this originating url.
By the time I get to CrawlDbFilter I already have entries for linked pages
and have lost the notion of which seed url brought us here.
I thought the job would be done in the Fetcher, maybe in the output function,
but I'm not finding where it happens. So if anyone knows and could point me
in the right direction I'd appreciate it.


Currently the best place to do this is in your implementation of a 
ScoringFilter, in distributeScoreToOutlinks(). You can also modify one 
of the existing scoring plugins. I would advise against modifying the 
code directly in ParseOutputFormat, it's complex and fragile.
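
A rough, untested sketch of what that method could look like. The signature 
is from memory of the 1.0 ScoringFilter interface, so check ScoringFilter.java 
in your tree; Entry is java.util.Map.Entry, "seed.url" is a made-up metadata 
key that you would also have to carry forward in passScoreBeforeParsing() / 
passScoreAfterParsing(), and the rest of the plugin boilerplate is omitted:

  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
      throws ScoringFilterException {
    // Take the origin recorded for this page, or fall back to the page itself
    // (i.e. the page was a seed), and stamp it on every outlink's CrawlDatum.
    String seed = parseData.getMeta("seed.url");
    if (seed == null) seed = fromUrl.toString();
    for (Entry<Text, CrawlDatum> target : targets) {
      target.getValue().getMetaData().put(new Text("seed.url"), new Text(seed));
    }
    return adjust;
  }

The metadata then travels with the outlink entries into the segment 
(crawl_parse); whether and how it survives updatedb depends on the 
CrawlDbReducer, so that part is worth verifying for your version.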



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Avoid indexing common html to all pages, promoting page titles.

2010-03-12 Thread Andrzej Bialecki

On 2010-03-12 12:52, Pedro Bezunartea López wrote:

Hi,

I'm developing a site that shows the dynamic content in a div id=content;
the rest of the page doesn't really change. I'd like to store and index only
the contents of this div, to basically avoid re-indexing the same content
(header, footer, menu) over and over.

I've checked the WritingPluginExample-0.9 howto, but I couldn't figure out a
couple of things:

1.- Should I extend the parse-html plugin, or should I just replace it?


You should write an HtmlParseFilter and extract only the portions that 
you care about, and then replace the output parseText with your 
extracted text.



2.- The example talks about finding a meta tag, extracting some information
from it, and adding a field to the index. I think I just need to get rid of
all HTML except the div id=content element, and index its content. Can someone
point me in the right direction?


See above.


And just one more thing: I'd like to give a higher score to pages in which
the search terms appear in the title. Right now pages that contain the terms
in the body rank higher than those that contain them in the title; how could
I modify this behaviour?


You can define these weights in the configuration; look for the query boost 
properties.
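
For the title boost specifically, something like this in nutch-site.xml is 
the usual starting point (the property name comes from the query-basic 
plugin; grep nutch-default.xml for "boost" to see the full list and the 
current defaults, and 3.0 is just an arbitrary example value):

  <property>
    <name>query.title.boost</name>
    <value>3.0</value>
    <description>Boost matches in the title field more strongly.</description>
  </property>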


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


