Re: About link analysis and filter usage, and Recrawling

2008-03-12 Thread Enis Soztutar

Hi,

Vinci wrote:

Hi,

please see below for the follow up question

Enis Soztutar wrote:
  

3. If I need to process the crawled pages in a more flexible way, is it
better to dump the documents and process them without writing back, or to
write my own plugin for some phase? If I need to write a plugin, which phase
is the best point to implement my own extension?
  
  

This depends on what you want to do and with what kind of data. You should
be more specific.



I need to remove the unnecessary HTML, do an XSLT transformation (which will
handle the encoding issue for me), and generate files.
For the program I have, dumping everything out without writing back is much
preferred, but it looks like I would lose some information about the crawled
pages if I do that?

  
You can write a parse-html plugin for this, or you can manually manipulate
the parse data by writing a MapReduce program.

Enis Soztutar wrote:
  

4. If I set the crawl depth = 1, is the linkdb meaningless for the rest of
the crawling?
  
  

No, the linkdb is used in the indexing phase.



So if I use another indexer like Solr, do I need to do additional processing
on the page in order to keep the source link information (like adding the
source link information)?
  
Support for another indexing backend has not yet been committed (see
NUTCH-442), but even if it were, you would not need to do extra processing.
Just the usual steps:

   inject -> generate -> fetch -> updatedb -> invertlinks -> index
                ^__________________________|

(the generate -> fetch -> updatedb part of the cycle repeats for each round)

Enis Soztutar wrote:
  

5. Is there any method to keep Nutch from recrawling a page in the recrawl
script (e.g. not crawling a page that has not been updated since last time)?
Any information that could help me implement this?
  
  
With the adaptive crawl, after several cycles the fetch frequency of a page
will be automatically adjusted.



So if I keep on crawling based on the same crawldb, I will get this effect?
  

yes, exactly.

Enis Soztutar wrote:
  

Thank you for reading this long post, and for any answer or suggestion.
  
  

You're welcome.

Enis



Thank you for your kind help, it really helps a lot :)

  


Re: About link analysis and filter usage, and Recrawling

2008-03-11 Thread Enis Soztutar

Hi,

please see below

Vinci wrote:

Hi everybody,

I am trying to use Nutch to implement my spider algorithm... I need to get
information from specific resources, then schedule the crawling based on the
links it finds (i.e. Nutch will be a link analyzer as well as a crawler).

Question here:
1. How can I get the links in the linkdb? Is there any method other than
bin/nutch readlinkdb -dump?
  
The linkdb is composed of several MapFiles. You can use MapFile.Reader to
read it, but you can also use MapReduce to do the processing.

2. I want all of my pages crawled but not updated; however, I know I will do
the recrawling based on those crawled pages. Is there any method other than
dumping the crawldb every time?
  

Yes, you can process the crawldb as above, or use updatedb, etc.

3. If I need to process the crawled pages in a more flexible way, is it
better to dump the documents and process them without writing back, or to
write my own plugin for some phase? If I need to write a plugin, which phase
is the best point to implement my own extension?
  

This depends on what you want to do and with what kind of data. You should
be more specific.

4. If I set the crawl depth = 1, is the linkdb meaningless for the rest of
the crawling?
  

No, the linkdb is used in the indexing phase.

5. Is there any method to keep Nutch from recrawling a page in the recrawl
script (e.g. not crawling a page that has not been updated since last time)?
Any information that could help me implement this?
  
With the adaptive crawl, after several cycles the fetch frequency of a page
will be automatically adjusted.


Thank you for reading this long post, and for any answer or suggestion.
  

You're welcome.

Enis


Re: Hadoop distributed search.

2007-12-07 Thread Enis Soztutar

Dennis,

Have you tried using o.a.lucene.store.RAMDirectory instead of tmpfs?
Intuitively I believe RAMDirectory should be faster, shouldn't it? Do you
have any benchmarks comparing the two?
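
(For anyone who wants to try the comparison, a minimal sketch of the
RAMDirectory side using the Lucene 2.x API; the index path is illustrative:)

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class RamSearch {
  public static void main(String[] args) throws Exception {
    // copies every file of the on-disk index into the JVM heap,
    // so the heap must be sized to hold the whole index
    RAMDirectory ramDir = new RAMDirectory("/path/to/indexes/index");
    IndexSearcher searcher = new IndexSearcher(ramDir);
    System.out.println("docs loaded: " + searcher.maxDoc());
    searcher.close();
  }
}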


Dennis Kubes wrote:



Trey Spiva wrote:
According to a hadoop tutorial  
(http://wiki.apache.org/nutch/NutchHadoopTutorial) on wiki,


"you don't want to search using DFS, you want to search using local 
filesystems. Once the index has been created on the DFS you can 
use the hadoop copyToLocal command to move it to the local file 
system as such" ... "Understand that at this point we are not using 
the DFS or MapReduce to do the searching, all of it is on a local 
machine".


So my understanding is that hadoop is only good for batch index 
building, and is not proper for incremental index building and 
search. Is this true?


That is correct.  DFS for batch processing and MapReduce jobs.  Local 
servers (disks) for serving indexes.  Even better put local indexes 
(not segments, just indexes) in RAM.




The reason I am asking is that when I read the ACM article by Mike Cafarella
and Doug Cutting, it sounded to me like the concern was to make the index
structures fit in primary memory, not the entire crawled database. Did I
misunderstand the ACM article?


No, what they are saying is the more pages per index per machine on 
hard disk the slower the search.  Keeping the main indexes, but not 
the segments which hold raw page content, in RAM can speed up search 
significantly.


One way to do this if you are running on Linux is to create a tmpfs (which
is RAM) and mount the filesystem in RAM. Then your index acts normally to
the application but is essentially served from RAM. This is how we serve the
Nutch Lucene indexes on our web search engine (www.visvo.com), which is
~100M pages. Below is how you can achieve this, assuming your indexes are in
/path/to/indexes:



mv /path/to/indexes /path/to/indexes.dist
mkdir /path/to/indexes
cd /path/to
mount -t tmpfs -o size=2684354560 none /path/to/indexes
rsync --progress -aptv indexes.dist/* indexes/
chown -R user:group indexes

This would of course be limited by the amount of RAM you have on the 
machine.  But with this approach most searches are sub-second.


Dennis Kubes



Re: Question about nutch and solr

2007-12-07 Thread Enis Soztutar

Hi,

To clarify things, let me briefly explain Lucene and her children.

Lucene : an inverted indexing library.
Solr   : a kind of index server application that wraps and extends the
capabilities of Lucene.
Hadoop : an implementation of MapReduce and DFS.
Nutch  : a search engine built on top of Hadoop, Lucene, and Solr (very
soon).


Your architecture will very much depend on the schema of your data and the
way you will use it. You can store your data in DFS (Hadoop) and use Nutch
to build the index (with either Lucene or Solr), then serve the index using
either Solr or Nutch. But you have to copy the indexes to local disk
(serving indexes from DFS is very slow). I think it would be better if you
gave some more insight into your problem and your data.



zhang gaozhi wrote:

 Dear,

   Now, I am looking into Nutch, Solr, Lucene and Hadoop.
   Because Nutch can work with Hadoop and our application is special, we
want to use Nutch to index XML-format files and store the index files in
Hadoop. Then we would use Solr to search the index files in Hadoop.
  My question is whether Nutch can work well for that or not. If not, what
do we need to do?
  Thanks for your reply.

 thanks

 gaozhi

  


Re: Hadoop .15 and eclipse on windows

2007-11-09 Thread Enis Soztutar
I checked the usages of df in Hadoop for you. DF is used by many parts of
the system, including DFS and MapRed, but it will not run on a node that
does not run a TaskTracker or Datanode. Could you please check your
configuration to see if you're missing something? Can you confirm that you
only submit the job from the Windows machine?


Tim Gautier wrote:

Thanks for the reply, but I think you missed my point.  I've been
running nutch and Hadoop through eclipse under Windows for several
months now.  It never called Linux shell commands before, now it is.
Maybe it called df from some path I never hit, I don't know.  What I
do know is that I could do an entire Nutch crawl in eclipse on Windows
with Hadoop 0.13 and I can't even inject with the latest version in
trunk which is using Hadoop 0.15 because it calls df.  I realize I can
run it from the command line under cygwin, I just don't want to, I
want to use eclipse like I have been for months.

I feel like I'm missing something simple, but I can't figure out what
it is.  Anyone else have any ideas?

On Nov 9, 2007 7:00 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote:
  

Hadoop has been running df for a long time, since way before 0.13. You can
run Hadoop under Cygwin in Windows. Please refer to Hadoop's documentation.


Tim Gautier wrote:


I do my nutch development and debugging on a Windows XP machine before
transferring my jar files to a Linux cluster for actual production
runs.  This has worked fine in the past, but there seems to be a
problem now that we're using Hadoop .15.  Now when I run the injector
(presumably other classes as well) from Eclipse, Hadoop makes a shell
command call to "df" and of course doesn't find it on a Windows
machine so the job fails.  There has to be a way around this so I can
debug from Eclipse, anybody know what that might be?  Configuration
setting or something to tell it that I'm running on Windows?


  


Re: Hadoop .15 and eclipse on windows

2007-11-09 Thread Enis Soztutar
Hadoop has been running df for a long time, since way before 0.13. You can
run Hadoop under Cygwin in Windows. Please refer to Hadoop's documentation.


Tim Gautier wrote:

I do my nutch development and debugging on a Windows XP machine before
transferring my jar files to a Linux cluster for actual production
runs.  This has worked fine in the past, but there seems to be a
problem now that we're using Hadoop .15.  Now when I run the injector
(presumably other classes as well) from Eclipse, Hadoop makes a shell
command call to "df" and of course doesn't find it on a Windows
machine so the job fails.  There has to be a way around this so I can
debug from Eclipse, anybody know what that might be?  Configuration
setting or something to tell it that I'm running on Windows?

  


Re: Multiple Domains Search

2007-11-04 Thread Enis Soztutar

Hi,

Unfortunately the QueryParser used in Nutch will not parse queries of the
form site:<domain1> site:<domain2>, but I've hit the same problem and
started working on it. I will create a JIRA issue for this; you can follow
it there.


karthik085 wrote:

Hi,

To search a query within a particular domain in the index, I do:
 <query> site:<domain>

How do I search from multiple domains? Any other nifty options or plugins
that can help to achieve this?

Another way I can think of (not efficient, I believe) is to create different
combinations of site indexes and change the search location on the fly.

Any help is appreciated. Thanks.

Thanks,
Kartrthik
  


Re: distributed search server

2007-09-27 Thread Enis Soztutar
Yes, you have to do it manually for now, but it is not so complicated to 
reopen the index if it is changed, using IndexReader's methods.
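
A minimal sketch of that reopen logic (Lucene 2.x API; the index path is
illustrative):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class ReopeningSearcher {
  private final String indexPath = "/path/to/index";
  private IndexReader reader;
  private IndexSearcher searcher;

  public synchronized IndexSearcher getSearcher() throws Exception {
    if (reader == null) {
      reader = IndexReader.open(indexPath);
      searcher = new IndexSearcher(reader);
    } else if (!reader.isCurrent()) {   // the index changed on disk
      IndexReader newReader = IndexReader.open(indexPath);
      searcher.close();
      reader.close();
      reader = newReader;
      searcher = new IndexSearcher(reader);
    }
    return searcher;
  }
}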


We are using a start-stop daemon to start/stop the index servers. The daemon
can save the pid in a file, and then you can kill the process with the given
pid.


Milan Krendzelak wrote:

I think you have to do it manually.
I have a bash script under Linux that takes care of starting up the Nutch servers.
In order to reopen the index, I kill all Java processes manually (for now), but it would be good to have the same kind of script for shutting them down.
Actually, I am having the same problem and would be interested in any information about it.
 
Milan Krendzelak

Senior Software Developer
dotMobi Certified: http://dev.mobi/node/276  
 




From: charlie w [mailto:[EMAIL PROTECTED]
Sent: Wed 26/09/2007 13:28
To: nutch-user@lucene.apache.org
Subject: distributed search server



Is there a way to get a nutch search server to reopen the index in which it
is searching?  Failing that, is there a graceful way to restart the
individual search servers?

Thanks
Charlie



  


Re: How to treat # in URLs?

2007-08-13 Thread Enis Soztutar
Technically, the fragment is part of the URL, but foo and foo#bar point to
the same location, so it should be stripped out. Are you using the URL
normalizers? If not, could you please try them?
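
A plain-Java illustration of what stripping the fragment amounts to (this is
not Nutch's normalizer code, just the idea that foo and foo#bar reduce to
the same URL):

import java.net.URL;

public class FragmentStrip {
  public static void main(String[] args) throws Exception {
    URL u = new URL("http://127.0.0.1:8000/about.html#top");
    // getFile() returns path + query without the "#top" fragment
    String normalized =
        new URL(u.getProtocol(), u.getHost(), u.getPort(), u.getFile()).toString();
    System.out.println(normalized);   // http://127.0.0.1:8000/about.html
  }
}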


Carl Cerecke wrote:

Hi,

I noticed that urls with a # in them are not handled any differently 
to normal urls. See output of readdb:


http://127.0.0.1:8000/about.html    Version: 5
Status: 2 (db_fetched)
Fetch time: Thu Sep 13 14:41:55 NZST 2007
Modified time: Thu Jan 01 12:00:00 NZST 1970
Retries since fetch: 0
Retry interval: 2592000.0 seconds (30.0 days)
Score: 4.0
Signature: c79a4a20d6a19603120d1fdbaf19b0eb
Metadata: _pst_:success(1), lastModified=0

http://127.0.0.1:8000/about.html#top    Version: 5
Status: 2 (db_fetched)
Fetch time: Thu Sep 13 14:42:03 NZST 2007
Modified time: Thu Jan 01 12:00:00 NZST 1970
Retries since fetch: 0
Retry interval: 2592000.0 seconds (30.0 days)
Score: 4.0
Signature: c79a4a20d6a19603120d1fdbaf19b0eb
Metadata: _pst_:success(1), lastModified=0

I would have expected that, when doing an updatedb, the #foobar part 
of the URL would be stripped.


Is there a sensible reason for the current behaviour? Or have I found 
a bug?


Cheers,
Carl.



Re: getting document link graph

2007-07-24 Thread Enis Soztutar
The linkdb contains all the information about the web graph. After fetching
the segments, you should run bin/nutch invertlinks to build the linkdb,
which is a MapFile. The entries in the MapFile are <key, value> pairs, where
the keys are Text objects (containing URLs) and the values are Inlinks
objects. In fact, FYI, the linkdb can easily be processed by MapReduce jobs.
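
As an illustration, a minimal sketch of looking up a single URL's inlinks in
one linkdb part with MapFile.Reader.get() (paths are illustrative; a URL is
stored in only one part, so in practice you would check each part or use
Nutch's own linkdb reader tool):

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.util.NutchConfiguration;

public class InlinkLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader =
        new MapFile.Reader(fs, "crawl/linkdb/current/part-00000", conf);
    Inlinks inlinks = new Inlinks();
    if (reader.get(new Text("http://www.example.com/"), inlinks) != null) {
      for (Iterator it = inlinks.iterator(); it.hasNext();) {
        Inlink in = (Inlink) it.next();
        System.out.println(in.getFromUrl() + " anchor: " + in.getAnchor());
      }
    }
    reader.close();
  }
}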


DS jha wrote:

Hi -

I want to read the map of incoming and outgoing links of a document
and use that for some analysis purpose.  Does nutch store link graph
once fetch/parse/index is complete?

After browsing thru the code, it does seem that during document
parsing and storing, incoming and outgoing links are getting passed
around between objects but is that information available once the
process is complete - by reading segment or index information?

Thanks,
Jha



Re: IndexFilter

2007-07-18 Thread Enis Soztutar
Enabled plugins that implement IndexingFilter are run for each document to
generate the fields to index. The enabled plugins are listed in
conf/nutch-default.xml or conf/nutch-site.xml.


You can look at http://wiki.apache.org/nutch/IndexStructure.


Kai_testing Middleton wrote:

Not sure ... this is kind of an off-the-cuff reply, but Luke might give you 
that information (google for apache luke).

- Original Message 
From: Daniel Clark <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Tuesday, July 17, 2007 3:22:26 PM
Subject: IndexFilter

Which indexFilter plugin does Nutch use out-of-the-box?  Or how do I find
out?  I'm trying to figure out how the following fields are being indexed.

 


anchor

boost

content

digest

host

segment

site

title

tstamp

url

 

 









   


  


Re: Adding meta data to searched documents

2007-07-02 Thread Enis Soztutar
You can write indexing plugins. Please first read the (slightly outdated)
tutorial and then check http://wiki.apache.org/nutch/PluginCentral.
Optionally you may want to write HTML parse plugins, depending on the source
of the data.
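
A minimal sketch of such an indexing plugin (the filter() signature shown
matches the Nutch 0.9-era IndexingFilter interface, so check it against your
version; the lookupTranscript() helper and the "transcript" field name are
hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class TranscriptIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    String transcript = lookupTranscript(url.toString()); // hypothetical lookup
    if (transcript != null) {
      // indexed but not stored: searchable without bloating the index
      doc.add(new Field("transcript", transcript,
                        Field.Store.NO, Field.Index.TOKENIZED));
    }
    return doc;
  }

  private String lookupTranscript(String url) {
    return null; // e.g. read from a local file or database keyed by URL
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}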


Chris Hane wrote:
I am looking to use nutch to crawl/index a website.  A lot of the 
pages have videos on them.  We have transcripts for the videos that we 
would like to be included for indexing; but we do not want to put the 
transcripts on the web pages.


Is there a way to "add" this information to a given web page for 
purposes of indexing as part of the crawl process?  Maybe another 
point in the process before the index is generated?  I am hoping there 
is a point in the crawl process where I can add augmented content to a 
page in the nutch segment (rough thought based on very limited time 
spent looking at nutch).


We are comfortable using java and can write custom code as needed.  I 
would appreciate any pointers on where to look in the nutch code.


Thanks in advance,
Chris.



Re: Stemming with Nutch

2007-06-28 Thread Enis Soztutar



Doğacan Güney wrote:

On 6/28/07, Robert Young <[EMAIL PROTECTED]> wrote:

Hi,

Are the Nutch Stemming modifications available as a patch? I can't
seem to find anything on issue.apache.org


There is some sort of stemming for German and French languages
(available as plugin analysis-de and analysis-fr). I don't know how
well they work (or if they work). AFAIK, there is no support for
stemming English.
There is a PorterStemmer in Lucene, but it is not used in Nutch. You can
easily add this by overriding NutchDocumentAnalyzer.
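
A minimal sketch of the analyzer side using plain Lucene classes (wiring it
into Nutch still means overriding or replacing NutchDocumentAnalyzer, which
is not shown here):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class StemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // lowercase tokenization followed by Porter stemming
    return new PorterStemFilter(new LowerCaseTokenizer(reader));
  }
}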




Btw, I think we should revise nutch's document analysis system. For
example, analyzers for index-basic's fields are hard-coded in analysis
package (what happens if I don't use index-basic and use my own
index-mind-blowingly-awesome plugin?) . You either have to use all of
it or completely override it and use none of it. We should allow index
plugins to specify their analyzers per field. There are analysis-*
plugins but they work for documents in specific languages (what if I
don't want to use language identification? what if nutch can't figure
out what the language is?)


I strongly agree. Index-* plugins and analysis-* plugins are cross-dependent:
for every new field added by the indexing plugins, ALL the analysis plugins
have to be changed to analyze the new field, which breaks the golden rule. I
agree with the idea that index plugins should specify their analyzers.


Index plugins should also be able control how stuff like their field's
length norm is calculated (which currently is hard coded too and can't
be changed).

Oh and, if you are feeling up to it, any help in this area would be
much appreciated :).



Thanks
Rob






Re: Weird encoding problem

2007-06-26 Thread Enis Soztutar
I suggest you first open the index with Luke and check that the encoding is
detected correctly, and run a search from Luke to see if you get any
results. Then you may invoke org.apache.nutch.searcher.Query to see if your
query is parsed and translated correctly. Finally, you may check whether
Tomcat uses UTF-8 encoding.


Karol Rybak wrote:

Hello, I have set up Nutch to do some more testing. After indexing a couple
thousand pages I tried some searching. Everything works fine, however there
is one problem: I cannot search using Polish characters. I tried searching
for a query like "Materiały dydaktyczne" and got no results, and the text in
the search field changed into "Materiały dydaktyczne". Everything else is
fine; when I search for "dydaktyczne" results show up and the encoding is
OK. Do you have any idea what could be wrong?


Re: AW: AW: Combining standard Lucene and Nutch

2007-04-11 Thread Enis Soztutar

Michael Böckling wrote:
Yes, Nutch uses a Query class different from Lucene's. The query is also
parsed differently. What Nutch basically does is parse the query with
Query.parse, then run all the query plugins, which convert the Nutch query
to a Lucene boolean query. Then this Lucene query is sent to the index
servers, which use Lucene's searchers.



Does that mean that Nutch will choke on "original Lucene" search queries?
Since the same query is (in my case) fed to both Nutch for the static
content and Lucene for the dynamic stuff, that wouldn't work well, I'm
afraid.


  
No, you do not actually need to use Nutch's Query classes, but you need to
do some extra work in the index servers. I do not know your architecture,
but I assume you need to modify IndexSearcher to take a Lucene BooleanQuery
instead of an org.apache.nutch.searcher.Query.
- My working environment for the current search is Java 1.4.2 and Lucene
2.1. I guess I have to use Nutch 0.8 (since 0.9 switched to Java 1.5) and
hope it can cope with the newer Lucene version?

  
  

Nutch 0.9 uses lucene 2.1.



Yes, but Nutch 0.9 requires Java 5, and unfortunately I have to stick with
1.4.2.

Greetings and thanks a lot,

Michael

  




Re: AW: Combining standard Lucene and Nutch

2007-04-11 Thread Enis Soztutar

Michael Böckling wrote:
What you should do is compare the structure Nutch uses with the structure
you use, and somehow combine the two. In most of the fields, you should
converge to the Nutch version. Other than that, once the index is created by
Nutch, it is Lucene stuff. You can merge the indexes, run a MultiSearcher,
or open separate DistributedSearch$Clients and combine the results from
separate indexes on the fly. However, there is an issue about summaries. Do
you intend to use them?



I see. I don't think I can unify the index fields, since we use a very
granular field structure for our DB content. It would be ok to have the
results displayed on the web page separated, with the first paragraph
showing the DB search results and the second one for the Nutch results,
effectively running and querying the two indexes separately.
  

Then it is simpler to use Lucene and Nutch together.

Further issues:
- Are Lucene and Nutch Queries compatible? I've heard the "Query" class
hierarchy is different for Nutch. Basically, a query that works for Lucene
(maybe containing boolean operators, phrases etc.) should not throw an
exception or so in Nutch and return sensible results.
  
Yes, Nutch uses a Query class different from Lucene's. The query is also
parsed differently. What Nutch basically does is parse the query with
Query.parse, then run all the query plugins, which convert the Nutch query
to a Lucene boolean query. Then this Lucene query is sent to the index
servers, which use Lucene's searchers.


- I need to exclude things like header, footer and navigation from the
crawled pages and only index the content of a certain area. Can this be done
in Nutch? I found some vague hints pointing to HtmlParser and Plugins...
  

Yes, you can write an HTML parse plugin to parse only the desired content.

- My working environment for the current search is Java 1.4.2 and Lucene
2.1. I guess I have to use Nutch 0.8 (since 0.9 switched to Java 1.5) and
hope it can cope with the newer Lucene version?

  

Nutch 0.9 uses lucene 2.1.

Thanks a lot for your help so far!


  

Welcome :)

Regards,

Michael Böckling

  




Re: Combining standard Lucene and Nutch

2007-04-11 Thread Enis Soztutar

Michael Böckling wrote:

Hi!

  

Hi,

I know there is a MultiSearcher class, but it seems that Nutch is using a
very different index layout than Lucene, or am I wrong here? 
Nutch uses Lucene as an inverted index. Lucene does not impose an index
structure; you create the structure (I mean the fields) using Lucene. Nutch
stores some default fields in the index as well as extra fields from
indexing plugins. You can check out the structure of the index on the wiki:
http://wiki.apache.org/nutch/IndexStructure


What you should do is compare the structure Nutch uses with the structure
you use, and somehow combine the two. In most of the fields, you should
converge to the Nutch version. Other than that, once the index is created by
Nutch, it is Lucene stuff. You can merge the indexes, run a MultiSearcher,
or open separate DistributedSearch$Clients and combine the results from
separate indexes on the fly. However, there is an issue about summaries. Do
you intend to use them?






My end goal is
a list of results with the most relevant hits from both indexes at the top
positions.

How would you go about this?
Thanks a lot for your input!

Regards,

Michael

  




Re: [Nutch-general] Removing pages from index immediately

2007-04-05 Thread Enis Soztutar

Andrzej Bialecki wrote:

[EMAIL PROTECTED] wrote:

Hi Enis,




Right, I can easily delete the page from the Lucene index, though I'd
prefer to follow the Nutch protocol and avoid messing something up by
touching the index directly.  However, I don't want that page to
re-appear in one of the subsequent fetches.  Well, it won't
re-appear, because it will remain missing, but it would be great to
be able to tell Nutch to "forget it" "from everywhere".  Is that
doable? I could read and re-write the *Db Maps, but that's a lot of
IO... just to get a couple of URLs erased.  I'd prefer a friendly
persuasion where Nutch flags a given page as "forget this page as
soon as possible" and it just happens later on.


Somehow you need to flag those pages and keep track of them, so they have to
remain in the CrawlDb.


The simplest way to do this is, I think, through a scoring filter API 
- you can add your own filter, which during updatedb operation flags 
unwanted urls (by means of putting a piece of metadata in CrawlDatum), 
and then during the generate step it checks this metadata and returns 
the generateScore = Float.MIN_VALUE - which means this page will never 
be selected for fetching as long as there are other unfetched pages.


You can also modify the Generator to completely skip such flagged pages.

Maybe we should permanently remove the URLs that failed fetching k times
from the crawldb during the updatedb operation. Since the web is highly
dynamic, there can be as many gone sites as new sites (or slightly fewer).
As far as I know, once a URL is entered into the crawldb it will stay there
with one of the possible states: STATUS_DB_UNFETCHED, STATUS_DB_FETCHED,
STATUS_DB_GONE, STATUS_LINKED. Am I right?


This way Otis's case will also be resolved.



Re: Nutch Step by Step Maybe someone will find this useful ?

2007-04-05 Thread Enis Soztutar
Great work; could you just post these to the Nutch wiki as a step-by-step
tutorial for newcomers?


zzcgiacomini wrote:
I have spent some time playing with nutch-0 and collecting notes from the
mailing lists...
Maybe someone will find these notes useful and could point out my mistakes.

I am not at all a Nutch expert...
-Corrado






 0) CREATE NUTCH USER AND GROUP

Create a nutch user and group and perform all the following logged in as 
nutch user.
put this line in your .bash_profile

export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH

 1) GET HADOOP and NUTCH

Download the Nutch and Hadoop trunks, as explained at
http://lucene.apache.org/hadoop/version_control.html

(svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk)
(svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk)

 2) BUILD HADOOP

Ex: 


Build and produce the tar file
cd hadoop/trunk
ant tar

To build hadoop with native libraries 64bits proceed as follow :

A) download and install the latest LZO library
(http://www.oberhumer.com/opensource/lzo/download/)
Note: the currently available packages for FC5 are too old.


tar xvzf lzo-2.02.tar.gz
cd lzo-2.02
./configure --prefix=/opt/lzo-2.02
make install

B) compile native 64bit libs for hadoop  if needed

cd hadoop/trunk/src/native

export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
export JVM_DATA_MODEL=64

CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" 
./configure

cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
cp 
src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h 
src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
cp 
src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h 
src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

in config.h replace the line

#define HADOOP_LZO_LIBRARY libnotfound.so 


with this one

#define HADOOP_LZO_LIBRARY "liblzo2.so"
make 


 3) BUILD NUTCH

The nutch-dev nightly trunk now comes with hadoop-0.12.jar, but maybe you want to put in the latest nightly-build Hadoop jar:


mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
cd nutch/trunk
ant tar

 4) INSTALL

Copy and untar the generated .tar.gz file on the machines that will
participate in the engine activities.
In my case I only have two identical machines available, called myhost2 and myhost1.

On each of them I have installed the Nutch binaries under /opt/nutch, while I have decided to put the Hadoop
distributed filesystem in a directory called hadoopFs located on a large disk mounted on /disk10.



on both machines create the directory:
mkdir /disk10/hadoopFs/ 


copy hadoop 64bit native libraries  if needed

mkdir /opt/nutch/lib/native/Linux-x86_64

cp -fl hadoop/trunk/src/native/lib/.libs/* 
/opt/nutch/lib/native/Linux-x86_64

 5) CONFIG

I will use myhost1 as the master machine running the namenode and jobtracker
tasks; it will also run the datanode and tasktracker.
myhost2 will only run the datanode and tasktracker.

A) on both machines change the conf/hadoop-site.xml configuration file. Here are the values I have used:


   fs.default.name : myhost1.mydomain.org:9010
   mapred.job.tracker  : myhost1.mydomain.org:9011
   mapred.map.tasks: 40
   mapred.reduce.tasks : 3
   dfs.name.dir: /opt/hadoopFs/name
   dfs.data.dir: /opt/hadoopFs/data
   mapred.system.dir   : /opt/hadoopFs/mapreduce/system
   mapred.local.dir: /opt/hadoopFs/mapreduce/local
   dfs.replication : 2

   "The mapred.map.tasks property tell how many tasks you want to run in 
parallel.
This should be a multiple of the number of computers that you have. 
In our case since we are starting out with 2 computer we will have 4 map and 4 reduce tasks.


   "The dfs.replication property states how many servers a single file 
should be
   replicated to before it becomes available.  Because we are using 2 servers I have set 
   this at 2. 


   maybe you also want to change nutch-site.xml by adding the
http.redirect.max property with a different value than the default of 3:

   http.redirect.max   : 10

 
B) be sure that your conf/slaves file contains the names of the slave machines. In my case:


   myhost1.mydomain.org
   myhost2.mydomain.org

C) create directories for pids and log files on both machines

   mkdir /opt/nutch/pids
   mkdir /opt/

Re: Removing pages from index immediately

2007-04-05 Thread Enis Soztutar
Since Hadoop's map files are write-once, it is not possible to delete some
URLs from the crawldb and linkdb. The only thing you can do is create the
map files again without the deleted URLs. But running the crawl once more,
as you suggested, seems more appropriate. Deleting documents from the index
is just Lucene stuff.


In your case it seems that every once in a while, you crawl the whole 
site, and create the indexes and db's and then just throw the old one 
out. And between two crawls you can delete the urls from the index.


[EMAIL PROTECTED] wrote:

Hi,

I'd like to be able to immediately remove certain pages from Nutch (index, 
crawldb, linkdb...).
The scenario is that I'm using Nutch to index a single site or a set of 
internal sites.  Once in a while editors of the site remove a page from the 
site.  When that happens, I want to update at least the index and ideally 
crawldb, linkdb, so that people searching the index don't get the missing page 
in results and end up going there, hitting the 404.

I don't think there is a "direct" way to do this with Nutch, is there?
If there really is no direct way to do this, I was thinking I'd just put the 
URL of the recently removed page into the first next fetchlist and then somehow 
get Nutch to immediately remove that page/URL once it hits a 404.  How does 
that sound?

Is there a way to configure Nutch to delete the page after it gets a 404 for it 
even just once?  I thought I saw the setting for that somewhere a few weeks 
ago, but now I can't find it.

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share



  




Re: Wildly different crawl results depending on environment...

2007-04-02 Thread Enis Soztutar

Briggs wrote:

nutch 0.7.2

I have 2 scenarios (both using the exact same configurations):

1) Running the crawl tool from the command line:

   ./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5

2) Running the crawl tool from a web app somewhere in code like:

   final String[] args = new String[]{
   "-local", "/tmp/urlfile.txt",
   "-dir", "/tmp/somedir",
   "-depth", "5"};

   CrawlTool.main(args);


When I run the first scenario, I may get thousands of pages, but when
I run the second scenario my results vary wildly.  I mean, I get
perhaps 0,1,10+, 100+.  But, I rarely ever get a good crawl from
within a web application.  So, there are many things that could be
going wrong here

1) Is there some sort of parsing issue?  An xml parser, regex,
timeouts... something?  Not sure.  But, it just won't crawl as well as
the 'standalone mode'.

2) Is it a bad idea to use many concurrent CrawlTools, or even reusing
a crawl tool (more than once) within a instance of a JVM?  It seems to
have problems doing this. I am thinking there are some static
references that don't really like handling such use. But this is just
a wild accusation that I am not sure of.



Checking the logs might help in this case. From my experience, I can say
that there can be classloading problems when the crawl runs in a servlet
container. I suggest you also try running the crawl stepwise, by first
running inject, then generate, fetch, etc.






Re: Help on Activation of Subcollection at Indexing & searching

2007-04-02 Thread Enis Soztutar

prashant_nutch wrote:

Hi,
Thanks for your early response.
Finally I got search results using subcollection, but there are still some issues:
1. Can we search on more than 2 subcollections at the same time, with a command like
   subcollection:<name1> ...
Can we extend this to something like subcollection:<name1> <name2>, or how do we achieve this?

  
Actually you can, but it requires a little work. Nutch parses the query with
a predefined syntax using JavaCC-generated classes, namely
NutchAnalysis.java and NutchAnalysis.jj (also see Query.parse()).
Unfortunately the query syntax does not allow parsing multiple terms for a
field, and it also does not include a boolean OR operation. So a query like


  <field>:<term1>, <term2>

is not possible, nor is a query like

  (<field>:<term1> OR <field>:<term2>)

So for your case, you can add this functionality to NutchAnalysis and share
it with the community, so Nutch gains this wanted feature. Alternatively,
you can add the clauses to the Query object programmatically if you know the
field a priori.
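
A minimal sketch of that programmatic route (assuming the Nutch 0.8/0.9
org.apache.nutch.searcher.Query API; the query string and collection name
are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SubcollectionQuery {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Query query = Query.parse("java", conf);
    // restrict the search to one subcollection; the field name must match
    // the field written by the subcollection indexing plugin
    query.addRequiredTerm("collectionA", "subcollection");
    System.out.println(query);
  }
}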



2. In subcollection, if we want to add URLs after crawling, or remove them
from a subcollection, or merge two subcollections, do we have to do a new
crawl each time?

Can we manage our subcollections according to requirements without
recrawling again? (Like subcollections A and B: now we want to add some URLs
from A into B.)

 

  
Like the above, this is also not an issue of subcollection, but an issue of
Lucene herself. All the subcollection indexing extension does is add a
subcollection field to the document, with the subcollection names as
possible values. Thus you can do all the operations on the index as you
like. I suggest you learn more about Lucene by reading the wiki or one of
the books. Also, you can check out Solr, which manages the index more
dynamically.






Re: Help on Activation of Subcollection at Indexing & searching

2007-03-30 Thread Enis Soztutar

prashant_nutch wrote:

Thanks for your valuable comment on subcollection, but I still have some issues:
1. Enabling subcollection in nutch-site.xml means it applies at crawl time;
is it possible to apply it directly on the index (i.e. at search time)?
  
Nutch plugins can implement several extension points. Subcollection
implements both the IndexingFilter extension point, so that subcollections
are inserted into the index, and the QueryFilter extension point, so that
you can search on the subcollection field. This means that if you enable the
subcollection plugin in nutch-site.xml, both indexing and querying on the
subcollection field are enabled.

2. In your message, can you explain the comment that
  'subcollection also includes a query plugin'?

By enabling the subcollection plugin, you can search on the subcollection
field, for example:

 subcollection:<collection name>

I did the steps you mentioned, but when I execute a command like

subcollection:<collection name>

I still get 0 hits.
  
You should open your indexes in Luke or lucli and check whether the URLs are
indexed correctly.

Can you explain subcollection more deeply? Our aim is to search on specific
URLs.
  

Check the readme file in the src/plugin/subcollection directory.

Is there any other way than subcollection?
  
I assume that you want to search on a set of URLs (matching a regular
expression) rather than a single URL. If not, then there is no point in
using subcollection.




  




Re: Nutch dataset dirstructure

2007-03-30 Thread Enis Soztutar

pike wrote:

Hi

I'm new to nutch.
Can anyone point me to some documentation about
the directory structure Nutch creates and maintains
when crawling, indexing etc ? We're doing "whole-web"
crawls step by step. Since I have no reference, it's
hard to see whether crawling, merging, indexing, etc.
went OK.


thanks!
*-pike

Well, unfortunately there is not much documentation out there, but you
should start by reading the articles on the Nutch wiki first. For the index
structure you should seek help in the Lucene wiki, since Nutch uses Lucene
as an inverted index. To look at the generated indexes you can use the Luke
or lucli (command line) tools; lucli can be found in the contrib directory
of Lucene.


Nutch stores the crawl state of the URLs in the crawldb. The crawldb is an
instance of Hadoop's MapFile, which is a sequence of <key, value> pairs. The
keys in the crawldb are URLs and the values are CrawlDatum objects. A
MapFile uses two SequenceFiles, one for storing the data and the other for
indexing it. You should check the javadocs of these classes for further
info.


The linkdb is also stored as map files, mapping URLs to Inlinks objects.
For further info, you should really browse the javadocs and skim through the
code to get a deeper understanding of the system.


Re: Help on Activation of Subcollection at Indexing & searching

2007-03-30 Thread Enis Soztutar

prashant_nutch wrote:

Is subcollection useful for searching specific URLs?
How do we activate subcollection at indexing and searching time?

In conf/subcollections.xml, if we include our URL in the whitelist, can we
then search only on those URLs?

The command for searching on a subcollection:

subcollection:<name of subcollection> <word for specific URL>





<subcollections>
  <subcollection>
    <name>nutch</name>
    <id>nutch</id>
    <whitelist>
      http://lucene.apache.org/nutch/
      http://wiki.apache.org/nutch/
    </whitelist>
    <blacklist></blacklist>
  </subcollection>
</subcollections>





Can anybody explain how the overall thing should work?
Is it useful for specific URL searching? (We are using Nutch 0.8.1.)

  
Subcollection is a very useful way to group a set of URLs and assign a label
to them. You can use it to limit searching to certain URLs.


You should first enable subcollection in the nutch-site.xml file.
Then you should add collections to the conf/subcollections.xml file.
After indexing, the documents with matching URLs should have the
subcollection field in the index.
After that, since subcollection also includes a query plugin, you can do
searches like

 java subcollection:nutch

to limit the search to the nutch collection. You can consult the readme file
in the plugin's directory.








Re: Wikia Search Engine? Anyone working on it?

2007-03-26 Thread Enis Soztutar

Sean Dean wrote:

I've been following it, but haven't posted anything over there. Honestly, if you read a lot of the "public" content in the forum and mailing list, it provides you with absolutely nothing in terms of what they will be doing.

Jimmy Wales is still running 100% of the show, and every now and then you see some article with him boosting his "idea". I haven't seen any power or "commands" given to the general (unorganized) community.

They might turn out to surprise me, but so far it's been ALL public relations. I manage a search engine based on Nutch and Lucene and it's still bigger than theirs; too bad I can't wheel and deal the same PR monkeys.

My search engine is more proof of concept (and a hobby) than anything else, but maybe one day it will generate more than the $20 per month in Google Ad revenue it does now. A free tip for budding search engine creators: go vertical or go home.*

* home = http://www.google.com/


 
- Original Message 

From: Dennis Kubes <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Sunday, March 25, 2007 1:37:02 AM
Subject: Wikia Search Engine? Anyone working on it?


Just a curiosity.  I saw the new article about the Wikia search engine 
in Fast Company magazine.  Is anybody here working with that project or 
planning on working with it?


Personally I am working on a separate search project but I keep seeing 
that they are going to use Nutch and Lucene.  I haven't really seen 
anybody that has been active on the lists say they are going to be 
involved in the project though?  What is everyone's interest level on this?


Dennis Kubes
  
I also follow the list without posting. As far as I can tell, the search-i
mailing list is not *that* active for such a community. The list seems to be
used for idea and link exchange. Since Nutch is a possible solution for
Wikia search, maybe we should be more active in the project. Also, I believe
there are some people in the Nutch community who have precious hands-on
experience with an online search engine, and I encourage them to take part
in Wikia search.





Re: help needed : filters in regex-urlfilter.txt

2007-03-21 Thread Enis Soztutar

cha wrote:

Hi,

I want to ignore the following urls from crawling

for eg.

http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
http://www.example.com/stores/abcd/merch-cats/abcd.*
http://www.example.com/stores/abcd/merch/abd.*


I have used the regex-urlfilter.txt file and negated the following URLs:


# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
-http://([a-z0-9]*\.)*example.com/stores/.*/merch-cats-pg\.*
-http://([a-z0-9]*\.)*example.com/stores/.*/merch-cats\.*
-http://([a-z0-9]*\.)*example.com/stores/.*/merch\.*

The above filters still don't filter all the URLs.

Is there any way to solve this? Any alternatives?

Awaiting,

Cha



  

Did you enable urlfilter-regex plugin in your configuration?





Re: WARN QueryFilters - QueryFilter: RecommendedQueryFilter :names no fields.

2007-03-21 Thread Enis Soztutar

Ratnesh,V2Solutions India wrote:
Hi,
when I deployed the plugin inside the plugin directory of Nutch in Tomcat, I
got the following warning messages:
one is java.lang.ArrayIndexOutOfBoundsException: 0
and another is RecommendedQueryFilter: names no fields.
(deleted the rest)
  


Hi, you should define the fields to be queried in the plugin's 
plugin.xml file.

For example, take a look at src/plugin/query-basic/plugin.xml:

  <implementation id="BasicQueryFilter"
                  class="org.apache.nutch.searcher.basic.BasicQueryFilter">
    <parameter name="fields" value="DEFAULT"/>
  </implementation>

Here, the plugin declares that it responds to the DEFAULT field. You should
add the parameter to your own plugin.xml file as:

  <parameter name="fields" value="fieldname1,fieldname2,fieldname3"/>

where fieldname1|2|3 are the names of the fields that your query filter
builds the query upon.






Re: Any way for removing pages with same title in index?

2007-03-21 Thread Enis Soztutar
qi wu wrote:
> Hi,
> I found many pages with the same title , page contents are almost same. I 
> would like to index the pages with the same title only once.How can I 
> recognize the pages with same title during indexing process?
> How do nutch remove pages with same page content and in which class/package 
> can I find the code? 
>
> Thanks
> -Qi
>   
Hi,

Normally, in the Nutch processing sequence, after indexing you can run the
dedup command to delete the duplicate entries from the index. The
DeleteDuplicates class does this in a two-phase manner: in the first phase
the documents with the same URL are deleted, and in the second the documents
with the same content are deleted. In your case, I assume that the document
URLs are different but the contents are "nearly the same". Document
similarity is computed using either MD5Signature or TextProfileSignature.
MD5Signature computes a value based on the content of the page, but if the
pages' contents are not exactly the same, it will generate distinct
signatures. However, TextProfileSignature generates a signature based on the
most frequent terms of the content, so pages with similar content will
generate the same signature.

I can recommend two options. The first is to use TextProfileSignature (you
can change the signature from the configuration); the other is to modify the
DeleteDuplicates code to delete duplicates by title. IMO the former method
is more sensible.


Re: extracting urls into text files

2007-03-20 Thread Enis Soztutar

cha wrote:

First of all thanks for your reply.
  

you're welcome.


I'm really confused, pardon me. I don't know whether I need to put the given
code into a new class in the Nutch directory. Do I have to import other
classes or packages? Is there anything I need to take care of?
  
I suggest you download Eclipse and set up the project using the tutorial on
the Nutch wiki about running Nutch in Eclipse. Then, for example, create a
new class in the org.apache.nutch.tools package and paste in the previously
mentioned code.


   // here fs is an instance of FileSystem, seqFile is a Path to the crawldb
   MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);

then in the loop change the line

   out.println(key);

to

   out.println("<url>" + key + "</url>");


I have tried creating a new separate class in the Nutch directory, but it
gives lots of errors about packages/classes not being found. I'm still
trying to figure out what's wrong there.

Secondly, how am I able to read the URLs from the crawldb once the class is
running? I have no idea how to figure this out.

How can I fit my URL output into some XML format, i.e.

<url>http://www.example.com/</url>
<url>http://www.example1.com/</url>
...

So can you please explain to me how I should do this?

Thanks a lot for your time.
  

Well, there is nothing more I can do except write the code myself :) You
could first try to become more familiar with Java programming if need be.
Good luck!

Cheers,
Cha

Enis Soztutar wrote:
  

cha wrote:


Thanks enis,

am getting some idea from that..
Can you tell me in which class i should implement that.
I havent have hadoop install on my box.

  
  
Just  make a new class in nutch and put the code there : ) As long as 
you have hadoop jar in your classpath, you do not need to checkout the 
hadoop codebase.







  




Re: extracting urls into text files

2007-03-19 Thread Enis Soztutar

cha wrote:

Thanks Enis,

I am getting some idea from that.
Can you tell me in which class I should implement that?
I don't have Hadoop installed on my box.

  
Just make a new class in Nutch and put the code there :) As long as you have
the Hadoop jar in your classpath, you do not need to check out the Hadoop
codebase.




Re: extracting urls into text files

2007-03-19 Thread Enis Soztutar
The crawldb is a serialization of Hadoop's org.apache.hadoop.io.MapFile.
This structure contains two SequenceFiles, one for data and one for the
index. This is an excerpt from the javadoc of the MapFile class:

  A file-based map from keys to values.

  A map is a directory containing two files, the data file, containing all
  keys and values in the map, and a smaller index file, containing a
  fraction of the keys. The fraction is determined by
  {@link Writer#getIndexInterval()}.

The MapFile.Reader class is for reading the contents of the map file. Using
this class, you can enumerate all the entries of the map file, and since the
keys of the crawldb are Text objects containing URLs, you can just dump the
keys one by one to another file. Try the following:



    MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);

    Class keyC = reader.getKeyClass();
    Class valueC = reader.getValueClass();

    while (true) {
      WritableComparable key = null;
      Writable value = null;
      try {
        key = (WritableComparable) keyC.newInstance();
        value = (Writable) valueC.newInstance();
      } catch (Exception ex) {
        ex.printStackTrace();
        System.exit(-1);
      }

      try {
        if (!reader.next(key, value)) {
          break;
        }
        out.println(key);
        out.println(value);
      } catch (Exception e) {
        e.printStackTrace();
        out.println("Exception occured. " + e);
        break;
      }
    }

This code is just for demonstration; of course you can customize it for your
needs, for example printing in XML format. You can check the javadocs of the
CrawlDatum, CrawlDb, Text, MapFile, and SequenceFile classes for further
insight.



cha wrote:

Hi Enis,

I still can't figure out how it can be done. Can you explain it in more
detail, please?

Regards,
Chandresh

Enis Soztutar wrote:
  

cha wrote:


hi sagar,

Thanks for the reply.

Actually am trying to digg out the code in the same class..but not able
to
figure it out from where Urls has been read.

When you dump the database, the file contains :

http://blog.cha.com/    Version: 4
Status: 2 (DB_fetched)
Fetch time: Fri Apr 13 15:58:28 IST 2007
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.062367838
Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
Metadata: null

I figured it out rest of the things but not sure how the Url name has
been
read..

I just want plain urls only  in the text file..It is possible that i can
use
to write url in some xml formats..If yes then how?

Awaiting,

Chandresh

  
  
Hi, the crawldb is actually a map file, which has URLs as keys (Text class)
and CrawlDatum objects as values. You can write a generic map file reader
which extracts the keys and dumps them to a file.








  


Re: How to reslove ?? java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required

2007-03-16 Thread Enis Soztutar

Ratnesh,V2Solutions India wrote:

Hi,
  I'm trying to run nutch in eclipse for my simple application, and I'm
getting this error about scoring
plugins. My plugin.includes var contains scoring-opic and the plugin exists
in
the plugins directory, anyone have any ideas what I should look at? Thanks!


  
You should add the conf directory to the Eclipse source paths, so that it
can see the configuration. You can set the logging level to INFO and then
check the log for the added plugins, to see whether scoring-opic is indeed
loaded. You can also try running from the command line to see whether the
problem is in Eclipse or in the configuration.


Re: When can I delete segments? (still usefull after indexing?)

2007-03-16 Thread Enis Soztutar

cybercouf wrote:

If I'm not wrong, segments are used by Nutch to store parsed data, then to
update the crawldb, and finally to build an index.

But when the crawl is finished, for the next recrawl Nutch only needs the
last crawldb, not my old segments?
And for building the new index, it only needs my new indexes and the old
index, not the old segments.
(And it seems that for the search engine part, segments are used just to
show the cached copy of a page?)

It would be a nice space saving to delete the segments, but is my argument
right?
  
Well, your argument is actually not correct. The crawldb only holds
information about the crawl status of the URL, not the contents, and in the
index the contents of the URL are not stored, just indexed. So how would you
show summaries without the segments? You can delete the segments only if you
do not need them for cached results or summaries.






Re: extracting urls into text files

2007-03-16 Thread Enis Soztutar

cha wrote:

hi sagar,

Thanks for the reply.

Actually I am trying to dig out the code in the same class, but I am not
able to figure out where the URLs are read from.

When you dump the database, the file contains :

http://blog.cha.com/    Version: 4
Status: 2 (DB_fetched)
Fetch time: Fri Apr 13 15:58:28 IST 2007
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.062367838
Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
Metadata: null

I figured out the rest of the things but am not sure how the URL name is
read.

I just want plain URLs in the text file. Is it possible to write the URLs in
some XML format? If yes, then how?

Awaiting,

Chandresh

  
Hi, the crawldb is actually a map file, which has URLs as keys (Text class)
and CrawlDatum objects as values. You can write a generic map file reader
which extracts the keys and dumps them to a file.





Re: Hi What is the use of refine-query-init.jsp,refine-query.jsp

2007-03-12 Thread Enis Soztutar

inalasuresh wrote:

Hi ,

I uncommented refine-query.jsp and refine-query-init.jsp in search.jsp.
I searched for a bike keyword and it gave results.
Before that I tried to run the application both with and without the
comments, but that gave the same result.
So please, can anyone suggest to me

what the use of refine-query-init.jsp and refine-query.jsp is,

and what the end result of using these JSPs without the comments is?

thanx & regards
suresh 
  
These two JSP files are part of the ontology extension point. Basically,
plugins extending this extension point (currently the ontology plugin)
implement two functions, getSynonyms() and getSubclasses(). The ontology
plugin thus provides synonyms (from WordNet) and subclasses from the defined
ontologies for search query refinement.

You should enable the ontology plugin and add some ontology URL to the
configuration. You can check the ontology plugin's readme file.





Re: Hi what is the use of subcollections.xml

2007-03-12 Thread Enis Soztutar

inalasuresh wrote:

Hi ,
Can anyone help me? I am new to Nutch.

What is the use of subcollections.xml, and when is it called?
Please respond to my query.

thanx & regards
suresh..
  

Hi,

Subcollections is a plugin for indexing the urls matching a regular 
expression and subcollections.xml is the configuration file it uses.



<subcollections>
  <subcollection>
    <name>nutch</name>
    <id>nutch</id>
    <whitelist>http://lucene.apache.org/nutch/</whitelist>
    <blacklist></blacklist>
  </subcollection>
</subcollections>

When this plugin is enabled, Nutch adds a field named subcollection, with
the value nutch, to the index for the URL http://lucene.apache.org/nutch/.
Refer to the plugin's readme file.


Re: Arabic language in Nutch

2007-03-02 Thread Enis Soztutar

Munir wrote:
Can you please tell me if it is possible to use NGramProfile to create an
Arabic profile? If so, how? I tried to run this command but got an error:

java org.apache.nutch.analysis.lang.NGramProfile -create

error: syntax error near unexpected token `<'

And how will I create the ar.ngp file?

Please help me use the analysis for Arabic. Can you tell me the steps one by
one?

I am using Nutch 0.9-dev on Linux with Tomcat 5.5.20.
thanks in advance

  

Yes, you can.

Since NGramProfile uses java.lang.String, which is UTF-16, you can create an
n-gram profile for Arabic (I suppose the Arabic character set is
representable in UTF-16). You should not use '<' and '>' but instead give
the command as:

java org.apache.nutch.analysis.lang.NGramProfile -create ar arabic windows-1256

Linux uses < and > for redirecting standard input and output.






Re: How can I check (from log file, etc) weather analyzer-(fr|th) is in use?

2007-02-06 Thread Enis Soztutar

Vee Satayamas wrote:

Hello,

How can I check (from a log file, etc.) whether analyzer-th is in use? I
have already modified nutch-site.xml as follows:


<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|analysis-(fr|th)|analysis-xx|lib-lucene-analyzers|scoring-opic|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Plugin</description>
</property>


Regards,
Vee


Hi,

Search for log entries of the form:

INFO  indexer.Indexer -  Indexing [] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer

or whatever analyzer you use. You should enable INFO-level logging for the
Indexer class.


Re: Nutch content with Lucene search

2007-01-29 Thread Enis Soztutar

Gilbert Groenendijk wrote:
Thank you (and Brian) for your answers. I noticed this too, but I want to
get the content with the Java API with Lucene 2.0. If it is impossible, I
will have to write some extensions for my current code, but I'd rather not.
I guess the problem is the unstored property. Is there any config property
available for that?

On 1/27/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:



1. Open your index in Luke

2. click on the documents tab

3. click on the next arrow to move to the first document

4. then click on the reconstruct button.

You shall see the content field data in the right pane

HTH

-Original Message-
From: Gilbert Groenendijk [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 27, 2007 8:34 PM
To: nutch-user@lucene.apache.org
Subject: Nutch content with Lucene search

Hello,

Today I created a simple index with Nutch from the command line. After 
that I copied the index to another machine to use it in a plain Lucene 
environment, without Nutch. Fetching the URL and title works pretty 
well, but how can I get the content? If I take a look in Luke, the 
content field is not stored or tokenized, but in nutch-default.xml and 
nutch-site.xml I have defined:


<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>


It doesn't seem to work. Any ideas?


--
Gilbert Groenendijk
__






Just change the 72nd line of BasicIndexingFilter in the index-basic plugin from

doc.add(new Field("content", parse.getText(), Field.Store.NO, 
Field.Index.TOKENIZED));


to

doc.add(new Field("content", parse.getText(), Field.Store.YES, 
Field.Index.TOKENIZED));



and you are done. But remember that you do not need to store the content 
to search it.
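Once the content field is stored and the pages re-indexed, a minimal 
sketch of reading it back with the plain Lucene 2.0 API (the index path is 
an example; "url" and "content" are the standard index-basic fields):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DumpStoredContent {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;          // skip deleted documents
      Document doc = reader.document(i);
      System.out.println(doc.get("url"));
      System.out.println(doc.get("content"));     // null unless stored with Field.Store.YES
    }
    reader.close();
  }
}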




Re: Can I generate nutch index without crawling?

2007-01-25 Thread Enis Soztutar

Scott Green wrote:

On 1/24/07, Sean Dean <[EMAIL PROTECTED]> wrote:

What exactly are you looking to do?

If you don't crawl for anything, then what data are you looking to 
index?


You can certainly take some other person's Nutch segment (that they 
crawled) and then index it yourself, on your machines.



Hi

My requirement is only for debugging. Let pseudocode show what I 
need exactly:


//File content = new File()
String[] content = new String[]{".", "..."};
debugTool.generateIndex(content);



- Original Message 
From: Scott Green <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Tuesday, January 23, 2007 12:08:31 PM
Subject: Can I generate nutch index without crawling?


Hi,

I am now debugging the Nutch searcher and wondering whether I can
generate a Nutch index without crawling. If yes, can you give me some hints? Thanks.



Nutch uses Lucene for indexing, so you can use the Lucene API to create an 
index from any content:

1. open an index writer
2. create documents
3. add the documents to the index.
4. close the indexwriter
5. open the indexes with indexReader if you want.


You can look at the Lucene API documentation, or the book Lucene in Action.
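For instance, a minimal, self-contained sketch of those five steps against 
the Lucene 2.x API of that era (the index path, field name, and sample 
strings below are arbitrary examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class DebugIndexTool {
  public static void main(String[] args) throws Exception {
    String[] content = { "first debug document", "second debug document" };

    // 1. open an index writer (third argument true = create a new index)
    IndexWriter writer = new IndexWriter("debug-index", new StandardAnalyzer(), true);

    // 2-3. create documents and add them to the index
    for (int i = 0; i < content.length; i++) {
      Document doc = new Document();
      doc.add(new Field("content", content[i], Field.Store.YES, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }

    // 4. close the index writer
    writer.optimize();
    writer.close();

    // 5. open the index again and run a query against it
    IndexSearcher searcher = new IndexSearcher("debug-index");
    Hits hits = searcher.search(new TermQuery(new Term("content", "debug")));
    System.out.println(hits.length() + " hits");
    searcher.close();
  }
}

Keep in mind this produces a plain Lucene index for debugging the search 
side; Nutch's own web application additionally expects segment data next 
to the index, so this is mainly useful when you query the index directly.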






Re: Boolean searches, again

2007-01-24 Thread Enis Soztutar

Nicolás Lichtmaier wrote:

Now I know that Nutch doesn't support boolean queries. I've found this:

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06320.html

But this seems to be for a previous version of Nutch.

Could someone give me a hint about conducting a boolean search by 
using the Lucene/Nutch APIs directly? Just some starting points to 
look at.


Thanks!



Hi,

Nutch does not support boolean queries the way Lucene does. Only the minus 
operator (-) is supported, to exclude words from the search. If you want 
boolean query support, you should modify Query.java. In this class there 
are nested classes called Clause, Term and Phrase; a clause is either a 
term or a phrase. A Query is constructed by the Query.parse() method, 
which delegates to NutchAnalysis.parseQuery(). NutchAnalysis is generated 
from NutchAnalysis.jj, the JavaCC grammar that defines the lexical 
analysis and parsing. Finally, the QueryFilters.filter() method runs all 
query filters over the Query, and these filters convert the Nutch Query 
to a Lucene BooleanQuery. You should definitely check query-basic for 
this.


To add boolean query support (esp. OR ) you need to modify all the above 
classes in some way : )


Alternatively, you can construct the Lucene BooleanQuery yourself and pass 
it to the index servers, bypassing the Nutch Query class.
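For that last approach, a minimal sketch of building and running a Lucene 
BooleanQuery directly against a Nutch-built index (the index path is an 
example; "content" is the standard index-basic field), with OR expressed 
via BooleanClause.Occur.SHOULD:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class BooleanSearchExample {
  public static void main(String[] args) throws Exception {
    // "linux OR bsd" over the content field; this bypasses Nutch's Query
    // and QueryFilters entirely, so no Nutch query filters are applied.
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content", "linux")), BooleanClause.Occur.SHOULD);
    query.add(new TermQuery(new Term("content", "bsd")), BooleanClause.Occur.SHOULD);

    IndexSearcher searcher = new IndexSearcher("crawl/index");
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " hits");
    searcher.close();
  }
}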




Re: How to index and return files names ?

2007-01-10 Thread Enis Soztutar

Alan Tanaman wrote:

Arnaud,

Absolutely.  As Nutch comes, the url field is searchable (and tokenized).
You predicate the search to a specific field using a colon, for example by
typing

url:motherboard or url:"unix shell"

The default search field (when no predicate is specified) is content.

Generally the Lucene search syntax is supported (although I believe there
are Nutch specific issues):
http://lucene.apache.org/java/docs/queryparsersyntax.html

Best regards,
Alan
_
Alan Tanaman
iDNA Solutions
Tel: +44 (20) 7257 6125
Mobile: +44 (7796) 932 362
http://blog.idna-solutions.com

-Original Message-
From: Arnaud Goupil [mailto:[EMAIL PROTECTED] 
Sent: 10 January 2007 10:04

To: nutch-user@lucene.apache.org
Subject: How to index and return files names ?

Hi,

I would like Nutch to return results when search terms
are found in the name of files known by the index.

For example, my http location indexed by nutch
contains various files, named :


computer security.pdf
unix shell.pdf
motherboard specifications.pdf


If I search "motherboard", I want Nutch to return a
result pointing to my third document, even if this
document does not contain the word "motherboard", only
because it's in the name of the file.

Is there a way to do this ?

Thanks




  
As Alan suggested, you should search the url field. To search it 
explicitly you should include the query-url plugin, but query-basic also 
queries the url field even without the url: prefix in the query.
Also, I suggest using the URLTokenizer from 
http://issues.apache.org/jira/browse/NUTCH-389, which tokenizes URLs 
better.






Re: Plugins for features

2007-01-03 Thread Enis Soztutar

karthik085 wrote:

What nutch plugins are available, that can do a similar job to these
following Google features? (More about google features:
http://www.google.com/advanced_search?hl=en)
* File format :
* Date
* Domain
* Topic-specific searches (Web/Images/Video...)
* Search within results
* Q/A (For example, 'weather 60004' gives weather data for Arlington Height,
IL)
* Suggest
* Did you mean?
* Similar pages
* Analytics

Are there any of these features already implemented in Nutch? Any other way,
without using plugins? With what version does these plugins work with?
  

Hi,

Well, not all of them are implemented in Nutch. This is because some of 
the tasks are very challenging, and some could be projects in 
themselves.


To start with, you can index file formats and dates by using the 
index-more plugin, and these can be queried with the query-more plugin. 
Topic-specific searches can be imitated by searching on MIME-type 
fields, though this is not a straightforward solution. Searching within 
results is not implemented either, although it would not be difficult.
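For example, with index-more and query-more enabled, a query along the 
following lines should restrict results by MIME type (the exact field 
syntax may differ between versions, so check the query-more plugin 
sources):

nutch tutorial type:application/pdf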



Question answering is a broad topic in itself; Hakia and Start are two 
references. But as far as I understand, by Q/A you refer to Google's 
solution: Google forwards the query to, say, a weather server or a 
finance server via a query dispatcher and displays those results along 
with the regular query results. To implement such a feature, you would 
need the weather, finance, or music data yourself. As far as I know, 
this is not one of the project goals of Nutch.


A spell checker is implemented under the contrib/web2 directory.




Re: Crawling from a different "conf" directory location.

2006-12-25 Thread Enis Soztutar

Julien wrote:

Hello,

just do a :
export NUTCH_CONF_DIR=/_your_conf_path/

Julien

Nearly all the classes used for crawling (Injector, Generator, Fetcher, 
Indexer, etc.) extend the org.apache.hadoop.util.ToolBase class, which 
lets them take some optional generic command-line arguments. Below is 
the javadoc of the class:


This is a base class to support generic command options.
* Generic command options allow a user to specify a namenode,
* a job tracker etc. Generic options supported are
* -conf <configuration file>      specify an application configuration file
* -D <property=value>             use value for given property
* -fs <local|namenode:port>       specify a namenode
* -jt <local|jobtracker:port>     specify a job tracker
*
* The general command line syntax is
* bin/hadoop command [genericOptions] [commandOptions]
*
* For every tool that inherits from ToolBase, generic options are
* handled by ToolBase while command options are passed to the tool.
* Generic options handling is implemented using Common CLI.
*
* Tools that inherit from ToolBase in Hadoop are
* DFSShell, DFSck, JobClient, and CopyFiles.
*
* Examples using generic options are
* bin/hadoop dfs -fs darwin:8020 -ls /data
* list /data directory in dfs with namenode darwin:8020
* bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
* list /data directory in dfs with namenode darwin:8020
* bin/hadoop dfs -conf hadoop-site.xml -ls /data
* list /data directory in dfs with conf specified in hadoop-site.xml
* bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
* submit a job to job tracker darwin:50020
* bin/hadoop job -jt darwin:50020 -submit job.xml
* submit a job to job tracker darwin:50020
* bin/hadoop job -jt local -submit job.xml
* submit a job to local runner
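So, besides exporting NUTCH_CONF_DIR as Julien suggests, you can point an 
individual tool at an alternate configuration file with the generic -conf 
option. A sketch (paths are placeholders):

export NUTCH_CONF_DIR=/_your_conf_path_/
bin/nutch generate crawl/crawldb crawl/segments -topN 100

# or, per the ToolBase javadoc above, pass a config file to a single tool:
bin/nutch inject -conf /_your_conf_path_/nutch-site.xml crawl/crawldb urls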


Re: Written a plugin: now nutch fails with an error

2006-11-16 Thread Enis Soztutar

Nicolás Lichtmaier wrote:

I've written a plugin and now nutch fails with an error:

# bin/nutch generate /var/lib/tomcat/crawl/crawldb 
/var/lib/tomcat/crawl/segments -topN 100

topN: 100
Generator: starting
Generator: segment: /var/lib/tomcat/crawl/segments/20061116101759
Generator: Selecting best-scoring urls due for fetch.
Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
   at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
   at org.apache.nutch.crawl.Generator.main(Generator.java:395)

But I don't know how to debug this? How do I see more debug output?

Any help would be appreciated. Thanks!


Nicolas, it would be easier to help if you were more specific: which 
extension point did you extend, and which version of Nutch/Hadoop are 
you using? The conf/log4j.properties file controls the logging 
configuration; setting the log level to INFO or DEBUG may help.


Re: near duplicates

2006-10-30 Thread Enis Soztutar

John Casey wrote:

On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:


Find Me wrote:
> How to eliminate near duplicates from the index? Someone suggested that I
> could look at the TermVectors and do a comparison to remove the
> duplicates.

As an alternative you could also have a look at the paper "Detecting
Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark
Manasse, Marc Najork.



Another good reference would be Soumen Chakrabarti's reference book, 
"Mining
the Web - Discovering Knowledge from Hypertext Data",2003 and the 
section on

shingling and the elimination of near duplicates. Of course I think this
works at the document level rather than at the term vector level but it
might be useful to prevent duplicate documents from being indexed
altogether.


> One major problem with this is the structure of the document is
> no longer important. Are there any obvious pitfalls? For example: Document
> A being a subset of Document B but in no particular order.

I think this case is pretty unlikely. But I am not sure whether you can
detect
near duplicates by only comparing term-document vectors. There might be
problems with documents with slightly changed words, words that were
replaced
with synonyms...

However, if you want to keep some information on the word order, you might
consider comparing n-gram document vectors. That is, each dimension in the
vector does not only represent one word but a sequence of 2, 3, 4, 5...
words.




would this involve something like a window of 2-5 words around a 
particular

term in a document?

Cheers,

Isabel





DeleteDuplicates removes documents having the same digest or the same 
URL. If you use TextProfileSignature instead of MD5Signature, it will 
also remove near-duplicate documents. The MD5Signature class sets the 
digest to the MD5 of the whole content, whereas TextProfileSignature 
sets the digest to the MD5 of the significant terms only. You should 
check the class for implementation details, and also look at 
SignatureFactory for how to change the configuration.
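Configuration-wise, switching the signature implementation should be a 
matter of overriding the db.signature.class property in nutch-site.xml 
(property name as in nutch-default.xml; double-check it against your 
version):

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Use text-profile signatures so near-duplicates collapse to the same digest.</description>
</property>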




Re: Problem in URL tokenization

2006-09-27 Thread Enis Soztutar

Vishal Shah wrote:

Hi,
 
   If I understand correctly, there is a common tokenizer for all fields

(URL, content, meta etc.). This tokenizer does not use the underscore
character as a separator. Since a lot of URLs use underscore to separate
different words, it would be better if the URLs are tokenized slightly
differently from the other fields. I tried looking at the
NutchDocumentAnalyzer and related files, but can't figure out a clear
way to implement a new tokenizer for URLs only. Any ideas as to how to
go about doing this?
 
Thanks,
 
-vishal.


  
Hi, it is not straightforward to implement this without modifying the 
default tokenizing behavior.

First, copy NutchAnalysis.jj to URLAnalysis.jj (or whatever name you 
like) and change

| <#WORD_PUNCT: ("_"|"&")>

to:

| <#WORD_PUNCT: ("&")>

and recompile with JavaCC.


Then copy NutchDocumentTokenizer to URLTokenizer, and change the 
NutchAnalysisTokenManager references to URLAnalysisTokenManager.


Then write an Analyzer along these lines:

private static class URLAnalyzer extends Analyzer {

  public URLAnalyzer() {
  }

  public TokenStream tokenStream(String field, Reader reader) {
    return new URLTokenizer(reader);
  }
}

Finally, in NutchDocumentAnalyzer change

    if ("anchor".equals(fieldName))
      analyzer = ANCHOR_ANALYZER;
    else
      analyzer = CONTENT_ANALYZER;

to

    if ("anchor".equals(fieldName))
      analyzer = ANCHOR_ANALYZER;
    else if ("url".equals(fieldName))
      analyzer = URL_ANALYZER;
    else
      analyzer = CONTENT_ANALYZER;

assuming URL_ANALYZER is an instance of URLAnalyzer.

I have not tested this but it should work as expected.



Re: term frequency

2006-09-26 Thread Enis Soztutar

Chris K Wensel wrote:

Hi all

I'm interested in playing with term frequency values in a nutch index on a
per document and index wide scope.

for example, something similar to this lucene faq entry.
http://tinyurl.com/ra3ys

so  what is the 'correct' way to inspect the nutch index for these values.
Particularly against the lucene IndexReader behind the nutch IndexSearcher.
Since I don't see anything on the Searcher interface, is there some other
hadoop-ified way to do this?

assuming there isn't, if I was to add the ability to get document and index
wide term frequencies, would this be exposed on the nutch.searcher.Searcher
interface? 

e.g. 


Searcher.getTermVector( Hit hit ) // returns a nutch friendly TermVec obj
Searcher.getTermVector( Hit hit, String field )
Searcher.getTermVector( String field )

or is there a more relevant interface this should hang off of? Searcher
doesn't seem like a fit, neither does HitDetailer. Maybe HitTermVector and
IndexTermVector??

or is this just insane, it won't work like I think and I should just forget
trying to get corpus relevant info from the indexes during runtime?

cheers,
ckw


  

Hi,

For some statistical analysis I also needed term frequencies across the 
whole collection. Since Lucene only gives the term frequency per 
document, I calculated the collection-wide frequency by summing the 
per-document frequencies of the term. The code fragment below does this:

    /**
     * Returns the total number of occurrences of the given term across
     * the whole index ("reader" here is an open Lucene IndexReader).
     * @param term the term to count
     * @return number of occurrences of term
     * @throws IOException
     */
    private int getCount(Term term) throws IOException {
      int count = 0;
      TermDocs termDocs = reader.termDocs(term);
      while (termDocs.next()) {
        count += termDocs.freq();
      }
      termDocs.close();
      return count;
    }


But this method is inefficient, since it recalculates the value every 
time it is called, so a caching mechanism will prove useful. 
Alternatively, you may build a HashMap up front and store the 
term-to-frequency info in it, for example: