Re: About link analysis and filter usage, and Recrawling

2008-03-11 Thread Enis Soztutar
Hi, please see below. Vinci wrote: Hi everybody, I am trying to use Nutch to implement my spider algorithm... I need to get information from specific resources, then schedule the crawling based on the links it finds (i.e. Nutch will be a link analyzer as well as a crawler). Question here: 1. How

Re: Question about nutch and solr

2007-12-07 Thread Enis Soztutar
Hi, To clarify things a bit, let me explain Lucene and her children. Lucene: an inverted indexing library. Solr: a kind of index server application that wraps and extends the capabilities of Lucene. Hadoop: an implementation of MapReduce and DFS. Nutch: a search engine built

Re: Hadoop distributed search.

2007-12-07 Thread Enis Soztutar
Dennis, Have you tried using o.a.lucene.store.RAMDirectory instead of tmpfs? Intuitively I believe RAMDirectory should be faster, shouldn't it? Do you have any benchmark for the two? Dennis Kubes wrote: Trey Spiva wrote: According to a hadoop tutorial

Re: Hadoop .15 and eclipse on windows

2007-11-09 Thread Enis Soztutar
Hadoop has been running df for a long time, well before 0.13. You can run Hadoop under Cygwin in Windows. Please refer to Hadoop's documentation. Tim Gautier wrote: I do my nutch development and debugging on a Windows XP machine before transferring my jar files to a Linux cluster for actual

Re: Multiple Domains Search

2007-11-04 Thread Enis Soztutar
Hi, Unfortunately the QueryParser used in Nutch will not parse queries of the form site:site1,site2, but I've hit the same problem and started working on it. I will create a JIRA issue for this. You can refer to it there. karthik085 wrote: Hi, To search a query from a particular domain from the

Re: distributed search server

2007-09-27 Thread Enis Soztutar
Yes, you have to do it manually for now, but it is not so complicated to reopen the index if it is changed, using IndexReader's methods. We are using start-stop daemon to start/stop the index servers. Daemon can save the pid in a file and then you can kill the process with the given pid.

Re: How to treat # in URLs?

2007-08-14 Thread Enis Soztutar
Technically, the fragment is a part of the URL, but foo and foo#bar point to the same location, so it should be stripped out. Are you using url-normalizers? If not, could you please try them? Carl Cerecke wrote: Hi, I noticed that urls with a # in them are not handled any differently to
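To make the point about fragments concrete, here is a minimal sketch of what stripping a fragment means. It is plain Python using the standard library, not the code of Nutch's url-normalizer plugins, and the helper name strip_fragment is made up for illustration.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its '#fragment' part (hypothetical helper)."""
    # urldefrag splits a URL into (url_without_fragment, fragment).
    return urldefrag(url).url


if __name__ == "__main__":
    # Both forms refer to the same resource on the server.
    print(strip_fragment("http://example.com/foo#bar"))
    print(strip_fragment("http://example.com/foo"))
```

With a step like this applied before URLs reach the crawl database, foo and foo#bar would be stored as a single entry rather than fetched twice.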

Re: getting document link graph

2007-07-25 Thread Enis Soztutar
Linkdb contains all the information about the web graph. After fetching the segments, you should run bin/nutch invertlinks to build the linkdb, which is a MapFile. The entries in the MapFile are key,value pairs, where keys are Text objects(containing urls) and values are Inlinks objects. In
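As a rough mental model of the key,value layout described above, the sketch below inverts a small outlink table into a URL-to-inlinks mapping. It is plain Python, not Nutch or Hadoop code; the function name invert_links, the example URLs, and the assumption that each inlink is a (source URL, anchor text) pair are illustrative only.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {from_url: [(to_url, anchor), ...]} into {to_url: [(from_url, anchor), ...]}.

    Conceptual stand-in for building a link database: the key is the
    target URL, the value is the list of links pointing at it.
    """
    inlinks = defaultdict(list)
    for from_url, links in outlinks.items():
        for to_url, anchor in links:
            inlinks[to_url].append((from_url, anchor))
    return dict(inlinks)


if __name__ == "__main__":
    # Hypothetical crawl output: two pages both link to b.example.
    outlinks = {
        "http://a.example/": [("http://b.example/", "B")],
        "http://c.example/": [("http://b.example/", "see B")],
    }
    for url, sources in invert_links(outlinks).items():
        print(url, sources)
```

Looking up a URL in the resulting dictionary is the analogue of fetching the value stored under that key in the MapFile.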

Re: IndexFilter

2007-07-19 Thread Enis Soztutar
Enabled plugins that implement IndexingFilter are run for each file to generate the fields to index. The enabled plugins can be found in conf/nutch-default.xml or conf/nutch-site.xml. You can look at http://wiki.apache.org/nutch/IndexStructure. Kai_testing Middleton wrote: Not sure ... this is

Re: Stemming with Nutch

2007-06-28 Thread Enis Soztutar
Doğacan Güney wrote: On 6/28/07, Robert Young [EMAIL PROTECTED] wrote: Hi, Are the Nutch Stemming modifications available as a patch? I can't seem to find anything on issue.apache.org There is some sort of stemming for German and French languages (available as plugin analysis-de and

Re: Weird encoding problem

2007-06-26 Thread Enis Soztutar
I suggest you first open the index with Luke and check that the encoding is detected correctly, and run a search from Luke to see if you get any answers. Then you may invoke org.apache.nutch.searcher.Query to see if your query is parsed and translated correctly. Finally, you may check tomcat

Re: AW: Combining standard Lucene and Nutch

2007-04-11 Thread Enis Soztutar
Michael Böckling wrote: What you should do is to compare the structure Nutch uses with the structure you use, and somehow combine the two. For most of the fields, you should converge to the Nutch version. Other than that, once the index is created by Nutch, it is Lucene stuff. You can

Re: AW: AW: Combining standard Lucene and Nutch

2007-04-11 Thread Enis Soztutar
Michael Böckling wrote: Yes, Nutch uses a Query class different from Lucene's. The query is also parsed differently. Basically, Nutch parses the query with Query.parse, then it runs all the query plugins, which convert the Nutch query to a Lucene boolean query. Then this

Re: Removing pages from index immediately

2007-04-05 Thread Enis Soztutar
Since hadoop's map files are write once, it is not possible to delete some urls from the crawldb and linkdb. The only thing you can do is to create the map files once again without the deleted urls. But running the crawl once more as you suggested seems more appropriate. Deleting documents

Re: Nutch Step by Step Maybe someone will find this useful ?

2007-04-05 Thread Enis Soztutar
Great work, could you post these to the Nutch wiki as a step-by-step tutorial for newcomers? zzcgiacomini wrote: I have spent some time playing with nutch-0 and collecting notes from the mailing lists ... maybe someone will find these notes useful and could point out mistakes I am

Re: [Nutch-general] Removing pages from index immediately

2007-04-05 Thread Enis Soztutar
Andrzej Bialecki wrote: [EMAIL PROTECTED] wrote: Hi Enis, Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the

Re: Help on Activation of Subcollection at Indexing searching

2007-04-02 Thread Enis Soztutar
prashant_nutch wrote: Hi, Thanks for your early response. finally i got search result using subcollection,but still some issues, 1.can we should search on more than 2 subcollection at same time? like command subcollection:subcollection name1 term for search ... can we extend this

Re: Wildly different crawl results depending on environment...

2007-04-02 Thread Enis Soztutar
Briggs wrote: nutch 0.7.2 I have 2 scenarios (both using the exact same configurations): 1) Running the crawl tool from the command line: ./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5 2) Running the crawl tool from a web app somewhere in code like: final String[]

Re: Help on Activation of Subcollection at Indexing searching

2007-03-30 Thread Enis Soztutar
prashant_nutch wrote: IS Subcollection useful for specific URL Searching ? How we activate subcollection at indexing and searching time? in conf/subcollection , if we include our URL in whitelist ,then only we have search on that URLs? command for searching on subcollection Subcollection :

Re: Nutch dataset dirstructure

2007-03-30 Thread Enis Soztutar
pike wrote: Hi, I'm new to Nutch. Can anyone point me to some documentation about the directory structure Nutch creates and maintains when crawling, indexing etc.? We're doing whole-web crawls step by step. Since I have no reference, it's hard to see whether crawling, merging, indexing, etc went

Re: Wikia Search Engine? Anyone working on it?

2007-03-26 Thread Enis Soztutar
Sean Dean wrote: Ive been following it, but haven't posted anything over there. Honestly, if you read a lot of the public content in the forum and mailing list it provides you with absolutely nothing in terms of what they will be doing. Jimmy Wales is still running 100% of the show, and

Re: Any way for removing pages with same title in index?

2007-03-21 Thread Enis Soztutar
qi wu wrote: Hi, I found many pages with the same title whose contents are almost the same. I would like to index pages with the same title only once. How can I recognize pages with the same title during the indexing process? How does Nutch remove pages with the same page content and in which

Re: WARN QueryFilters - QueryFilter: RecommendedQueryFilter :names no fields.

2007-03-21 Thread Enis Soztutar
Ratnesh,V2Solutions India wrote: Hi, when I deployed the plugin inside the plugin directory of Nutch in Tomcat, I got the following warning messages. One is java.lang.ArrayIndexOutOfBoundsException: 0 and another is RecommendedQueryFilter :names no fields. (deleted the rest) Hi, you should define

Re: help needed : filters in regex-urlfilter.txt

2007-03-21 Thread Enis Soztutar
cha wrote: Hi, I want to ignore the following urls from crawling for eg. http://www.example.com/stores/abcd/merch-cats-pg/abcd.* http://www.example.com/stores/abcd/merch-cats/abcd.* http://www.example.com/stores/abcd/merch/abd.* I have used regex-urlfilter.txt file and negate the following
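For readers unfamiliar with how sign-prefixed regex rules of this kind behave, the sketch below models a first-match-wins filter over the example URLs from the question. It is a plain Python illustration, not Nutch's filter implementation: the rule list, the accept function, and the assumptions that rules are checked in order with unanchored matching and that a URL matching no rule is rejected are all stated here for illustration.

```python
import re

# Hypothetical rules in the spirit of a regex URL filter file:
# '-' rejects, '+' accepts, and the first matching rule decides.
RULES = [
    # Broad prefix: rejects every path under /stores/abcd/ starting with "merch".
    ("-", re.compile(r"^http://www\.example\.com/stores/abcd/merch")),
    # Accept anything else.
    ("+", re.compile(r".")),
]


def accept(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    for u in (
        "http://www.example.com/stores/abcd/merch-cats-pg/abcd.html",
        "http://www.example.com/stores/abcd/merch-cats/abcd.html",
        "http://www.example.com/stores/abcd/index.html",
    ):
        print(u, accept(u))
```

Because the first match decides, a reject rule only takes effect if it appears before any broader accept rule.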

Re: extracting urls into text files

2007-03-20 Thread Enis Soztutar
cha wrote: Thanks Enis, I am getting some idea from that. Can you tell me in which class I should implement that? I don't have Hadoop installed on my box. Just make a new class in Nutch and put the code there : ) As long as you have the Hadoop jar in your classpath, you do not need to checkout

Re: extracting urls into text files

2007-03-19 Thread Enis Soztutar
Check the javadocs of the CrawlDatum, CrawlDb, Text, MapFile, and SequenceFile classes for further insight. cha wrote: Hi Enis, I still can't figure out how it can be done. Can you explain in more detail, please? Regards, Chandresh Enis Soztutar wrote: cha wrote: hi sagar

Re: When can I delete segments? (still usefull after indexing?)

2007-03-16 Thread Enis Soztutar
cybercouf wrote: If I'm not wrong, segments are used by nutch to store parsed data, and after update the crawldb, and finally build an index. But when the crawl is finished, for a next recrawl nutch only need the last crawldb? so not my old segments. And for building the new index, it only

Re: Hi What is the use of refine-query-init.jsp,refine-query.jsp

2007-03-13 Thread Enis Soztutar
inalasuresh wrote: Hi , I am uncommented the refine-query.jsp and refine-query-init.jsp in the search.jsp i searched for bikekeyword it given result. Before that i am trying to run the application with comments witout comments . but that had given the same result. so plz any one can sugest

Re: Hi what is the use of subcollections.xml

2007-03-12 Thread Enis Soztutar
inalasuresh wrote: Hi , Any one help me. i am new for nutch.. what is the use of subcollections.xml when it is called. plz give the response for my query,... thanx regards suresh.. Hi, Subcollections is a plugin for indexing the urls matching a regular expression and subcollections.xml

Re: Arabic language in Nutch

2007-03-02 Thread Enis Soztutar
Munir wrote: Can you please tell me if it is possible to use NGramProfile to create arabic profile? if it is ok how? because I tried to run this command but I got error: java org.apache.nutch.analysis.lang.NGramProfile -create ar arabic windows-1256 error : syntax error near unexpected token

Re: How can I check (from log file, etc) weather analyzer-(fr|th) is in use?

2007-02-06 Thread Enis Soztutar
Vee Satayamas wrote: Hello, How can I check (from log file, etc) whether analyzer-th is in use? I have already modified nutch-site.xml as follows: <property> <name>plugin.includes</name>

Re: Nutch content with Lucene search

2007-01-29 Thread Enis Soztutar
Gilbert Groenendijk wrote: Thank you (and Brian) for your answers. I noticed this too, but I want to get the content through the Java API with Lucene 2.0. If that is impossible, I will have to write some extensions for my current code, but I would rather not. I guess the problem is the unstored property. Any

Re: Can I generate nutch index without crawling?

2007-01-25 Thread Enis Soztutar
Scott Green wrote: On 1/24/07, Sean Dean [EMAIL PROTECTED] wrote: What exactly are you looking to do? If you don't crawl for anything, then what data are you looking to index? You can certainly take some other persons Nutch segment (that they crawled) and then index it yourself, on your

Re: Boolean searches, again

2007-01-24 Thread Enis Soztutar
Nicolás Lichtmaier wrote: Now I know that Nutch doesn't support boolean queries. I've found this: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06320.html But this seems to be for a previous version of Nutch. Could someone give me a hint about conducting a boolean search by

Re: Plugins for features

2007-01-03 Thread Enis Soztutar
karthik085 wrote: What nutch plugins are available, that can do a similar job to these following Google features? (More about google features: http://www.google.com/advanced_search?hl=en) * File format : * Date * Domain * Topic-specific searches (Web/Images/Video...) * Search within results *

Re: Crawling from a different conf directory location.

2006-12-25 Thread Enis Soztutar
Julien wrote: Hello, just do a : export NUTCH_CONF_DIR=/_your_conf_path/ Julien Nearly all the classes used for crawling (Injector, Generator, Fetcher, Indexer, etc.) extend the org.apache.hadoop.util.ToolBase class, which ensures that the class can take some optional command line arguments.

Re: near duplicates

2006-10-30 Thread Enis Soztutar
John Casey wrote: On 10/18/06, Isabel Drost [EMAIL PROTECTED] wrote: Find Me wrote: How to eliminate near duplicates from the index? Someone suggested that I could look at the TermVectors and do a comparison to remove the duplicates. As an alternative you could also have a look at the

Re: Problem in URL tokenization

2006-09-27 Thread Enis Soztutar
Vishal Shah wrote: Hi, If I understand correctly, there is a common tokenizer for all fields (URL, content, meta etc.). This tokenizer does not use the underscore character as a separator. Since a lot of URLs use underscore to separate different words, it would be better if the URLs are

Re: term frequency

2006-09-26 Thread Enis Soztutar
Chris K Wensel wrote: Hi all I'm interested in playing with term frequency values in a nutch index on a per document and index wide scope. for example, something similar to this lucene faq entry. http://tinyurl.com/ra3ys so what is the 'correct' way to inspect the nutch index for these