Hi,
please see below
Vinci wrote:
Hi everybody,
I am trying to use nutch to implement my spider algorithm...I need to get
information from specific resources, then schedule the crawling based on the
links it finds (i.e. nutch will be a link analyzer as well as a crawler)
Question here:
1. How
Hi,
To clarify things a bit, let me explain Lucene and her children:
Lucene : an inverted indexing library.
Solr : a kind of index server application that wraps and extends
the capabilities of Lucene.
Hadoop : an implementation of MapReduce and a distributed file system (DFS).
Nutch : a search engine built
Dennis,
Have you tried using o.a.lucene.store.RAMDirectory instead of tmpfs?
Intuitively I believe RAMDirectory should be faster, shouldn't it? Do you
have any benchmarks comparing the two?
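For what it's worth, a minimal sketch of the RAMDirectory approach (Lucene 2.x API; the index path is a placeholder):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamDirExample {
  public static void main(String[] args) throws Exception {
    // copy the whole on-disk index into the JVM heap
    Directory ram = new RAMDirectory(FSDirectory.getDirectory("/path/to/index", false));
    IndexSearcher searcher = new IndexSearcher(ram);
    System.out.println("docs in memory: " + searcher.maxDoc());
    searcher.close();
  }
}

Note that unlike tmpfs, this copies the index into the Java heap at startup, so the whole index has to fit there.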
Dennis Kubes wrote:
Trey Spiva wrote:
According to a hadoop tutorial
Hadoop has been running df for a long time, since well before 0.13. You can run
hadoop under Cygwin on Windows. Please refer to Hadoop's documentation.
Tim Gautier wrote:
I do my nutch development and debugging on a Windows XP machine before
transferring my jar files to a Linux cluster for actual
Hi,
Unfortunately the QueryParser used in nutch will not parse queries of the
form site:site1,site2, but I've hit the same problem and have started
working on it. I will create a jira issue for this; you can follow it there.
karthik085 wrote:
Hi,
To search a query from a particular domain from the
Yes, you have to do it manually for now, but it is not so complicated to
reopen the index if it is changed, using IndexReader's methods.
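A minimal sketch of that reopening logic (Lucene 2.x era API; the index path is a placeholder):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class Reopener {
  private IndexReader reader;
  private IndexSearcher searcher;

  public Reopener(String indexDir) throws Exception {
    reader = IndexReader.open(indexDir);
    searcher = new IndexSearcher(reader);
  }

  /** Call this periodically, or before each search. */
  public synchronized void refreshIfChanged(String indexDir) throws Exception {
    if (!reader.isCurrent()) {           // index has been modified since open
      searcher.close();
      reader.close();
      reader = IndexReader.open(indexDir);
      searcher = new IndexSearcher(reader);
    }
  }

  public synchronized IndexSearcher getSearcher() { return searcher; }
}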
We are using start-stop daemon to start/stop the index servers. Daemon
can save the pid in a file and then you can kill the process with the
given pid.
Technically, the fragment is a part of the url, but foo and foo#bar
point to the same location, so it should be stripped out. Are you using
url-normalizers? If not, could you please try them?
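This is not the nutch url-normalizer plugin itself, just a sketch of the normalization step it should perform, i.e. dropping the #fragment part:

public class FragmentStripper {
  /** Returns the url with any #fragment removed. */
  public static String stripFragment(String url) {
    int hash = url.indexOf('#');
    return hash == -1 ? url : url.substring(0, hash);
  }

  public static void main(String[] args) {
    // prints http://example.com/foo
    System.out.println(stripFragment("http://example.com/foo#bar"));
  }
}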
Carl Cerecke wrote:
Hi,
I noticed that urls with a # in them are not handled any differently
to
Linkdb contains all the information about the web graph. After fetching
the segments, you should run bin/nutch invertlinks to build the linkdb,
which is a MapFile. The entries in the MapFile are key/value pairs,
where keys are Text objects (containing urls) and values are Inlinks
objects. In
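A rough sketch of looking up the inlinks for one url in the linkdb (0.9-era classes; the part path layout, e.g. linkdb/part-00000 or linkdb/current/part-00000, is an assumption that depends on your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.util.NutchConfiguration;

public class LinkDbLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // args[0] = one part of the linkdb MapFile, args[1] = the url to look up
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Text key = new Text(args[1]);
    Inlinks value = new Inlinks();
    if (reader.get(key, value) != null) {
      System.out.println(args[1] + " -> " + value);
    }
    reader.close();
  }
}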
Enabled plugins that implement IndexingFilter are run for each document to
generate the fields to index. The list of enabled plugins can be found in
conf/nutch-default.xml or conf/nutch-site.xml.
You can look at http://wiki.apache.org/nutch/IndexStructure.
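As a rough illustration, an indexing filter is just a plugin class implementing that interface. The sketch below uses the 0.9-era signature (the interface changed between Nutch versions) and a made-up field name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class MyIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks) {
    // add a hypothetical "mysite" field so it can be searched later
    doc.add(new Field("mysite", url.toString(),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}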
Kai_testing Middleton wrote:
Not sure ... this is
Doğacan Güney wrote:
On 6/28/07, Robert Young [EMAIL PROTECTED] wrote:
Hi,
Are the Nutch Stemming modifications available as a patch? I can't
seem to find anything on issue.apache.org
There is some sort of stemming for German and French languages
(available as plugin analysis-de and
I suggest you first open the index with Luke and check that the encoding
is detected correctly, and make a search from Luke to see whether you get any
answers. Then you may invoke org.apache.nutch.searcher.Query to see if
your query is parsed and translated correctly. Finally, you may check
tomcat
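To see how the query is parsed, a small throwaway class along these lines (0.8/0.9-era classes, run from the nutch classpath) may help:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class QueryDebug {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // args[0] = the query string as the user would type it
    Query query = Query.parse(args[0], conf);
    System.out.println("parsed nutch query: " + query);
  }
}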
Michael Böckling wrote:
What you should do is compare the structure nutch uses with the
structure you use, and somehow combine the two. For most of the fields,
you should converge to the nutch version. Other than that,
once the index is created by nutch, it is lucene stuff. You can
Michael Böckling wrote:
Yes, Nutch uses a Query class different from Lucene's. The query is also
parsed differently.
What nutch basically does is parse the query with Query.parse, then run
all the query plugins, which convert the nutch query to a lucene boolean
query. Then this
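A sketch of that translation path (0.8/0.9-era classes; the query string is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.search.BooleanQuery;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.QueryFilters;
import org.apache.nutch.util.NutchConfiguration;

public class QueryTranslation {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // parse the user query into a nutch Query
    Query nutchQuery = Query.parse("apache site:lucene.apache.org", conf);
    // run the enabled query plugins to get the lucene boolean query
    BooleanQuery luceneQuery = new QueryFilters(conf).filter(nutchQuery);
    System.out.println(luceneQuery);
  }
}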
Since hadoop's map files are write-once, it is not possible to delete
some urls from the crawldb and linkdb. The only thing you can do is to
create the map files once again without the deleted urls. But running
the crawl once more, as you suggested, seems more appropriate. Deleting
documents
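Purely as an illustration of the "rewrite without the deleted urls" idea, here is a sketch that copies one crawldb MapFile part while skipping a given url (class names and the part layout follow the 0.9-era code; this is not a drop-in tool):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbFilterCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    String inPart = args[0];    // e.g. crawldb/current/part-00000
    String outPart = args[1];   // new part to write
    String urlToDrop = args[2];

    MapFile.Reader reader = new MapFile.Reader(fs, inPart, conf);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, outPart, Text.class, CrawlDatum.class);

    Text key = new Text();
    CrawlDatum value = new CrawlDatum();
    // MapFiles are sorted and we copy in order, so the output stays valid
    while (reader.next(key, value)) {
      if (!key.toString().equals(urlToDrop)) {
        writer.append(key, value);
      }
    }
    writer.close();
    reader.close();
  }
}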
Great work! Could you post these into the nutch wiki as a step-by-step
tutorial for newcomers?
zzcgiacomini wrote:
I have spent some time playing with nutch-0 and collecting notes from
the mailing lists ...
Maybe someone will find these notes useful and could point out my
mistakes.
I am
Andrzej Bialecki wrote:
[EMAIL PROTECTED] wrote:
Hi Enis,
Right, I can easily delete the page from the Lucene index, though I'd
prefer to follow the Nutch protocol and avoid messing something up by
touching the index directly. However, I don't want that page to
re-appear in one of the
prashant_nutch wrote:
Hi,
Thanks for your early response.
Finally I got search results using subcollection, but there are still some issues:
1. Can we search on more than 2 subcollections at the same time?
Like the command:
subcollection:subcollection name1 term for search ...
Can we extend this
Briggs wrote:
nutch 0.7.2
I have 2 scenarios (both using the exact same configurations):
1) Running the crawl tool from the command line:
./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
2) Running the crawl tool from a web app somewhere in code like:
final String[]
prashant_nutch wrote:
Is Subcollection useful for specific URL searching?
How do we activate subcollection at indexing and searching time?
In conf/subcollection,
if we include our URL in the whitelist, do we then only search on those URLs?
Command for searching on subcollection:
Subcollection :
pike wrote:
Hi
I'm new to nutch.
Can anyone point me to some documentation about
the directory structure Nutch creates and maintains
when crawling, indexing etc ? We're doing whole-web
crawls step by step. Since I have no reference, it's
hard to see whether crawling, merging, indexing, etc.
went
Sean Dean wrote:
I've been following it, but haven't posted anything over there. Honestly, if you read a
lot of the public content in the forum and mailing list it provides you with
absolutely nothing in terms of what they will be doing.
Jimmy Wales is still running 100% of the show, and
qi wu wrote:
Hi,
I found many pages with the same title, and the page contents are almost the same. I
would like to index the pages with the same title only once. How can I
recognize the pages with the same title during the indexing process?
How does nutch remove pages with the same page content, and in which
Ratnesh,V2Solutions India wrote:
Hi,
When I deployed the plugin inside the plugin directory of nutch in tomcat, I got the
following warning messages:
one is java.lang.ArrayIndexOutOfBoundsException: 0
and another is RecommendedQueryFilter: names no fields.
(deleted the rest)
Hi, you should define
cha wrote:
Hi,
I want to ignore the following urls from crawling
for eg.
http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
http://www.example.com/stores/abcd/merch-cats/abcd.*
http://www.example.com/stores/abcd/merch/abd.*
I have used the regex-urlfilter.txt file and negated the following
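For reference, a minimal regex-urlfilter.txt sketch for excluding such paths (the patterns are only guesses built from the urls above; the first matching rule wins, so keep a catch-all at the end):

# skip the store pages
-^http://www\.example\.com/stores/abcd/merch-cats-pg/
-^http://www\.example\.com/stores/abcd/merch-cats/
-^http://www\.example\.com/stores/abcd/merch/
# accept everything else
+.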
cha wrote:
Thanks Enis,
I am getting some idea from that.
Can you tell me in which class I should implement that?
I don't have hadoop installed on my box.
Just make a new class in nutch and put the code there :) As long as
you have the hadoop jar in your classpath, you do not need to check out
hadoop itself. Check the
javadocs of the CrawlDatum, CrawlDb, Text, MapFile and SequenceFile classes
for further insight.
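If it helps, a bare-bones sketch along those lines, dumping url/CrawlDatum pairs from one crawldb part (the "data" file of a MapFile part is a SequenceFile; the path layout is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // args[0] = a crawldb part directory, e.g. crawldb/current/part-00000
    Path data = new Path(args[0], "data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);

    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      System.out.println(url + "\t" + datum);
    }
    reader.close();
  }
}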
cha wrote:
Hi Enis,
I still can't figure out how it can be done. Can you explain it more
elaborately, please?
Regards,
Chandresh
Enis Soztutar wrote:
cha wrote:
hi sagar
cybercouf wrote:
If I'm not wrong, segments are used by nutch to store parsed data, and afterwards
to update the crawldb, and finally to build an index.
But when the crawl is finished, for the next recrawl nutch only needs the last
crawldb, not my old segments?
And for building the new index, it only
inalasuresh wrote:
Hi ,
I uncommented the refine-query.jsp and refine-query-init.jsp in the
search.jsp.
I searched for a bike keyword and it gave a result.
Before that, I tried to run the application with the comments and without
the comments,
but that gave the same result.
So please, can anyone suggest
inalasuresh wrote:
Hi ,
Can anyone help me? I am new to nutch.
What is the use of subcollections.xml,
and when is it called?
Please give a response to my query.
Thanks and regards,
suresh
Hi,
Subcollections is a plugin for indexing the urls matching a regular
expression and subcollections.xml
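From memory, conf/subcollections.xml looks roughly like the sketch below (the element names may differ slightly between versions); urls whose prefix matches the whitelist get tagged with the subcollection name at indexing time, so you can later query with subcollection:nutch:

<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
  <subcollection>
    <name>nutch</name>
    <id>nutch</id>
    <whitelist>http://lucene.apache.org/nutch/</whitelist>
    <blacklist></blacklist>
  </subcollection>
</subcollections>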
Munir wrote:
Can you please tell me if it is possible to use NGramProfile to create
an arabic profile? If so, how? Because I tried to run this command
but I got an error:
java org.apache.nutch.analysis.lang.NGramProfile -create ar arabic windows-1256
error: syntax error near unexpected token
Vee Satayamas wrote:
Hello,
How can I check (from a log file, etc.) whether analyzer-th is in use? I have
already modified nutch-site.xml as follows:
<property>
  <name>plugin.includes</name>
Gilbert Groenendijk wrote:
Thank you (and Brian) for your answers. I noticed this too, but I want to get
the content with the Java API with Lucene 2.0. If it is impossible, I have
to write some extensions for my current code, but I'd rather not. I guess the
problem is the unstored property. Any
Scott Green wrote:
On 1/24/07, Sean Dean [EMAIL PROTECTED] wrote:
What exactly are you looking to do?
If you don't crawl for anything, then what data are you looking to
index?
You can certainly take some other person's Nutch segment (that they
crawled) and then index it yourself, on your
Nicolás Lichtmaier wrote:
Now I know that Nutch doesn't support boolean queries. I've found this:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06320.html
But this seems to be for a previous version of Nutch.
Could someone give me a hint about conducting a boolean search by
karthik085 wrote:
What nutch plugins are available, that can do a similar job to these
following Google features? (More about google features:
http://www.google.com/advanced_search?hl=en)
* File format :
* Date
* Domain
* Topic-specific searches (Web/Images/Video...)
* Search within results
*
Julien wrote:
Hello,
just do:
export NUTCH_CONF_DIR=/_your_conf_path/
Julien
Nearly all the classes used for crawling (Injector, Generator, Fetcher,
Indexer, etc.) extend the org.apache.hadoop.util.ToolBase class, which
ensures that the class can take some optional command line arguments.
John Casey wrote:
On 10/18/06, Isabel Drost [EMAIL PROTECTED] wrote:
Find Me wrote:
How can I eliminate near duplicates from the index? Someone suggested
that I could look at the TermVectors and do a comparison to remove the
duplicates.
As an alternative you could also have a look at the
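A rough sketch of the term-vector comparison idea (Lucene 2.x API; it assumes the "content" field was indexed with term vectors, which may not be the case in a stock nutch index):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class NearDupCheck {
  /** Jaccard-style overlap of the term sets of two documents. */
  public static double overlap(IndexReader reader, int docA, int docB)
      throws Exception {
    TermFreqVector a = reader.getTermFreqVector(docA, "content");
    TermFreqVector b = reader.getTermFreqVector(docB, "content");
    if (a == null || b == null) return 0.0;   // no term vectors stored

    Set<String> common = new HashSet<String>(Arrays.asList(a.getTerms()));
    common.retainAll(Arrays.asList(b.getTerms()));
    int union = a.getTerms().length + b.getTerms().length - common.size();
    return union == 0 ? 0.0 : (double) common.size() / union;
  }
}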
Vishal Shah wrote:
Hi,
If I understand correctly, there is a common tokenizer for all fields
(URL, content, meta etc.). This tokenizer does not use the underscore
character as a separator. Since a lot of URLs use underscore to separate
different words, it would be better if the URLs are
Chris K Wensel wrote:
Hi all
I'm interested in playing with term frequency values in a nutch index on a
per-document and index-wide scope.
For example, something similar to this lucene FAQ entry:
http://tinyurl.com/ra3ys
So what is the 'correct' way to inspect the nutch index for these
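One way to poke at those numbers directly (Lucene 2.x API; the field name and index path are placeholders, e.g. crawl/index):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermFreqInspector {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);   // path to the index
    Term term = new Term("content", args[1]);

    // index-wide: number of documents containing the term
    System.out.println("docFreq = " + reader.docFreq(term));

    // per-document: term frequency within each matching document
    TermDocs docs = reader.termDocs(term);
    while (docs.next()) {
      System.out.println("doc " + docs.doc() + " freq " + docs.freq());
    }
    docs.close();
    reader.close();
  }
}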