Re: Update on ignoring menu divs

2010-02-28 Thread Sami Siren
quality it would be nice to have that wrapped as a plugin in Nutch. -- Sami Siren

Re: Nutch 1.0 with tomcat6 and Firefox does not find all files on Fedora 12

2010-02-24 Thread Sami Siren
Hannu, Do you use same set of QueryFilters both in the webapp and when running from shell? Perhaps your filter is not executed when running from cli? You can verify how your query is transformed by running bin/nutch org.apache.nutch.searcher.Query and entering some queries. -- Sami Siren

Re: Content storage, results highlighting

2010-02-24 Thread Sami Siren
The schema.xml file there is usable only when using Solr as the search server. Are you using Solr? -- Sami Siren Pedro Bezunartea López wrote: Hi, I've developed a web application in lucene that searches web pages using a nutch generated index. I'd like to highlight the query searched

Re: Nutch near future - strategic directions

2009-11-26 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop What does it mean plain vanilla here? Do you mean the current DB implementation? That's the idea, we should

Re: Nutch near future - strategic directions

2009-11-18 Thread Sami Siren
efficient and understandable if the foundation (eg. data structures, extendability for example) was in better shape. Also if written nicely other projects could use them too! -- Sami Siren Andrzej Bialecki wrote: Hi all, The ApacheCon is over, our release 1.0 has been out already for some time

Re: Fetcher2 Slow

2009-03-30 Thread Sami Siren
connections. 2. Your machine has IPv6 enabled. This I noticed more recently when I was wondering about relatively slow fetching speed on a box. After disabling IPv6 totally I was able to fetch 2-4 times faster without any other config changes. -- Sami Siren

[ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Sami Siren
information on Apache Nutch, visit the project home page: http://lucene.apache.org/nutch -- Sami Siren (on behalf of the Apache Nutch community)

Re: Fwd: fetch but not index

2009-03-11 Thread Sami Siren
. That is why it does not end up in the index. -- Sami Siren

Re: Running multiple processes on a single machine

2009-03-11 Thread Sami Siren
linkdb). -- Sami Siren

Re: Working with Solr. Doubts

2009-03-10 Thread Sami Siren
snippets. -- Sami Siren

Re: Exception when crawling

2009-03-04 Thread Sami Siren
dealmaker wrote: I have similar problem with nightly build #741 (Mar 3, 2009 4:01:53 AM). What's wrong? There was a change in hadoop that caused this problem to appear. It has now been fixed on build #743 -- Sami Siren

Re: How do you setup your svn for your nutch code?

2009-03-02 Thread Sami Siren
Just an FYI, there are also (unofficial) git repos for many Apache projects - including Nutch - here: http://jukka.zitting.name/git/ -- Sami Siren Dingding Ye wrote: similar. 1. git-svn clone nutch-trunk Then create a git project which is my working project. After that, clone the nutch-git

Re: Problem with crawling using the latest 1.0 trunk

2009-03-02 Thread Sami Siren
Hi, and thanks for being persistent. Can you specify what is the version of nutch that you are running, is it a nightly build (if yes, which one?) or did you check out the svn trunk? And just to be sure: you are running with default configuration? -- Sami Siren ahammad wrote: I checked

Re: Problem with crawling using the latest 1.0 trunk

2009-03-02 Thread Sami Siren
I can see this error also. not sure yet what's going wrong... -- Sami Siren Justin Yao wrote: log4j configure: log4j.logger.org.apache.nutch.indexer.Indexer=TRACE,cmdstdout log4j.logger.org.apache.nutch=TRACE log4j.logger.org.apache.hadoop=TRACE Output: 2009-03-02 17:53:21,987 DEBUG

Re: Problem with crawling using the latest 1.0 trunk

2009-03-02 Thread Sami Siren
Sami Siren wrote: I can see this error also. not sure yet what's going wrong... it's NUTCH-703 (hadoop upgrade) that broke the indexing. any ideas what changed in hadoop that might have caused this? -- Sami Siren -- Sami Siren Justin Yao wrote: log4j configure

Re: log org.apache.solr.common.SolrException: Bad Request when indexing feeds with solrindexer.

2009-02-23 Thread Sami Siren
) at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217) at Hi, I would check the Solr log to see why it is failing; probably Nutch is providing content to a field not present in the Solr schema. -- Sami Siren

Re: Nutch 1.0 - Setting up and running Nutch for crawling and Solr for indexing and querying.

2009-02-22 Thread Sami Siren
in testing the current nightly builds and providing documentation patches or wiki updates is appreciated. -- Sami Siren nightly build? thanks On Fri, Feb 20, 2009 at 6:31 PM, Kham Vo k...@mac.com wrote: Hello Nutch 1.0 designers, I successfully installed and set up Nutch 1.0 (build # 722

Re: HTTP Status 500 - No Context configured to process this request

2009-02-22 Thread Sami Siren
(searcher.dir) - execute (from command line) bin/nutch org.apache.nutch.searcher.NutchBean query -- Sami Siren Thanks Sam Hi, I just dropped Nutch web app into tomcat version 6.0.18 and it worked fine, perhaps you should upgrade your Tomcat? -- Sami Siren samuel.gre...@mesaaz.gov wrote

Re: Feed indexing with solrindex not working.

2009-02-22 Thread Sami Siren
(?) There is an open issue for this https://issues.apache.org/jira/browse/NUTCH-699. Please contribute your findings there. -- Sami Siren

Re: HTTP Status 500 - No Context configured to process this request

2009-02-20 Thread Sami Siren
Hi, I just dropped Nutch web app into tomcat version 6.0.18 and it worked fine, perhaps you should upgrade your Tomcat? -- Sami Siren samuel.gre...@mesaaz.gov wrote: Hi, I am following the tutorial here: http://nutch.sourceforge.net/docs/en/tutorial.html Crawling works fine, as does

Re: Distributed Search Server fails with Trunk

2009-02-19 Thread Sami Siren
for version = 1.0, priority = blocker. thanks. -- Sami Siren

Re: nutch restart after recrawl

2009-02-19 Thread Sami Siren
as indexing back end, the integration is in nightly version of nutch. I am not sure if the procedure is documented anywhere. -- Sami Siren All scripts suggest restarting nutch but this leads that searching is unavailable for a few minutes. May I call an API or something?

Re: Fetcher2 doesn't print status information on console

2009-02-19 Thread Sami Siren
but not for Fetcher2. If you add such a line for Fetcher2 it should start outputting logging to stdout. -- Sami Siren Thanks in advance. Kind regards, Martina

Re: Fetcher2 crashes with current trunk

2009-02-19 Thread Sami Siren
Doğacan Güney wrote: I think I have found the bug here, but I am in a hurry now, I will create a JIRA issue and post (what is hopefully) the fix later today. Great! thanks. -- Sami Siren On Tue, Feb 17, 2009 at 21:39, Doğacan Güney doga...@gmail.com wrote: 2009/2/17 Sami Siren ssi

Re: Restarting Nutch

2009-02-18 Thread Sami Siren
). If your setup is similar and you ensure that the filesystem can survive single node failures your data should be safe. -- Sami Siren

Re: How many kb is a page's index?

2009-02-18 Thread Sami Siren
that people daily used on Windows? which can maximize performance? Well it can be anything, the important thing is to set up a small system with similar hardware and see how it performs. That way you can get quite accurate estimates on larger scale systems running on similar hardware. -- Sami Siren

Re: Fetcher2 crashes with current trunk

2009-02-17 Thread Sami Siren
Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if it is reproducible. -- Sami Siren Doğacan Güney wrote: Thanks for detailed analysis. I will take a look and get back to you. On Mon, Feb 16, 2009 at 13:41, Koch Martina k...@huberverlag.de wrote: Hi, sorry

Re: Trying to understand how webapp works

2009-02-17 Thread Sami Siren
the directory did you? It might be working because the webapp still has references to all files it needs. Restart tomcat and it should work no more. -- Sami Siren

Re: indexing after fetching

2009-02-17 Thread Sami Siren
, updatedb, generate... -- Sami Siren

Re: nutch jdk?

2009-02-09 Thread Sami Siren
Dennis Kubes wrote: jdk1.5 or better, I am currently on jdk1.6 sun. For the webapp we use tomcat but should run on any jsp/servlet container, websphere included. I think you need 1.6 now (for trunk) since we use Hadoop 0.19. -- Sami Siren

Re: nutch jdk?

2009-02-09 Thread Sami Siren
buddha1021 wrote: Sami Siren-2 wrote: Dennis Kubes wrote: jdk1.5 or better, I am currently on jdk1.6 sun. For the webapp we use tomcat but should run on any jsp/servlet container, websphere included. I think you need 1.6 now (for trunk) since we use Hadoop 0.19. -- Sami

Re: how to create a new ngp file for Telugu in nutch

2008-08-21 Thread Sami Siren
is to enable language-identifier plugin and execute class through the plugin command: bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.NGramProfile -create te sample_te.txt utf-8 -- Sami Siren

directions for web ui? [was Re: web2 plugins compilation error]

2008-08-21 Thread Sami Siren
directions (where did all that time go?). I think that we need a simple-to-maintain UI that is easy to customize (both of the current UIs fail to satisfy those requirements IMO). What kind of thoughts do others have? -- Sami Siren michos101 wrote: Hi, i am trying to enable the web2 plugins but i

Re: Next Generation Nutch

2008-04-12 Thread Sami Siren
and tutorials (maybe even a book :)). So up to this point I have created MapReduce jobs that use spring for dependency injection and it is simple and works well. The above is the direction I would like to head down but I would also like to see what everyone else is thinking. Dennis -- Sami

Re: Nutch training at ApacheCon EU 2008

2008-03-25 Thread Sami Siren
of interesting lucene/solr/hadoop related stuff there to attend to. -- Sami Siren

Re: can't find hadoop classes necessary to use Nutch API

2007-11-29 Thread Sami Siren
. Is there any way to use Nutch without them? Thank you for answers to any or all of these questions. The hadoop jar (hadoop-version-core.jar) should be available under lib/. Nutch cannot be compiled/run without it. -- Sami Siren

Re: java.lang.NoClassDefFoundError Nutch 0.9

2007-11-08 Thread Sami Siren
karthik085 wrote: Hi, I got nutch from svn tags - release0.9 - but can't get rid of this problem. I did ant compile ant jar ant war All of them build successfully with different versions of ant - 1.6.5 and 1.7.0 do ant job -- Sami Siren

Re: PDF problems, inc. documents returned with XLS extension

2007-10-22 Thread Sami Siren
. -- Sami Siren

Re: Indexer does not update the Lucene TITLE field

2007-10-19 Thread Sami Siren
Sergio Morales wrote: Hi Sami, Thanks for the info. Is there any other way to share this? create a jira issue and attach to it? -- Sami Siren

Re: Indexer does not update the Lucene TITLE field

2007-10-19 Thread Sami Siren
html document to your webserver/filesystem. There was not any html document attached. This is because mailing list software removes them. -- Sami Siren

Re: Problems running multiple nutch nodes

2007-10-04 Thread Sami Siren
vm processes with hadoop conf like: <property> <name>mapred.child.java.opts</name> <value>-Xmx1000m</value> </property> -- Sami Siren

Re: IOException using feed plugin - NUTCH-444

2007-07-03 Thread Sami Siren
showed did not have it registered) -- Sami Siren java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required! at org.apache.nutch.scoring.ScoringFilters.init(ScoringFilters.java:87) at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java

Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Sami Siren
hand I think that things are already too complicated for novice users/imo) 2) Make it work in distributed setups (i.e. with more than 1 index server) . Sami Siren also makes a note of this, but I don't believe that a simple hash-the-url approach is appropriate for nutch. It would be nice
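
For readers unfamiliar with the "hash-the-url" idea mentioned in the snippet above, here is a minimal Python sketch of what such a scheme usually looks like. It is not Nutch or Solr code: the function name `pick_index_server` and the server names are hypothetical, chosen only for illustration.

```python
import hashlib
from typing import Sequence


def pick_index_server(url: str, servers: Sequence[str]) -> str:
    """Choose one server for a URL by hashing the URL string.

    The same URL always maps to the same server as long as the
    server list does not change.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer bucket key.
    bucket = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[bucket]


# Hypothetical index servers, for illustration only.
servers = ["index-a:8983", "index-b:8983", "index-c:8983"]
for url in ("http://example.com/", "http://example.com/a.html"):
    print(url, "->", pick_index_server(url, servers))
```

One known drawback of plain modulo hashing is that changing the number of servers remaps most URLs to a different server.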

Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Sami Siren
like a thing that can manage large online indexes perhaps it would serve most goodness if it was not tied to nutch. -- Sami Siren

Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Sami Siren
simplicity in mind, other motivation was doing it without touching Nutch source code. -- Sami Siren

Re: Enabling Spell-Check plugin in contrib

2007-06-15 Thread Sami Siren
org.apache.nutch.webapp.common does not exist Could you help me to know where is a problem? it seems you can just ignore step #5, because they get compiled in #7 -- Sami Siren

Re: Enabling Spell-Check plugin in contrib

2007-06-13 Thread Sami Siren
. -- Sami Siren

Re: Regex-urlfilter

2007-05-16 Thread Sami Siren
or configure crawl to use regex-urlfilter.xml via crawl-tool.xml. -- Sami Siren

Re: fetch single host

2007-05-11 Thread Sami Siren
would require source code changes) -- Sami Siren

Re: urlfilter-suffix bug ?

2007-05-06 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Emmanuel JOKE wrote: ... those files. I tried to look at the code and I think the plugin doesn't manage correctly the dynamic URL with ? and parameters after the extension of the file. Yes your observation is correct, the filter compares only
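
The snippet above is cut off, so the following is only a sketch of the behaviour Emmanuel describes, written in Python rather than taken from the urlfilter-suffix plugin. It assumes the filter tests the end of the whole URL string; the suffix list and both function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical list of suffixes a crawl might want to exclude.
BLOCKED_SUFFIXES = (".jpg", ".gif", ".zip")


def blocked_by_full_url(url: str) -> bool:
    """Compare the suffix against the entire URL string."""
    return url.lower().endswith(BLOCKED_SUFFIXES)


def blocked_by_path(url: str) -> bool:
    """Compare the suffix against the path only, ignoring '?query'."""
    return urlsplit(url).path.lower().endswith(BLOCKED_SUFFIXES)


url = "http://example.com/photo.jpg?size=large"
print(blocked_by_full_url(url))  # False: the string ends in "size=large"
print(blocked_by_path(url))      # True: the path ends in ".jpg"
```

Under that assumption, a dynamic URL slips past the full-string check because the query parameters come after the file extension.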

Re: nutch freezing issue

2007-05-05 Thread Sami Siren
Siddharth Jonathan wrote: Hi, After a couple of days of being up, my nutch app begins to freeze/hang and basically indexing and searching can no longer happen. During this time (couple of days) is it just sitting idle or serving requests? -- Sami Siren

Re: urlfilter-suffix bug ?

2007-05-05 Thread Sami Siren
the functionality so it meets your requirement. -- Sami Siren

Re: Nutch and running crawls within a container.

2007-04-30 Thread Sami Siren
several crawlers running concurrently. We You should perhaps use and call the classes directly and take control of managing the Configuration object, this way PermGen size is not wasted by loading same classes over and over again. -- Sami Siren

Re: Can anybody tell me how the Nutch-0.9 is different than nutch-0.8.1

2007-04-20 Thread Sami Siren
://issues.apache.org/jira/secure/BrowseProject.jspa?id=10680&subset=3 where most of the changes are listed. -- Sami Siren

Re: Classpath and plugins question

2007-04-19 Thread Sami Siren
project called Apache Tika [1] which has a goal of putting together a generally usable parsing/extracting framework. It hasn't yet got off the ground so there is a good chance to get your voice heard. [1] http://incubator.apache.org/tika/ -- Sami Siren

Re: How to recude the tmp disk space usage during linkdb process?

2007-04-11 Thread Sami Siren
to cut down your temp size requirements (after compression, I think it's possible to compress the temp data?) is to do your work in smaller slices. -- Sami Siren - Original Message From: qi wu [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, April 11, 2007 10:41:35

Re: Fetcher2 too many spinWaiting, How to tune?

2007-04-02 Thread Sami Siren
|fetcher2 can fetch them fast unless you make it non polite. -- Sami Siren

Re: Crawling + Indexing staging vs. production and URL conflict

2007-04-01 Thread Sami Siren
Tomi N/A wrote: 2007/3/31, Sami Siren [EMAIL PROTECTED]: You could also let your reverse proxy do the rewriting using something like http://apache.webthing.com/mod_proxy_html/. I have been using something like that for rewriting massive amount of html in realtime for AA purposes to hammer

Re: Crawling + Indexing staging vs. production and URL conflict

2007-03-31 Thread Sami Siren
of html in realtime for AA purposes to hammer web applications to different url space. -- Sami Siren

Re: Merging WebDBs

2007-03-23 Thread Sami Siren
/Projects/DummyNutch/Nutch/linkdb/parse_data in local is invalid. thanks in advance for help LinkDb treats the parameter invertlinks as the path to linkdb (the 1st parameter), remove it and the command should succeed. -- Sami Siren

Re: Nutch and GET

2007-03-23 Thread Sami Siren
PROTECTED] -- Sami Siren

Re: How to limit nutch to fetch, refetch and index just the injected URLs?

2007-02-02 Thread Sami Siren
Nicolás Lichtmaier wrote: I've backported revision 450799 to the 0.8.x branch for supporting -noAdditions. Perhaps you could consider committing it there... (I haven't tested it yet though). Can you please create a JIRA issue for this and attach the patch there. -- Sami Siren

Re: Indexing only some filetypes with Nutch

2007-01-24 Thread Sami Siren
to find the images. You would also need to change indexer to index just the content you are interested in (images) and skip the rest. -- Sami Siren

Re: Compiling PruneIndexTool trouble

2007-01-22 Thread Sami Siren
PruneIndexTool). $ ant -- Sami Siren

Re: How to stop a slow fetch?

2007-01-18 Thread Sami Siren
time up until that point. There's some more about that issue and how it affected a random segment here: http://blog.foofactory.fi/2007/01/sorted-out.html -- Sami Siren

Re: Nutch .81: the process to add a new analyzer ?

2007-01-07 Thread Sami Siren
if it is active. http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html -- Sami Siren

Re: List owner?

2007-01-07 Thread Sami Siren
Owner can be reached at [EMAIL PROTECTED] What kind of error are you experiencing (if any)? -- Sami Siren James Phillips wrote: Can somebody tell me how to contact the owner of this list? I have tried on COUNTLESS occasions to remove myself using [EMAIL PROTECTED] but still keep

Re: Nutch .81: the process to add a new analyzer ?

2007-01-07 Thread Sami Siren
if that suits your use case. -- Sami Siren - Original Message - From: Sami Siren [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Sunday, January 07, 2007 5:47 PM Subject: Re: Nutch .81: the process to add a new analyzer ? Chee Wu wrote: Hi, I am trying to add a new

Re: Nutch .81: the process to add a new analyzer ?

2007-01-07 Thread Sami Siren
right results are not that good. identification method that would be helpful. Otherwise, I'd be happy to contribute my pseudo-NB hack and maybe even implement the correct version. Go ahead and attach it to JIRA. I am sure there's plenty of people interested in such thing. -- Sami Siren

Re: How best to add sponsored link support..??

2006-12-19 Thread Sami Siren
Are you looking for something like the Google KeyMatch as described in [1], which was then more or less mimicked in the Nutch web2 module [2], and has since also been released, at least as a lookalike, on Google Code [3]? -- Sami Siren [1] http://www.google.com/enterprise/mini/end_user_features.html [2] http

Re: subcollections

2006-12-16 Thread Sami Siren
anything wrong. If you did exactly those steps then what happens is that the subcollections.xml is read from inside the .job file. You need to rebuild the .job to put new file inside of it. simply do ant and rerun indexing and it should work as expected. -- Sami Siren

Re: error with trunk: linkdb copied to wrong dir

2006-12-14 Thread Sami Siren
-- Sami Siren

Re: subcollections

2006-12-14 Thread Sami Siren
(ie. add a site to a newly created subcollection) I don't want to recrawl it again. I hope it can be done by simply using the existent/crawled data. no need to recrawl, unfortunately you still need to reindex. -- Sami Siren

Re: Fetcher hung on final hurdle - continue?

2006-12-08 Thread Sami Siren
. -- Sami Siren

Re: indexing from local file system -- indexing from HDFS

2006-11-22 Thread Sami Siren
contents of hdfs also. One could also write a protocol-hdfs plugin to do the job. -- Sami Siren [1]http://issues.apache.org/jira/browse/HADOOP-4

Re: Fetch fails

2006-11-22 Thread Sami Siren
- HttpBase.getProtocolOutput(194) | Skipping: http://www.lequipe.fr/ exceeds fetcher.max.crawl.delay, max=30, Crawl-Delay=120 and i can't find this property in nutch-site.xml You need to add it there. <property> <name>fetcher.max.crawl.delay</name> <value> your value here </value> </property> -- Sami Siren

Re: Nutch sessions cookies on https protocol

2006-11-22 Thread Sami Siren
Gavino Marras wrote: Nutch does work with sessions and cookies on https protocol ? No, Nutch does not support cookies nor sessions. -- Sami Siren

Re: Nutch sessions cookies on https protocol

2006-11-22 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Gavino Marras wrote: Nutch does work with sessions and cookies on https protocol ? No, Nutch does not support cookies nor sessions. This is not strictly speaking true ... if you use protocol-httpclient then https, cookies and sessions

Re: Strategic Direction of Nutch

2006-11-13 Thread Sami Siren
://issues.apache.org/jira/browse/NUTCH-395 [2]http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04344.html -- Sami Siren

Re: Strategic Direction of Nutch

2006-11-13 Thread Sami Siren
are even worse. my numbers are with local job runner. I can't imagine how much it took to crawl let say 10mio pages. I'll let you know when mine is finished, just started 3rd segment of size 1 million to test the trunk version (running with local job runner) -- Sami Siren [1]http

Re: Nutch as static exporter?

2006-10-31 Thread Sami Siren
What does "as static exporter" mean? -- Sami Siren

Re: large number of urls from Generator are not fetched?

2006-10-31 Thread Sami Siren
Are you saying that generator generates 200k urls but fetcher fetches around 100k or are you saying that you generate (-topN 20) 200k urls and fetcher fetches only around 100k. If latter and you are running with LocalJobRunner you need to generate with -numFetchers 1. -- Sami Siren

Re: Speeding things up!

2006-10-29 Thread Sami Siren
forgot one important one: set generate.max.per.host to something reasonable so you won't end up fetching urls from only low number of hosts which by default is very slow. -- Sami Siren Sami Siren wrote: Some simple rules for generally speeding things up 1. Crawl only the content you

Re: Nutch slow how to speed up?

2006-10-24 Thread Sami Siren
You are using DistributedSearch? and local filesystem to store index and related data? -- Sami Siren Håvard W. Kongsgård wrote: I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory), searching with queries like 'China Nuclear Forces' takes 20 – 25 s. My config

Re: Nutch slow how to speed up?

2006-10-24 Thread Sami Siren
from this proposal: http://mail-archives.apache.org/mod_mbox/lucene-general/200610.mbox/[EMAIL PROTECTED] -- Sami Siren Håvard W. Kongsgård wrote: DistributedSearch 2x datanodes, 2x Task Trackers Sami Siren wrote: You are using DistributedSearch? and local filesystem to store index

Re: Modifying Nutch core

2006-10-24 Thread Sami Siren
(to compile and to create nutch-x.x.x.job) then: bin/nutch ... -- Sami Siren

Re: Indexing the file system / best approach

2006-10-18 Thread Sami Siren
:/// and to generate a file list to be crawled. This file list is fairly big ~200,000 entries, and with the current 0.8.1 release of nutch the fetcher just freezes right at the end of a crawl. What exactly happens when your fetcher freezes? 200 000 entries is not a big list to be fetched. -- Sami Siren

Re: Lucene query support in Nutch

2006-10-07 Thread Sami Siren
application. I agree also. Different query parsers could perhaps be made pluggable or at least configurable. The current(-alike) implementation could be the default one offered and by configuration one could switch it to intranet mode. Contributions anyone? -- Sami Siren

Re: stop an index server

2006-09-29 Thread Sami Siren
' -shutdown 127.0.0.1 -- Sami Siren Alvaro Cabrerizo wrote: 2006/9/27, Sami Siren [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]: Alvaro Cabrerizo wrote: How could I stop an index server (started with bin/nutch server port index) knowing the port? Thanks in advance

Re: Problem Searching

2006-09-29 Thread Sami Siren
what you need to do is modify the Query. -- Sami Siren Thanks,

Re: stop an index server

2006-09-27 Thread Sami Siren
Alvaro Cabrerizo wrote: How could I stop an index server (started with bin/nutch server port index) knowing the port? Thanks in advance. It does not support such a feature. Can you describe a little bit more what you are trying to accomplish? Something similar to Tomcat's SHUTDOWN? -- Sami

[ANNOUNCE] Nutch 0.8.1 available

2006-09-26 Thread Sami Siren
branch and fixes many serious bugs discovered in previous release. For a list of changes see http://www.apache.org/dist/lucene/nutch/CHANGES-0.8.1.txt A big thanks to everybody who participated and made this release possible. -- Sami Siren

Re: Cannot generate all injected URLS

2006-09-22 Thread Sami Siren
defaults to 2) -- Sami Siren Frank Kempf wrote: Hello, got stuck with generating. Injecting 3200 Urls into the database and generating afterwards leads always to the same result of having 1632 Urls in crawl_generate. (I checked the db and it actually has 3200 entries). No matter if I try -topN

Re: Is that true?

2006-09-18 Thread Sami Siren
Your observations are correct, 0.8 has some serious problems and we'll be putting 0.8.1 out pretty soon to fix also the performance problem you describe. -- Sami Siren 2006/9/18, carmmello [EMAIL PROTECTED]: I have been trying Nutch, since its version 0.3, sometimes with some problems. Now I

Re: log records

2006-09-01 Thread sami siren
Is your environment windows or linux? You are saying that most are not logged - can you please give an example what is logged (and where) and also what is not. Logging in general can be configured by editing conf/log4j.properties -- Sami Siren 2006/9/1, AJ Chen [EMAIL PROTECTED]: When

Re: Is there a way to get Nutch to parse/index by file access directly (not over HTTP)?

2006-08-28 Thread sami siren
could also be successfully used for efficient crawling of SMB, FTP and WebDAV resources. -- Sami Siren 2006/8/27, Sandy Polanski [EMAIL PROTECTED]: This maybe more of a straight Lucene task, but I thought I'd ask anyway. Rather than using Nutch as a crawler, I'd rather just send the Nutch parser

Re: Nutch doesn't dive deeper

2006-08-27 Thread sami siren
text/vnd.wap.wml text/xml text/x-setext I would guess that handling of the text/xhtml+xml MIME type should be done with the HTML parser anyway. -- Sami Siren 2006/8/25, Michael Wechner [EMAIL PROTECTED]: I think the problem is as follows with XHTML files: 2006-08-25 16:06:11,925 WARN

Re: Making crawler stop after all pages are found.

2006-08-27 Thread Sami Siren
The job should terminate on its own, but not as soon as all pages are found - only after -depth iterations. Are you saying it won't honor the -depth parameter? -- Sami Siren Sandy Polanski wrote: Sami, in 0.7.2 my intranet crawling job did terminate on its own. The issue that I described
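
As a toy model of the behaviour described above (a crawl that runs a fixed number of rounds rather than stopping once every page has been seen), here is a short Python sketch. It is not Nutch code: the `crawl` function, the in-memory link graph and the page names are all hypothetical, and the real crawl tool may behave differently.

```python
def crawl(seeds, links, depth):
    """Run exactly `depth` fetch rounds over an in-memory link graph.

    Returns the set of fetched pages and the number of rounds executed.
    """
    fetched = set()
    frontier = set(seeds)
    rounds = 0
    for _ in range(depth):
        rounds += 1
        fetched |= frontier
        # The next round's frontier is every outlink not fetched yet.
        frontier = {t for page in frontier for t in links.get(page, ())} - fetched
    return fetched, rounds


# Hypothetical three-page site: a -> b -> c
links = {"a": ["b"], "b": ["c"], "c": []}
print(crawl(["a"], links, depth=2))  # only a and b are reached in 2 rounds
print(crawl(["a"], links, depth=5))  # all pages found by round 3; rounds 4-5 fetch nothing
```

In this model a depth that is too small leaves pages unfetched, while a larger depth still runs every round even after the frontier is empty.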

Re: Nutch doesn't dive deeper

2006-08-27 Thread sami siren
of mime types Nutch really can handle. Then again those two text type of documents you picked up are quite rare and not mainstream and probably enabling/disabling them doesn't really make any difference in search results. -- Sami Siren

Re: Making crawler stop after all pages are found.

2006-08-26 Thread sami siren
There's no such feature present in Nutch currently. Feel free to open issue (of type new feature) in Nutch Jira and provide a patch or wait until someone else gets to it. -- Sami Siren 2006/8/27, Sandy Polanski [EMAIL PROTECTED]: On my intranet, I have 8100 documents. The nutch crawler
