Re: nutch-2.x with hbase filter option

2014-03-26 Thread alxsss
Thanks Otis. This is what I was looking for. After applying this patch to the current trunk with some modifications I have gora-core-0.4-SNAPSHOT.jar and gora-hbase-0.4-SNAPSHOT.jar. With hbase-0.94.17.jar the inject command gives Exception in thread "main" java.lang.NoSuchMethodError: org.apache.ha

Re: nutch-2.x with hbase filter option

2014-03-27 Thread alxsss
Hi Alparslan, I downloaded the GORA_94 branch and with the libs from it I get 14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: test_urls Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/gora/persistency/StateManager at java.lang.Class.getDeclaredCons

Re: nutch-2.x with hbase filter option

2014-03-31 Thread alxsss
Subject: Re: nutch-2.x with hbase filter option Hi alxsss, On Sat, Mar 29, 2014 at 10:15 PM, wrote: > > I downloaded GORA_94 branch and with libs from it a get >> >> 14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: >> test_urls >

Re: nutch-2.x with hbase filter option

2014-04-09 Thread alxsss
Hi, I was able to fix these errors making some changes to code and using avro-1.7. Now when I run updatedb command it gives 2014-04-09 14:29:36,460 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchMethodError: org.apache.nutch.storage.WebPage.getFetchInterval()I

Re: nutch-2.x with hbase filter option

2014-04-10 Thread alxsss
class? And do you run in local or distributed mode? Thanks, Alparslan On 10-04-2014 01:07, alxsss wrote: > Hi, > > I was able to fix these errors making some changes to code and using > avro-1.7. > Now when I run updatedb command it gives > > 2014-04-09 14:29:36,460 FATAL or

Re: nutch-2.x with hbase filter option

2014-04-11 Thread alxsss
Actually, the patch provided in the original issue https://issues.apache.org/jira/browse/NUTCH-1714 is not completely applicable to the current trunk, because the nutch code has changed since. I modified the original patch so that it applies to the current trunk. I can combine the modified patch with the cha

Re: crawl every 24 hours

2014-05-21 Thread alxsss
Hi, Another way of doing this is to increase db.fetch.interval.default to x years and inject the original seed each time. In this way you will fetch only new pages during those x years, since an injected url's fetch time is set to the current time (I believe; you can double-check it first) and the other fet
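For reference, a minimal nutch-site.xml sketch of the change described above (the ten-year value is an assumption for illustration; db.fetch.interval.default is expressed in seconds):

    <property>
      <name>db.fetch.interval.default</name>
      <!-- assumed example value: roughly ten years, in seconds -->
      <value>315360000</value>
    </property>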

anchor text in content field

2014-06-10 Thread alxsss
Hello, Is there a way to configure nutch not to put anchors in the content field? Thanks. Alex.

updatedb deletes all metadata except _csh_

2014-06-16 Thread alxsss
Hello, I am using nutch-2.x with GORA_97. I noticed that the second updatedb deletes all metadata except _csh_ for pages from the first fetch. Steps to reproduce are the following: 1. inject 2. generate batchId 1 3. fetch batchId 1, which adds some metadata to the mtdt field 4. updatedb batchId 1 5. gene

Re: updatedb deletes all metadata except _csh_

2014-06-16 Thread alxsss
Further investigation shows that DbUpdateReducer calls inlinkedScoreData.clear(); and it calls this function public void readFields(DataInput in) throws IOException { System.out.println("readFields in score datum is called"); score = in.readFloat(); url = Text.readString(in);

Re: updatedb deletes all metadata except _csh_

2014-06-18 Thread alxsss
Hello, I have gora_94 with hbase-0.94.17 and avro-1.7.6. I have investigated further and it turned out that the culprit is not inlinkedScoreData.clear(), and I found another issue in addition to the deletion of custom metadata. For simplicity let's consider only one seed url, let's say m

Re: anchor text in content field

2014-06-18 Thread alxsss
Hi, I went ahead and modified DOMBuilder. The use case is that some silly newspapers put links to all of today's articles at the end of each article. Let's say today there were 3 articles. Two of them are about Obama and one is about J.Lopez. At the end of the article about J.Lopez there are two

Re: updatedb deletes all metadata except _csh_

2014-06-23 Thread alxsss
Hi, So far, this looks like a bug in updatedb when filtering with batchId. I could only find one solution: to check if new pages are in the datastore and, if they are, skip them. Otherwise updatedb with the -all option will also work. Thanks. Alex. -- View this message in context: http://lucene.

Re: updatedb deletes all metadata except _csh_

2014-06-24 Thread alxsss
Hi, I already came up with similar changes to the code as in this patch. My only suggestion for this patch's code is to move the check for whether the url exists in the datastore under if (!additionsAllowed) { return; } and to close the datastore. Thanks. Alex. -Original Message- From
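A hedged sketch of the change being discussed, not the actual patch (the store handle, key, and WebPage constructor are assumptions):

    // Hypothetical sketch: skip rows that already exist in the datastore,
    // so updatedb filtered by batchId cannot wipe their metadata.
    if (page == null) {                  // row arrived via inlinks only
      if (!additionsAllowed) {
        return;
      }
      WebPage existing = store.get(key); // assumed datastore lookup
      if (existing != null) {
        return;                          // already stored; leave its metadata alone
      }
      page = new WebPage();              // genuinely new row
    }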

Re: How to reduce the unfetched urls?

2014-08-08 Thread alxsss
What is the status of one of the unfetched urls in the db? -Original Message- From: adu To: user Sent: Thu, Aug 7, 2014 8:04 pm Subject: How to reduce the unfetched urls? Hi all, I use 1 urls as the seeds, and crawl with depth 1. The result I got is only 2000 urls are fetch
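One way to check a single url's status, hedged (the url is a placeholder; in 2.x the readdb -url form appears elsewhere in these threads, while in 1.x the crawldb path precedes the flag):

    bin/nutch readdb -url http://www.example.com/page.html                  # 2.x
    bin/nutch readdb crawl/crawldb -url http://www.example.com/page.html    # 1.x, path assumed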

parsing mime-type text/html with parse-tika

2015-03-31 Thread alxsss
Hello, I am trying to use the nutch-2.x trunk to parse text/html types with tika. I get the error "parser for text/html not found". I see that the parse-tika code was changed. These lines // get the right parser using the mime type as a clue

Re: How to investigate recrawl issue

2015-04-29 Thread alxsss
There must be some config variable that allows setting timeModified to the current date at inject time. You need to inject the home page url on each run. hth Alex. -Original Message- From: Matteo Diarena To: user Sent: Wed, Apr 29, 2015 1:46 pm Subject: How to investigate recrawl issue Dea
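The re-inject step suggested above, sketched (the urls/ seed dir matches the one used elsewhere in these threads; whether injecting resets the fetch time should be verified, as noted earlier):

    bin/nutch inject urls/   # re-injecting the seed sets its fetch time to now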

using less resources

2012-05-22 Thread alxsss
Hello, As far as I understand, nutch recrawls urls when their fetch time has passed the current time, regardless of whether those urls were modified or not. Is there any initiative on restricting recrawls to only those urls that have a modified time (MT) greater than the old MT? In detail: if nutch has crawled a

nutch-2.0 updatedb and parse commands

2012-06-18 Thread alxsss
Hello, It seems to me that all the options to the updatedb command that nutch 1.4 has have been removed in nutch-2.0. I would like to know if this was done purposefully or if they will be added later. Also, how can I create multiple docs using the parse command? It seems there are not sufficient arguments to par

Re: nutch-2.0 updatedb and parse commands

2012-06-19 Thread alxsss
Hi Lewis, In the 1.X version there is a -noAdditions option to updatedb and an -adddays option to the generate command. How can something similar be done in the 2.X version? Here, http://wiki.apache.org/nutch/Nutch2Roadmap, it is stated "Modify code so that parser can generate multiple documents whi

Re: using less resources

2012-06-20 Thread alxsss
I was thinking of using the Last-Modified header, but it may be absent. In that case we could use the signature of urls at indexing time. I took a look at the code; it seems it is implemented but not working. I tested nutch-1.4 with a single url; solrindexer always sends the same number of documents to
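A minimal self-contained sketch of the signature idea above (how the old and new signatures are obtained from the row is left out, since the accessors depend on the version):

    import java.util.Arrays;

    class SignatureCheck {
      // Hypothetical helper: true when the stored and current signatures
      // match, i.e. the content is unchanged and need not be re-indexed.
      static boolean unchanged(byte[] prevSig, byte[] curSig) {
        return prevSig != null && Arrays.equals(prevSig, curSig);
      }
    }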

parse and solrindex in nutch-2.0

2012-06-25 Thread alxsss
Hello, I have tested nutch-2.0 with hbase and mysql, trying to index only one url with depth 1. I tried to fetch an html tag value and parse it into the metadata column of the webpage object by adding a parse-tag plugin. I saw there is no metadata member variable in the Parse class, so I used the putToMetadata fu

Re: parse and solrindex in nutch-2.0

2012-07-02 Thread alxsss
Hi, Thank you for the clarifications. Regarding the metadata, what would be a proper way of parsing and indexing multivalued tags in nutch-2.0 then? Thanks. Alex. -Original Message- From: Ferdy Galema To: user Sent: Wed, Jun 27, 2012 1:20 am Subject: Re: parse and solrindex in nutch-2.

Re: parse and solrindex in nutch-2.0

2012-07-03 Thread alxsss
Hi, I was planning to parse img tags from a url's content and put them in the metadata field of the Webpage storage class in nutch-2.0, to retrieve them later in the indexing step. However, since there is no metadata-typed variable in the Parse class (compare with outlinks) this cannot be done in nutch 2.0

updatedb in nutch-2.0 with mysql

2012-07-24 Thread alxsss
Hello, I am testing nutch-2.0 with mysql storage and 1 url. I see that the updatedb command does not do anything. It does not add outlinks to the table as new urls, and I do not see any error messages in hadoop.log. Here are the log entries without the plugin load info: INFO crawl.DbUpdaterJob -

Re: updatedb in nutch-2.0 with mysql

2012-07-25 Thread alxsss
Not sure if I understood correctly. I did Counters c = currentJob.getCounters(); System.out.println(c.toString()); With Mysql DbUpdaterJob: starting Counters: 20 DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters FILE_BYTES_READ=878298 FILE_BYTES_WR

Re: updatedb in nutch-2.0 with mysql

2012-07-26 Thread alxsss
I queried the webpage table and there are a few links in the outlinks column. As I noted in the original letter, updatedb works with Hbase. This is the counters output in the case of Hbase. bin/nutch updatedb DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters F

Re: updatedb in nutch-2.0 with mysql

2012-07-27 Thread alxsss
I tried your suggestion with sql server and everything works fine. The issue that I had was with mysql though. mysql Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1 After I restarted the mysql server and added the mysql root user to gora.properties, updatedb adds outlinks as new url

Re: Nutch 2.0 & Solr 4.0 Alpha

2012-07-29 Thread alxsss
Which storage do you use? Try solrindex with the -reindex option. -Original Message- From: X3C TECH To: user Sent: Sun, Jul 29, 2012 10:58 am Subject: Re: Nutch 2.0 & Solr 4.0 Alpha Forgot to do Specs VMWare Machine with CentOS 6.3 On Sun, Jul 29, 2012 at 1:53 PM, X3C TECH wrote: > H

Re: Why won't my crawl ignore these urls?

2012-07-30 Thread alxsss
Why don't you test your regex to see if it really matches the urls you want to eliminate? It seems to me that your regex does not eliminate the type of urls you specified. Alex. -Original Message- From: Ian Piper To: user Sent: Mon, Jul 30, 2012 1:52 pm Subject: Re: Why won't my cra
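One way to test a regex-urlfilter rule from the command line, hedged (this assumes the URLFilterChecker utility shipped with 1.x of that era; the url is a placeholder):

    echo "http://www.example.com/page?sid=1" \
      | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    # prints +url if the combined filters accept it, -url if they reject it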

Re: Different batch id

2012-07-31 Thread alxsss
Hi, Most likely you ran the generate command a few times and did not run updatedb. So each generate command assigned a different batchId to its own set of urls. Alex. -Original Message- From: Bai Shen To: user Sent: Tue, Jul 31, 2012 10:26 am Subject: Re: Different batch id Is there

updatedb fails to put UPDATEDB_MARK in nutch-2.0

2012-07-31 Thread alxsss
Hello, I noticed that the updatedb command must remove the gen, parse and fetch marks and put the UPDATEDB_MARK mark, according to the code Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page); if (mark != null) { Mark.UPDATEDB_MARK.putMark(page, mark); } in DbUpdateReducer.java. However,

Re: Nutch 2 solrindex

2012-08-01 Thread alxsss
This is directly related to the thread I opened yesterday. I think this is a bug, since updatedb fails to put the update mark. I have fixed it by modifying the code. I have a patch, but I am not sure if I can send it as an attachment. Alex. -Original Message- From: Bai Shen To: user Sent: W

Re: Nutch 2 solrindex

2012-08-02 Thread alxsss
The current code putting updb_mrk in DbUpdateReducer is as follows Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page); if (mark != null) { Mark.UPDATEDB_MARK.putMark(page, mark); } The mark is always null, regardless of whether there is a PARSE_MARK or not. This function calls public Utf8
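A hedged sketch of one possible fix, writing the mark unconditionally (the value passed to putMark is an assumption; the actual patch may differ):

    // Hypothetical fix sketch: remove the old marks, then always put
    // the updatedb mark instead of only when the parse mark was present.
    Mark.GENERATE_MARK.removeMarkIfExist(page);
    Mark.FETCH_MARK.removeMarkIfExist(page);
    Mark.PARSE_MARK.removeMarkIfExist(page);
    Mark.UPDATEDB_MARK.putMark(page, new Utf8(batchId)); // value assumed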

Re: Different batch id

2012-08-02 Thread alxsss
Hi, I have found out that what happens after bin/nutch generate -topN 1000 is that only 1000 of the urls are marked with gnmrk. Then bin/nutch fetch -all skips all urls that do not have gnmrk, according to the code Utf8 mark = Mark.GENERATE_MARK.checkMark(page); if (!NutchJob.shouldProc
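The cycle this implies, sketched as commands (the batch id is a placeholder taken from the generate log, not a real value):

    bin/nutch generate -topN 1000   # log prints the generated batch id
    bin/nutch fetch <batchId>       # fetch only the urls marked with that id
    bin/nutch fetch -all            # or fetch every url carrying a generate mark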

Re: Nutch 2 encoding

2012-08-09 Thread alxsss
Hi, I use hbase-0.92.1 and do not have problems with utf-8 chars. What exactly is your problem? Alex. -Original Message- From: Ake Tangkananond To: user Sent: Thu, Aug 9, 2012 11:12 am Subject: Re: Nutch 2 encoding Hi, I'm debugging. I inserted a code to print out the encoding her

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-11 Thread alxsss
Hello, I am getting the same error and here is the log 2012-08-11 13:33:08,223 ERROR http.Http - Failed with the following error: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputSt

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-11 Thread alxsss
I was able to do jstack just before the program exited. The output is attached. -Original Message- From: alxsss To: user Sent: Sat, Aug 11, 2012 2:17 pm Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Hello, I am getting the same error and here is

updatedb error in nutch-2.0

2012-08-12 Thread alxsss
Hello, I get the following error when I do bin/nutch updatedb in nutch-2.0 with hbase java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54) at o

Re: updatedb error in nutch-2.0

2012-08-13 Thread alxsss
I found out that the key sent to unreverseUrl in DbUpdateMapper.map was ":index.php/http". This happened at depth 3, and I checked the seed file; there was no line in the form of http:/index.php Thanks. Alex. -Original Message- From: Ferdy Galema To: user Sent: Mon, Aug 13, 2012 1:5

Re: nutch 2.0 with hbase 0.94.0

2012-08-13 Thread alxsss
did you delete the old hbase jar from the lib dir? Alex. -Original Message- From: Lewis John Mcgibbney To: user Sent: Mon, Aug 13, 2012 10:16 am Subject: Re: nutch 2.0 with hbase 0.94.0 Nutch contains no knowledge of which specific version of a backend you are using. This is however

Re: nutch 2.0 with hbase 0.94.0

2012-08-13 Thread alxsss
I tried to upgrade to hbase-0.94.0 from hbase-0.92.1. I started hbase-0.94.0 and forgot to replace hbase-0.92.1.jar with the new one :). With this config inject worked fine. But when I replaced the old jar (hbase-0.92.1.jar) with the new one, hbase-0.94.0.jar, I get the same error as you. Hope this wil

updatedb goes over all urls in nutch-2.0

2012-08-17 Thread alxsss
Hi, I noticed that the updatedb command goes over all urls, even if they have been updated in the previous generate, fetch, updatedb stages. As a result updatedb takes a long time, depending on the number of rows in the datastore. I thought maybe this is redundant and it should be restricted to not update

fetcher fails on connection error in nutch-2.0 with hbase

2012-08-19 Thread alxsss
After fetching for about 18 hours the fetcher throws this error: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.hadoop.net.SocketI

speed of fetcher in nutch-2.0

2012-08-23 Thread alxsss
Hello, I am using nutch-2.0 with hbase-0.92.1. I noticed that in depths 1, 2, 3 the fetcher was fetching around 20K urls per hour. In depth 4 it fetches only 8K urls per hour. Any ideas what could cause this decrease in speed? I use local mode with 10 threads. Thanks. Alex.

Re: recrawl a URL?

2012-08-24 Thread alxsss
This will work only for urls that have If-Modified-Since headers. But most urls do not have this header. Thanks. Alex. -Original Message- From: Max Dzyuba To: Markus Jelsma ; user Sent: Fri, Aug 24, 2012 9:02 am Subject: RE: recrawl a URL? Thanks again! I'll have to test it

nutch-2.0 --Attempting to finish item from unknown queue

2012-08-26 Thread alxsss
Hello, I use nutch-2.0 with hbase-0.92.1 in local mode. After fetching for about 20 hours, I see the error WARN fetcher.FetcherJob - Attempting to finish item from unknown queue: FetchItem . . . followed by java.net.ConnectException: Connection refused at sun.nio.ch.Socke

Re: Nutch 2 solrindex fails with no error

2012-09-17 Thread alxsss
You can use the -reindex option, since the updt markers are not set properly in the 2.0 release. -Original Message- From: Bai Shen To: user Sent: Mon, Sep 17, 2012 10:16 am Subject: Re: Nutch 2 solrindex fails with no error The problem appears to be that Nutch is not sending anything to s

updatedb in nutch-2.0 increases fetch time of all pages

2012-09-17 Thread alxsss
Hello, updatedb in nutch-2.0 increases the fetch time of all pages, regardless of whether they have already been fetched or not. For example, if updatedb is applied in depth 1 and page A is fetched and its fetchTime is 30 days from now, then as a result of running updatedb in depth 2 the fetch time of page A

Re: Building Nutch 2.0

2012-10-01 Thread alxsss
It seems to me that if you run nutch in deploy mode and make changes to config files then you need to rebuild the .job file again, unless you specify the config_dir option in the hadoop command. Alex. -Original Message- From: Christopher Gross To: user Sent: Mon, Oct 1, 2012 1:22 pm Subject: Re:

nutch-2.0 generate in deploy mode

2012-10-01 Thread alxsss
Hello, I use nutch-2.0 with hadoop-0.20.2. The bin/nutch generate command takes 87% of cpu in deploy mode versus 18% in local mode. Any ideas on how to fix this issue? Thanks. Alex.

Re: Building Nutch 2.0

2012-10-02 Thread alxsss
According to the code in bin/nutch, if you have a .job file in your NUTCH_HOME then you are running in deploy mode. If there is no .job file then you are running in local mode, so you do not need to rebuild nutch each time you change conf files. Alex. -Original Message- From: Christ
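A paraphrased sketch of that check, not the verbatim script:

    # deploy mode if a job file is present, local mode otherwise
    if ls "$NUTCH_HOME"/*nutch*.job >/dev/null 2>&1; then
      mode=deploy   # submit the .job file via hadoop
    else
      mode=local    # run classes directly from runtime/local
    fi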

Re: Error parsing html

2012-10-02 Thread alxsss
Can you provide a few lines of the log or the url that gives the exception? -Original Message- From: CarinaBambina To: user Sent: Tue, Oct 2, 2012 2:04 pm Subject: Re: Error parsing html Thanks for the reply. I'm now using Nutch 1.5.1, but nothing has changed so far. While debugging

Re: Error parsing html

2012-10-09 Thread alxsss
I checked the urls you provided with parsechecker and they are parsed correctly. You can check yourself by doing bin/nutch parsechecker yoururl. In your implementation, can you check if the segment dir has the correct permissions? Alex. -Original Message- From: CarinaBambina To: user Se

nutch-2.0-fetcher fails in reduce stage

2012-10-15 Thread alxsss
Hello, I am trying to use nutch-2.0, hadoop-1.0.3, hbase-0.92.1 in pseudo-distributed mode with iptables turned off. As soon as the map reaches 100%, the fetcher works for a few minutes and fails with the error java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConne

Re: nutch-2.0-fetcher fails in reduce stage

2012-10-16 Thread alxsss
Hello, Today I closely followed all the hbase and hadoop logs. As soon as the map reached 100%, reduce was at 33%. Then when reduce reached 66%, I saw the following error in hadoop's datanode log 2012-10-16 22:44:54,634 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1

Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread alxsss
Hello, I think the problem is with the storage, not nutch itself. It looks like generate cannot read the status or fetch time (or gets null values) from mysql. I had a bunch of issues with the mysql storage and switched to hbase in the end. Alex. -Original Message- From: Sebastian Nagel

Re: Same pages crawled more than once and slow crawling

2012-10-19 Thread alxsss
Hello, I meant that it could be a gora-mysql problem. In order to test it, you can run nutch in local mode with generator debugging enabled. Put this line, log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout, in your conf/log4j.properties and run the crawl cycle with updatedb. If gora-mysql

Re: Image search engine based on nutch/solr

2012-10-21 Thread alxsss
Hello, I have also written this kind of plugin. But instead of putting the thumbnail files in the solr index, they are put in a folder. Only the filenames are kept in the solr index. I wonder what the advantage of putting thumbnail files in the solr index is. Thanks in advance. Alex. -Origin

Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails

2012-10-31 Thread alxsss
Hi, If you change this line log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout in runtime/local/conf/log4j.properties to log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout you should see more info about the parse process in the logs. Alex. -Original Message- F

Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails

2012-11-01 Thread alxsss
< lists.digitalpeb...@gmail.com> wrote: > Hi > > Yes please do open an issue. The docs should be parsed in one go and I > suspect (yet another) issue with the SQL backend > > Thanks > > J > > On 1 November 2012 13:48, kiran chitturi > wrote: > > > Than

Re: Access crawled content or parsed data of previous crawled url

2012-11-28 Thread alxsss
It is not clear what you are trying to achieve. We have done something similar with regard to indexing img tags. We retrieve img tag data while parsing the html page and keep it in metadata, and when parsing the img url itself we create a thumbnail. hth. Alex. -Original Message- From: Jorg

Re: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread alxsss
Hi, Unfortunately, my employer does not want me to disclose details of the plugin at this time. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez To: user Sent: Wed, Nov 28, 2012 6:20 pm Subject: Re: Access crawled content or parsed data of previous crawled url

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-04 Thread alxsss
move or copy that jar file to local/lib and try again. hth. Alex. -Original Message- From: Arcondo To: user Sent: Fri, Jan 4, 2013 2:55 am Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hope that now you can see them Plugin folder

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-04 Thread alxsss
Which version of nutch is this? Did you follow the tutorial? I can help you if you provide all the steps you did, starting with downloading nutch. Alex. -Original Message- From: Arcondo Dasilva To: user Sent: Fri, Jan 4, 2013 1:23 pm Subject: Re: Native Hadoop library not loaded

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-07 Thread alxsss
Hi, You can unjar the jar file and check if the class that parse complains about is inside it. You can also try to put the content of the jar file under local/lib. Maybe there is some read restriction. If this does not help, I can only suggest starting again with a fresh copy of nutch. Alex. ---

Re: Image search engine based on nutch/solr

2013-01-10 Thread alxsss
at lead us into this approach, any comments > or suggestions are welcome from Alex or anyone else. > > Greetings, > > On Oct 21, 2012, at 10:51 PM, > alxsss@ > wrote: > >> Hello, >> >> I have also written this kind of plugin. But instead of putting th

Re: nutch 2.x recrawl re-crawl

2013-01-14 Thread alxsss
I think there is no need for a new plugin or anything like that. If you know the list of news urls, you need to inject them each cycle in order to fetch them and their new inlinks, since when you inject a url its fetch time is set to the current time. Alex. -Original Message- From:

nutch/util/NodeWalker class is not thread safe

2013-01-16 Thread alxsss
Hello, I use this class NodeWalker at src/java/org/apache/nutch/util/NodeWalker.java in one of our plugins. I noticed this comment: "Currently this class is not thread safe. It is assumed that only one thread will be accessing the NodeWalker at any given time." above the class definition.

Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread alxsss
I see that inlinks are saved as ol in hbase. Alex. -Original Message- From: kiran chitturi To: user Sent: Wed, Jan 30, 2013 9:31 am Subject: Re: Nutch 2.0 updatedb and gora query Link to the reference ( http://lucene.472066.n3.nabble.com/Inlinks-not-being-saved-in-the-databas

Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread alxsss
What do you call inlinks? For mysite.com, I call inlinks all urls such as mysite.com/myhtml1.html, mysite.com/myhtml2.html, etc. Currently they are saved as ol in hbase. From the hbase shell do get 'webpage', 'com.mysite:http/' and check what the ol family looks like. I have these config db.ignore

Re: Nutch 1.6 +solr 4.1.0

2013-02-06 Thread alxsss
Hi, Not sure about solrdedup, but solrindex worked for me in nutch-1.4 with solr-4.1.0. Alex. -Original Message- From: Lewis John Mcgibbney To: user Sent: Wed, Feb 6, 2013 6:13 pm Subject: Re: Nutch 1.6 +solr 4.1.0 Hi, We are not good to go with Solr 4.1 yet. There are chan

Re: Nutch 2.1 + HBase cluster settings

2013-02-06 Thread alxsss
Hi, So you do not run hadoop, and the nutch job works in distributed mode? Thanks. Alex. -Original Message- From: k4200 To: user Sent: Wed, Feb 6, 2013 7:43 pm Subject: Re: Nutch 2.1 + HBase cluster settings Hi Lewis, There seems to be a bug in HBase 0.90.4 library, which comes

Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
Are you saying that your sites have the form siteA.mydomain.com, siteB.mydomain.com, siteC.mydomain.com? Alex. -Original Message- From: mbehlok To: user Sent: Wed, Feb 13, 2013 11:05 am Subject: Nutch identifier while indexing. Hello, I am indexing 3 sites: SiteA SiteB SiteC

nutch cannot retrive title and inlinks of a domain

2013-02-13 Thread alxsss
Hello, I noticed that nutch cannot retrieve the title and inlinks of one of the domains in the seed list. However, if I run identical code from the server where this domain is hosted, then it correctly parses it. The surprising thing is that in both cases this url has status: 2 (status_fetched) pa

Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
The only suggestion I have is that you can index the site param at the end of the urls as a separate field and do a facet search in solr on that param's values. Alex. -Original Message- From: mbehlok To: user Sent: Wed, Feb 13, 2013 12:20 pm Subject: Re: Nutch identifier

Re: nutch cannot retrive title and inlinks of a domain

2013-02-13 Thread alxsss
Hi, I noticed that for other urls in the seed, inlinks are saved as ol. I checked the code and figured out that this is done by the part that saves anchors. So, in my case inlinks are saved as anchors in the field ol in hbase. But, for one of the urls, title and inlinks are not retrieved, alt

fields in solrindex-mapping.xml

2013-02-14 Thread alxsss
Hello, I see that there are fields in addition to the title, host and content ones in nutch-2.x's solrindex-mapping.xml. I thought tstamp may be needed for sorting documents. What about the other fields: segment, boost and digest? Can

Re: fields in solrindex-mapping.xml

2013-02-15 Thread alxsss
Hi Lewis, If I exclude one of the fields tstamp, digest, and boost from solrindex-mapping and schema.xml, solrindex gives the error SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=com.yahoo:http/] unknown field 'tstamp' for each of the above fields, except segment. Alex. -Origi

Re: fields in solrindex-mapping.xml

2013-02-16 Thread alxsss
-Original Message- From: Lewis John Mcgibbney To: user Sent: Fri, Feb 15, 2013 4:21 pm Subject: Re: fields in solrindex-mapping.xml Hi Alex, OK so we can certainly remove segment from 2.x solr-index-mapping.xml. It would however be nice to replace this with the appropriate bat

Re: fields in solrindex-mapping.xml

2013-02-16 Thread alxsss
Hi Lewis, Why do we need to include digest, tstamp, boost and batchid fields in solrindex? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney To: user Sent: Fri, Feb 15, 2013 4:21 pm Subject: Re: fields in solrindex-mapping.xml Hi Alex, OK so we can certainly remove

Re: fields in solrindex-mapping.xml

2013-02-16 Thread alxsss
Do you mean they help when sharding? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney To: user Sent: Sat, Feb 16, 2013 10:58 am Subject: Re: fields in solrindex-mapping.xml In short, it helps with searching when you can slice your data using these fields On Satur

Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
Hi, This is because fetch's mapper goes over all records and selects those that have the given batchId. Currently the mappers of all nutch commands do not have filters. It would be interesting to know if you can select records with a given batchId in cassandra without iterating over all records. Alex

Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
The generator also does not have filters. Its mapper goes over all records as far as I know. If you use hadoop you can see how many records go as input to mappers. Also see this https://issues.apache.org/jira/browse/GORA-119 Alex. -Original Message- From: Roland To: user S

Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
Hi, Are those filters applied to all data after it is selected from hbase, or sent to hbase as filters to select a subset of all hbase records? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney To: user Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal netw

Re: parsechecker and redirection

2013-03-25 Thread alxsss
Hello, I would like to let you know that currently nutch-2.x does not index redirected pages, regardless of whether they are parsed or not. Thanks. Alex. -Original Message- From: Sebastian Nagel To: user Sent: Mon, Mar 25, 2013 3:52 pm Subject: Re: parsechecker and redirection H

Re: error using generate in 2.x

2013-03-29 Thread alxsss
Hi, It seems that the trunk has a few bugs. I found out that readdb -url urlname also gives errors. Thanks. Alex. -Original Message- From: kaveh minooie To: user Sent: Fri, Mar 29, 2013 1:53 pm Subject: Re: error using generate in 2.x Hi lewis the mapping file that I am using

Re: error using generate in 2.x

2013-03-29 Thread alxsss
Yes, with hbase. Here is the error 13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f closed 13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader: java.lang.NullPointerException at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398) at

Re: Only recrawl the pages with http code=500

2013-04-10 Thread alxsss
Hi, == A hbase query with a filter of http status code 500 will give you the list of urls with status code 500. == Could you please let me know how to do this? I was trying to get an answer about this kind of selection on the hbase mailing list without success. Thanks. Alex. -Original Me
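A hedged Java sketch of such a query against the 0.92/0.94 client API (the column family, qualifier, and value encoding are assumptions; check the gora-hbase mapping for where the status actually lives in your schema):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical server-side filter: only rows whose status column
    // matches are returned, instead of scanning every row client-side.
    Scan scan = new Scan();
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("f"),     // column family (assumed)
        Bytes.toBytes("st"),    // qualifier holding the status (assumed)
        CompareOp.EQUAL,
        Bytes.toBytes(500)));   // value encoding assumed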

Re: Only recrawl the pages with http code=500

2013-04-11 Thread alxsss
Hi, As far as I know a pig script will run mapreduce jobs which will iterate over all records, and if the size of the table is huge it will take a lot of time. Regarding filters, I played around but so far was unable to use filters for markers that do not have values. For example, I need to

custom solrindex in nutch-1.6

2013-04-29 Thread alxsss
Hello, I am trying to write a custom solrindex class in nutch-1.6 to create multiple documents from one url. I have a map function inside IndexerMapReduce.java to send the key and docs to the reducer. However, the reducer gets the keys but the docs have all fields (like id, etc.) as null, although map correctly assigns th

normalize gives malformed url exception

2013-05-07 Thread alxsss
Hello, I use nutch-1.6 and the following code try { url = new URL(base, url); imgUrl = url.toString(); // Normalize and replace spaces with %20 url = url.replaceAll("\\s", "%20"); url = normalizers.normalize(url, URL
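A hedged sketch of a fix for the snippet above, assuming the exception comes from reusing one variable as both a java.net.URL and a String (base, url, and normalizers come from the surrounding code; URLNormalizers.SCOPE_DEFAULT is the stock default scope):

    // Keep the resolved URL object and the String in separate variables.
    URL resolved = new URL(base, url);   // resolve relative to the base url
    String imgUrl = resolved.toString().replaceAll("\\s", "%20"); // escape spaces
    imgUrl = normalizers.normalize(imgUrl, URLNormalizers.SCOPE_DEFAULT);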

problem running custom nutch command in deploy mode

2013-05-14 Thread alxsss
Hello, I have created a custom nutch solr indexer as a jar file and put it under nutch_home/lib. It runs successfully in local mode. In deploy mode it gives the following error, even though the same jar file is included in the job file and lib/ java.lang.RuntimeException: java.io.IOException: WritableName can'

Re: problem running custom nutch command in deploy mode

2013-05-15 Thread alxsss
Hi, I build these classes in a separate contrib folder. This is not a plugin with an extension point; this is, let's say, something like the solrindex files with different functionality. I run ant from the contrib dir. It generates a jar file. I put that jar file under lib and run ant from nutch_home. It g

Re: error crawling

2013-05-17 Thread alxsss
What if you do bin/nutch inject urls/ ? -Original Message- From: Christopher Gross To: user Sent: Fri, May 17, 2013 11:26 am Subject: error crawling I'm having trouble getting my nutch working. I had it on another server and it was working fine. I migrated it to a new server,

Re: error crawling

2013-05-22 Thread alxsss
What are you trying to achieve? What is the reason for running inject with a crawlId? -Original Message- From: Christopher Gross To: user Sent: Wed, May 22, 2013 12:25 pm Subject: Re: error crawling Sure, I'll try. I'm also confused about this -- I had it working at one point, a

Re: error crawling

2013-05-23 Thread alxsss
I do not think that script works in nutch-2.x. For example I see this $bin/nutch generate $commonOptions $CRAWL_ID/crawldb $CRAWL_ID/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter There is no crawldb or segments in nutch-2.x. When you use a crawlId in the inject command it creates a

Re: error crawling

2013-05-24 Thread alxsss
Can you send the script? Also, are you running it in deploy or local mode? -Original Message- From: Christopher Gross To: user Sent: Fri, May 24, 2013 9:43 am Subject: Re: error crawling Right. "runbot" is the old one. They don't package something with nutch anymore like that.

Re: error crawling

2013-05-28 Thread alxsss
Hi, I have seen this script. I thought you had modified it. It will not run even if you remove the crawlId, because it does not capture the batchId from the generate command. Alex. -Original Message- From: Christopher Gross To: user Sent: Tue, May 28, 2013 5:20 am Subject: Re: error cr

Re: Running multiple nutch jobs to fetch a same site with millions of pages

2013-07-01 Thread alxsss
Hi, Try to run more than one reducer by adding the numtask param to the fetch command. hth, Alex. -Original Message- From: weishenyun To: user Sent: Mon, Jul 1, 2013 7:44 pm Subject: Running multiple nutch jobs to fetch a same site with millions of pages Hi, I tried to
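Sketched (hedged: the exact flag spelling should be confirmed against the usage string that bin/nutch fetch prints when run with no arguments):

    bin/nutch fetch -all -threads 10 -numTasks 4   # -numTasks sets the reduce task count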
