RE: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread Yossi Tamari
I struggled with this as well. Eventually I moved to Elasticsearch, which is much easier. What I did manage to find out is that in newer versions of Solr you need to use ZooKeeper to update the conf file. See https://stackoverflow.com/a/43351358. -Original Message- From: Pau Paches
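
A minimal sketch of what "use ZooKeeper to update the conf file" can look like with SolrCloud; the ZooKeeper address, config-set name and paths are illustrative, not from the original thread:

    # upload the Nutch schema/config directory to ZooKeeper as config set "nutch"
    bin/solr zk upconfig -z localhost:9983 -n nutch -d /path/to/nutch_solr_conf
    # reload the collection so it picks up the updated config
    curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=nutch"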

RE: After Parse extension point

2017-07-27 Thread Yossi Tamari
Hi Zoltan, I think what you want is an HtmlParseFilter - https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html. I recommend you read https://florianhartl.com/nutch-plugin-tutorial.html, and take a look at one of the included HtmlParseFilters, e.g.
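
A bare-bones HtmlParseFilter skeleton following the 1.13 interface linked above; the package and class names are placeholders, and the usual plugin.xml/build.xml wiring from the tutorial is still needed:

    package org.example.parse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class MyParseFilter implements HtmlParseFilter {

      private Configuration conf;

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Inspect the DOM (doc) or the raw content here and enrich or replace
        // the parse results before returning them.
        return parseResult;
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }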

RE: nutch 1.x tutorial with solr 6.6.0

2017-07-12 Thread Yossi Tamari
Hi Pau, I think the tutorial is still not fully up to date. If you haven't already, you should update the solr.* properties in nutch-site.xml (and run `ant runtime` again to update the runtime). Then the command for the tutorial should be: bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir
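
For reference, overriding a solr.* property in nutch-site.xml looks like the following; the URL is an example value for a local Solr core:

    <property>
      <name>solr.server.url</name>
      <value>http://localhost:8983/solr/nutch</value>
    </property>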

RE: Nutch 1.13 parsing links but ignoring them?

2017-06-29 Thread Yossi Tamari
I figured it out myself. The problem was with db.max.outlinks.per.page having a default value of 100. From: Yossi Tamari [mailto:yossi.tam...@pipl.com] Sent: 26 June 2017 19:26 To: user@nutch.apache.org Subject: Nutch 1.13 parsing links but ignoring them? I'm seeing many cases where
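
The corresponding override goes in nutch-site.xml; a negative value means "process all outlinks", and the default of 100 is what hid the links here:

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
    </property>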

Wrong FS exception in Fetcher

2017-04-30 Thread Yossi Tamari
Hi, I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pseudo-distributed mode. Running the command: Deploy/bin/crawl urls crawl 2 The Injector and Generator run successfully, but in the Fetcher I get the following error: 17/04/30 08:43:48 ERROR fetcher.Fetcher: Fetcher:

IllegalStateException in CleaningJob on ElasticSearch 2.3.3

2017-05-16 Thread Yossi Tamari
Hi, When running 'crawl -i', I get the following exception in the second iteration, during the CleaningJob: Cleaning up index if possible /data/apache-nutch-1.13/runtime/deploy/bin/nutch clean crawl-inbar/crawldb 17/05/16 05:40:32 INFO indexer.CleaningJob: CleaningJob: starting at

Nutch 1.13 parsing links but ignoring them?

2017-06-26 Thread Yossi Tamari
I'm seeing many cases where ParserChecker finds outlinks in a document, but when running crawl on this document they do not appear in the crawl DB at all (and are not indexed). My URL filters are trivial as far as I can tell, and the missing links are not special in any way that I can see. For

RE: Wrong FS exception in Fetcher

2017-05-02 Thread Yossi Tamari
Thanks Sebastian, The output with set -x is below. I'm new to Nutch and was not aware that 1.13 requires Hadoop 2.7.2 specifically. While I see it now in pom.xml, it may be a good idea to document it in the download page and provide a download link (since the Hadoop releases page contains

RE: Wrong FS exception in Fetcher

2017-05-03 Thread Yossi Tamari
Hi, Setting the MapReduce framework to YARN solved this issue. Yossi. From: Yossi Tamari [mailto:yossi.tam...@pipl.com] Sent: 30 April 2017 17:04 To: user@nutch.apache.org Subject: Wrong FS exception in Fetcher Hi, I'm trying to run Nutch 1.13 on Hadoop 2.8.0
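
Setting the framework to YARN is a standard Hadoop configuration change, typically made in mapred-site.xml; shown here for reference, not quoted from the thread:

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>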

RE: Wrong FS exception in Fetcher

2017-05-02 Thread Yossi Tamari
Yossi Tamari wrote: > Thanks Sebastian, > The output with set -x is below. I'm new to Nutch and was not aware that 1.13 requires Hadoop 2.7.2 specifically. While I see it now in pom.xml, it may be a good idea to document it in the download page and provide a downl

RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

2017-09-20 Thread Yossi Tamari
Hi Hiran, I recently needed the documents you requested myself, and the two below were the most helpful. Keep in mind that like most Nutch documentation, they are not totally up to date, so you need to be a bit flexible. The most important difference for me was getting the source from GitHub

RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

2017-09-22 Thread Yossi Tamari
Fork from https://github.com/apache/nutch. -Original Message- From: Hiran CHAUDHURI [mailto:hiran.chaudh...@amadeus.com] Sent: 22 September 2017 12:27 To: user@nutch.apache.org Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading? >Hi Hiran, > >Your code call

RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

2017-09-22 Thread Yossi Tamari
Hi Hiran, Your code calls setURLStreamHandlerFactory, the documentation for which says "This method can be called at most once in a given Java Virtual Machine". Isn't this going to be a problem?

RE: Exchange documents in indexing job

2017-08-23 Thread Yossi Tamari
I don't see a good way to do it in configuration, but it should be very easy to override the write method in the two plugins to have it check the mime type and decide whether to call super.write or not. (One terrible way to do it with configuration only would be to configure only one of the
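
A hypothetical sketch of the suggested override against the pre-1.15 IndexWriter plugins; the subclass name, the "type" field name and the exact NutchDocument accessor are assumptions to verify, not code from the thread:

    package org.example.indexer;

    import java.io.IOException;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.indexwriter.solr.SolrIndexWriter;

    public class HtmlOnlySolrIndexWriter extends SolrIndexWriter {
      @Override
      public void write(NutchDocument doc) throws IOException {
        Object mime = doc.getFieldValue("type"); // assumed mime-type field
        if (mime != null && mime.toString().startsWith("text/html")) {
          super.write(doc); // send only HTML documents to this writer
        }
        // other mime types are silently skipped by this writer
      }
    }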

Sending an empty http.agent.version

2017-10-23 Thread Yossi Tamari
Hi, http.agent.version defaults in nutch-default.xml to Nutch-1.14-SNAPSHOT (depending on the version of course). If I want to override it to not send a version as part of the user-agent, there is nothing I can do in nutch-site.xml, since putting an empty string there causes the default to be
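
For context, this is the override one would naively try in nutch-site.xml; as described above, the empty value currently falls back to the nutch-default.xml value instead of suppressing the version:

    <property>
      <name>http.agent.version</name>
      <value></value>
    </property>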

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
e/detect/LanguageDetector.html > The upgrade to Tika 1.16 is already in progress (NUTCH-2439). > Sebastian > On 10/24/2017 11:26 AM, Yossi Tamari wrote: > Hi > The language-identifier plugin uses org.apache.tika.languag

Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Hi. The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier for extracting the language from the document text. There are two issues with that: 1. LanguageIdentifier is deprecated in Tika. 2. It does not support CJK languages (and I suspect a lot of other
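
The non-deprecated replacement in Tika is the LanguageDetector API; a rough sketch of typical usage, based on general Tika documentation rather than this thread, so signatures should be checked against the Tika version in use:

    import java.io.IOException;
    import org.apache.tika.langdetect.OptimaizeLangDetector;
    import org.apache.tika.language.detect.LanguageDetector;
    import org.apache.tika.language.detect.LanguageResult;

    public class DetectLang {
      public static String detect(String text) throws IOException {
        // load the bundled language models once, then reuse the detector
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();
        LanguageResult result = detector.detect(text);
        return result.getLanguage(); // e.g. "en", "zh"
      }
    }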

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
C) because it's faster and has a better precision: > https://github.com/carrotsearch/langid-java.git > https://github.com/saffsd/langid.c.git > https://github.com/saffsd/langid.py.git > Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's C++). >

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
re also quite low, gives me pause... > Of course, maintenance or community around a project is an important factor. CLD2 is also not really maintained, plus the models are fixed, no code available to retrain them.

RE: General question on dealing with file types

2017-11-25 Thread Yossi Tamari
Hi Sol, Note that you do not need to use a regular expression to filter by file suffix; the suffix-urlfilter plugin does that. Obviously, if the URL does not contain the file type, you have to fetch it anyway, to get the mime-type. If there is no parser for this file type, it will not be parsed

RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari
Hi Michael, I think one way you can do it is using `readdb -dump new_crawldb -format crawldb -expr "score>0.03"`. You would then need to use hdfs commands to replace the existing /current with new_crawldb. Of course, I strongly recommend backing up the current crawldb before replacing it...
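
Put together, the purge could look roughly like this; the paths are illustrative, and the backup step really should come first:

    # keep only records whose score passes the expression
    bin/nutch readdb crawl/crawldb -dump new_crawldb -format crawldb -expr "score>0.03"
    # back up the old data, then swap in the filtered copy
    hdfs dfs -mv crawl/crawldb/current crawl/crawldb/current.bak
    hdfs dfs -mv new_crawldb crawl/crawldb/current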

RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari
Forgot to say: a urlfilter can't do that, since its input is just the URL, without any metadata such as the score. > -Original Message- > From: Yossi Tamari [mailto:yossi.tam...@pipl.com] > Sent: 04 December 2017 21:01 > To: user@nutch.apache.org; 'Michael Coffey' <mc

crawlcomplete

2017-12-04 Thread Yossi Tamari
Hi, I'm trying to understand some of the design decisions behind the crawlcomplete tool. I find the concept itself very useful, but there are a couple of behaviors that I don't understand: 1. URLs that resulted in redirect (even permanent) are counted as unfetched. That means that if I

RE: readseg dump and non-ASCII characters

2017-12-14 Thread Yossi Tamari
Hi Michael, Not directly answering this question, but keep in mind that as mentioned in the issue Sebastian referenced, there are many more places in Nutch that have the same problem, so setting LC_ALL is probably a good idea in general (until that issue is fixed...). If you're worried about
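
Setting it is a one-liner in the shell (or a profile script) before running the dump, for example:

    export LC_ALL=en_US.UTF-8
    bin/nutch readseg -dump crawl/segments/20171214... dump_out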

RE: Usage previous stage HostDb data for generate(fetched deltas)

2017-12-15 Thread Yossi Tamari
Hi Semyon, Maybe I'm missing the point, but I don't see why you would want to do this. On one hand, if there is only 1 URL per cycle, why not fetch it? The cost is negligible. On the other hand, imagine this scenario: You find the first link to some host from another host, and you crawl it. But

RE: sitemap and xml crawl

2017-11-02 Thread Yossi Tamari
> date > … > The other one also includes the content within the xml itself, so it doesn’t need further crawling. > I have standalone xml parsers

RE: sitemap and xml crawl

2017-11-02 Thread Yossi Tamari
this the right place, or should I be looking at creating a plugin page. Any advice would be helpful. > Thank you, > Ankit Goel > On 02-Nov-2017, at 1:14 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote: > Hi Ankit, > Accordin

RE: Problems starting crawl from sitemaps

2018-05-24 Thread Yossi Tamari
Hi Chris, In order to inject sitemaps, you should use the "nutch sitemap" command. After you inject those sitemaps to the crawl DB, you can proceed as normal with the crawl command, without the -s parameter. The error you are seeing may be because you have http.content.limit defined. The
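
Roughly, the injection step looks like this; the directory names are illustrative and the exact options should be checked against `bin/nutch sitemap` usage on your version:

    # inject the sitemap URLs listed under seeds/sitemaps/ into the crawldb
    bin/nutch sitemap crawl/crawldb -sitemapUrls seeds/sitemaps -threads 4
    # then continue crawling as usual, without -s
    bin/crawl -i crawl 5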

RE: Sitemap URL's concatenated, causing status 14 not found

2018-05-25 Thread Yossi Tamari
Hi Markus, I don’t believe this is a valid sitemapindex. Each <sitemap> should include exactly one <loc>. See also https://www.sitemaps.org/protocol.html#index and https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd. I agree that this is not the ideal error behaviour, but I guess the code was

RE: random sampling of crawlDb urls

2018-05-01 Thread Yossi Tamari
Hi Michael, If you are using 1.14, there is a parameter -sample that allows you to request a random sample. See https://issues.apache.org/jira/browse/NUTCH-2463. Yossi. > -Original Message- > From: Michael Coffey > Sent: 01 May 2018 23:47 > To: User
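
Something along these lines should produce a sample dump; the option was added by NUTCH-2463, so verify the exact name and semantics against `bin/nutch readdb` usage in 1.14:

    # dump roughly a 1% random sample of the crawldb in plain-text format
    bin/nutch readdb crawl/crawldb -dump sample_out -format normal -sample 0.01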

RE: Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue

2018-01-26 Thread Yossi Tamari
Hi Rushikesh, I don't have any experience with this specific plugin, but I have run across similar problems, with 2 possible reasons: 1. It is possible that this specific site does not properly declare what encoding it is using, and the browser guesses the correct one. 2. You may have run

RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
Hi Kaliyug, Nutch 2 still requires Hadoop to run, it just allows you to store data somewhere other than HDFS. The only way to run Nutch without Hadoop is local mode, which is only recommended for testing. To do that, run ./runtime/local/bin/crawl. Yossi. > -Original Message- >

RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
> To: user@nutch.apache.org > Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop > So what's the whole point of supporting Cassandra or other databases (via Gora) if Hadoop (HDFS & MR) both are essential? What exactly would Cassandra be doing? > On 23 Feb

RE: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-20 Thread Yossi Tamari
Hi Semyon, Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue? As far as I can see, the protocol (HTTP/HTTPS) does not play any part in deciding whether this is the same domain. Yossi.
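
In nutch-site.xml that would look like the following; db.ignore.external.links itself also has to be enabled for the mode to matter:

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>
    <property>
      <name>db.ignore.external.links.mode</name>
      <value>byDomain</value>
    </property>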

RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva, My suggestion would be to programmatically generate a seeds file containing these 497342 URLs (since you know them in advance), and then use a very low max-depth (probably 1), and a high number of iterations, since only a small number will be fetched in each iteration, unless you set
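
Generating such a seeds file is trivial to script; for example, assuming a simple page-number pattern (the URL template below is purely illustrative):

    # write one seed URL per listing page into the seed directory
    mkdir -p urls
    for i in $(seq 1 497342); do
      echo "https://www.example.com/stories?page=$i"
    done > urls/seed.txt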

RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva, Having looked at the specific site, I have to amend my recommended max-depth from 1 to 2, since I assume you want to fetch the stories themselves, not just the hubpages. If you want to crawl continuously, as Markus suggested, I still think you should keep the depth at 2, but define

IndexWriter interface in 1.15

2018-09-04 Thread Yossi Tamari
Hi, I missed it at the time, but I just realized (the hard way) that the IndexWriter interface was changed in 1.15 in ways that are not backward compatible. That means that any custom IndexWriter implementation will no longer compile, and probably will not run either. I think this was a

RE: IndexWriter interface in 1.15

2018-09-06 Thread Yossi Tamari
> user Digest 4 Sep 2018 15:53:01 - Issue 2929 > Topics (messages 34147 through 34147) > IndexWriter interface in 1.15 > 34147 by: Yossi Tamari > Administrivia: > ---

RE: [MASSMAIL]RE: Events out-of-the-box

2018-07-05 Thread Yossi Tamari
sing the included publisher component. Do you agree with me? > Regards > - Original Message - > From: "Yossi Tamari" > To: user@nutch.apache.org > Sent: Friday, 29 June 2018 2:09:52 > Subject: [MASSMAIL]RE: Ev

RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
parsemeta. > Outlinks I can get from OutlinkExtractor, but what about other parameters? > And again, getOutlinks is asking for a configuration and I don't know where I can get it from? > On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote: >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Perhaps; however, it starts with db, not linkdb (unlike the other linkdb properties), it is in the CrawlDB part of nutch-default.xml, and the LinkDB code uses the property name linkdb.max.anchor.length. > -Original Message- > From: Markus Jelsma > Sent: 12 March

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
nutch-default.xml contains a property db.max.outlinks.per.page, which I think is supposed to prevent these cases. However, I just searched the code and couldn't find where it is used. Bug? > -Original Message- > From: Semyon Semyonov > Sent: 12 March 2018 12:47 >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Sorry, not db.max.outlinks.per.page but db.max.anchor.length. Copy-paste error... > -Original Message- > From: Markus Jelsma > Sent: 12 March 2018 14:01 > To: user@nutch.apache.org > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long >

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
bastian Nagel <wastl.na...@googlemail.com> wrote: > Good catch. It should be renamed to be consistent with other properties, right? > On 03/12/2018 01:10 PM, Yossi Tamari wrote: > Perhaps, however it starts with db, not link

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
that normalizers are always called) > - the minimal solution: add a default rule to regex-urlfilter.txt.template to limit the length to 512 (or 1024/2048) characters > Best, > Sebastian > [1] https://github.com/DigitalPebble/storm-crawler/blob/maste

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
e. > Maybe it's ok not to apply it to seed URLs, but what about URLs from sitemaps and ev. redirects? > But agreed, you could always also add a rule to regex-urlfilter.txt if required. But it should be made clear that only outlinks are checked for length. > Could you reope

RE: Dependency between plugins

2018-03-14 Thread Yossi Tamari
name="CustomParse" point="org.apache.nutch.parse.Parser"> class="org.apache.nutch.parse

RE: Dependency between plugins

2018-03-14 Thread Yossi Tamari
Hi Yash, I don't know how to do it (I never tried), but if I had to, it would be a trial-and-error thing. If you want to increase the chances that someone will answer your question, I suggest you provide as much information as possible: Where did it not work? In "ant runtime", or when running

RE: RE: Dependency between plugins

2018-03-14 Thread Yossi Tamari
ooth as in case of contract implementations (the plugins are contracts, i.e. interfaces) and can easily break some OOP rules. > Sent: Wednesday, March 14, 2018 at 9:18 AM > From: "Yossi Tamari" <yossi.tam...@pipl.com> &

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
y be chain approach is a better idea to do that, but *does parse filter receive any DOM object?* as a parameter, so by accessing that I can extract the data I want?? > On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote: >

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
r@nutch.apache.org > Subject: RE: RE: Dependency between plugins > I tried printing the contents of the document fragment in parsefilter-regex by writing System.out.println(doc), but it's printing null!! And the document is getting parsed!! > On 15 Mar 2018 13:15, &q

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
Sent: 15 March 2018 10:26 > To: user@nutch.apache.org > Subject: RE: RE: Dependency between plugins > > Yes I am using Html parser and yes the document is getting parsed but > document fragment is printing null. > > On 15 Mar 2018 13:52, "Yossi Tamari" <yossi.tam...@pi

RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) > 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed! > at org.apache.hadoop.mapred.

RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
d so can you please help me regarding this? > Thanks a lot Yossi and Sebastian. > On 7 Mar 2018 16:11, "Yossi Tamari" <yossi.tam...@pipl.com> wrote: > Yas, just to be sure, you are using the original URL (the one that was in the ParseResult passed as

RE: No internet connection in Nutch crawler: Proxy configuration -PAC file

2018-04-23 Thread Yossi Tamari
To add to what Lewis said, PAC files are mostly used by browsers, not so much by servers (like Nutch). It is possible your IT department has another proxy configuration that you can use in a server. Keep in mind that a PAC file is just a JavaScript function that translates a URL to a proxy
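
For illustration, a PAC file boils down to a FindProxyForURL function like this; the host and port are examples:

    function FindProxyForURL(url, host) {
      // route everything through the corporate proxy, fall back to a direct connection
      return "PROXY proxy.example.com:8080; DIRECT";
    }

For Nutch itself, the equivalent information usually ends up in the http.proxy.host and http.proxy.port properties rather than in a PAC file.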

RE: Issues related to Hung threads when crawling more than 15K articles

2018-04-04 Thread Yossi Tamari
I believe this is normal behaviour. The fetch timeout which you have defined (fetcher.timelimit.mins) has passed, and the fetcher is exiting. In this case one of the fetcher threads is still waiting for a response from a specific URL. This is not a problem, and any URLs which were not fetched

Why doesn't hostdb support byDomain mode?

2018-03-04 Thread Yossi Tamari
Hi, Is there a reason that hostdb provides per-host data even when the generate/fetch are working by domain? This generates misleading statistics for servers that load-balance by redirecting to nodes (e.g. photobucket). If this is just an oversight, I can contribute a patch, but I'm not sure

RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
Hi Sebastian, So do you think this fix should be avoided? I wouldn't want to add something that will cause problems for users down the line, but, frankly, I can think of examples of domains that intend their robots.txt to apply across servers and protocols (crawl-delay), but I can't think of

RE: Regarding Internal Links

2018-03-05 Thread Yossi Tamari
You will need to write an HTML Parser Filter plugin. It receives the DOM of the document as a parameter; you will have to scan this and isolate the relevant sections, then extract the content of these sections (probably copying code from the HTML parser). Your filter returns a ParseResult, which

RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
Thanks Markus, I will open a ticket and submit a patch. One follow up question: UpdateHostDb checks and throws an exception if urlnormalizer-host (which can be used to mitigate the problem I mentioned) is enabled. Is that also an internal decision of OpenIndex, and perhaps should be removed now

RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
Thanks, I will submit a patch for this. Since this allows me to solve my specific issue, and since Sebastian raised some questions regarding byDomain, I will not proceed with that currently. > -Original Message- > From: Markus Jelsma > Sent: 05 March 2018

RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
n could be also to aggregate the counts by > domain. Usually, the HostDb is orders of magnitude smaller than the CrawlDb, > so this should be considerably fast. > > Best, > Sebastian > > On 03/05/2018 02:03 PM, Yossi Tamari wrote: > > Thanks, I will submit a patch for thi

RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yossi Tamari
arding Indexing to elasticsearch > IndexingJob ( | -all | -reindex) [-crawlId ] This is the output of nutch index. I have already configured the nutch-site.xml. > On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote: > I suggest

RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
You should go over each segment, and for each one produce a ParseText and a ParseData. This is basically what the HTML Parser does for the whole document, which is why I suggested you should dive into its code. A ParseText is basically just a String containing the actual content of the segment

RE: Events out-of-the-box

2018-06-29 Thread Yossi Tamari
This is not something I actually did, but you should be able to achieve this by adding a log4j appender for RabbitMQ (such as https://github.com/plant42/rabbitmq-log4j-appender), and configuring the relevant loggers and filters to send only the logging events you need to that appender. BTW, if

RE: index-replace: variable substitution?

2018-10-12 Thread Yossi Tamari
Hi Ryan, From looking at the code of index-replace, it uses Java's Matcher.replaceAll, so $1 (for example) should work. Yossi. > -Original Message- > From: Ryan Suarez
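
A quick reminder of the standard Java behaviour that index-replace inherits here: in Matcher.replaceAll, $1 in the replacement string refers to the first capturing group.

    import java.util.regex.Pattern;

    // extracts "https" from the URL via the first capturing group
    String scheme = Pattern.compile("^(\\w+)://.*$")
        .matcher("https://example.com/page")
        .replaceAll("$1");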

RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Yossi Tamari
Hi Hany, The Tika parser supports Boilerpipe for header and footer removal, but I don't know how well it works. You can test it online at https://boilerpipe-web.appspot.com/ > -Original Message- > From: hany.n...@hsbc.com > Sent: 14 November 2018 16:53 > To: user@nutch.apache.org >
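
Enabling Boilerpipe in the Tika parser is a nutch-site.xml switch; the property names below are recalled from nutch-default.xml and worth double-checking there:

    <property>
      <name>tika.extractor</name>
      <value>boilerpipe</value>
    </property>
    <property>
      <name>tika.extractor.boilerpipe.algorithm</name>
      <value>ArticleExtractor</value>
    </property>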

RE: Nutch 1.15: Solr indexing issue

2018-10-11 Thread Yossi Tamari
I'm using 1.15, but not with Solr. However, the configuration of IndexWriters changed in 1.15; you may want to read https://wiki.apache.org/nutch/IndexWriters#Solr_indexer_properties. Yossi. > -Original Message- > From: hany.n...@hsbc.com > Sent: 11 October 2018 10:20 > To:

RE: RE: unexpected Nutch crawl interruption

2018-11-19 Thread Yossi Tamari
I think in the case that you interrupt the fetcher, you'll have the problem that URLs that were scheduled to be fetched on the interrupted cycle will never be fetched (because of NUTCH-1842). Yossi. > -Original Message- > From: Markus Jelsma > Sent: 19 November 2018 14:52 >