Re: Nutch not crawling links inside RSS Feeds

2015-05-25 Thread Jorge Luis Betancourt Gonzalez
I don’t think you’ll need to modify parse-plugins.xml, because Tika (the default parser) is capable of handling RSS feeds [1]. Second, using the default Nutch distribution without any changes, executing a parse checker against the URL you provided gives me the following output: $ bin/nutch

Re: Removing duplicate URLs (Solr, Nutch, Drupal)

2014-10-15 Thread Jorge Luis Betancourt Gonzalez
The update request processor configuration that you’ve posted looks OK, but it won’t eliminate the content that is already duplicated in your index: it will basically prevent you from indexing duplicates, but it does nothing about what you already have in the Solr index. When you’re talking about
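For readers landing on this thread: the dedup mechanism being discussed is Solr’s update request processor chain. A minimal sketch of such a chain in solrconfig.xml might look like the following (the chain name and the field list are illustrative, not taken from the original poster’s setup):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that will hold the computed signature -->
    <str name="signatureField">signature</str>
    <!-- Replace existing documents that share the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- Fields used to compute the signature -->
    <str name="fields">url,content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Note that, as the reply above says, this only acts at indexing time: documents already duplicated in the index stay there until you reindex through the chain or remove them with a delete-by-query.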

Nutch vs Lucidworks Fusion

2014-09-30 Thread Jorge Luis Betancourt Gonzalez
Hi, the new Fusion product from Lucidworks provides “advanced filesystem and web crawlers”. Has anyone had time to check it out, and how does it compare to the current and future plans for Nutch? Just interested; I personally haven’t been able to download the product and test it, but I’m a bit

Re: generatorsortvalue

2014-09-10 Thread Jorge Luis Betancourt Gonzalez
this? Is there a complete schema of the data's path through each plugin type's functions? Benjamin. Sent from my iPad On Sep 10, 2014, at 04:02, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: You’ll need to write a couple of plugins to accomplish this. Which version of Nutch

Re: Web forum crawling using nutch

2014-09-01 Thread Jorge Luis Betancourt Gonzalez
I don’t think you’ll find all your answers in out-of-the-box Nutch, but you should study some of the extension points Nutch has. As far as I can see, you should be able to write custom plugins that will allow you to achieve your goals, but some programming is required. Greetings, On Aug

Re: HTML tag filtering or parsing?

2014-09-01 Thread Jorge Luis Betancourt Gonzalez
On Sep 1, 2014, at 3:46 AM, xan p...@prateeksachan.com wrote: As a start, I'm able to crawl websites and index the entire content to Solr. But, I want to index only specific content between certain HTML tags instead of the whole page. So, to achieve this, what should I use and how?

Re: Nutch @ApacheCon Europe 2014

2014-08-31 Thread Jorge Luis Betancourt Gonzalez
Great to hear! I’ll be waiting for the videos. I’ve also submitted a talk about using Nutch for image retrieval, but sadly it wasn’t accepted. +1 on a talk about plugin development, showing the life of a URL from the seed file to the Solr/backend index and where you can plug in your custom

Re: Error Reindex with Solr

2014-07-20 Thread Jorge Luis Betancourt Gonzalez
Check if the mappings are set up correctly and if you have the same schema.xml file in both places: Nutch and Solr. For a more concrete error you could check your Solr log; it will tell you what happened. Regards. On Jul 21, 2014, at 1:34 AM, Muhamad Muchlis tru3@gmail.com wrote: Hi All,

Filtering indexing of documents by MIME Type

2014-07-17 Thread Jorge Luis Betancourt Gonzalez
Hi All: I’m currently finishing a custom plugin that allows filtering of documents in the indexing stage (implemented as an IndexingFilter for Nutch 1.x); basically it lets you configure which types of documents you would like to end up indexed in Solr. The need for this plugin (in our
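The core decision such a filter makes — leaving aside the IndexingFilter wiring, which depends on the Nutch plugin APIs — is an allowlist check on the document’s MIME type. A minimal, self-contained sketch (the class and method names here are made up for illustration, not part of Nutch):

```java
import java.util.Set;

/** Sketch of the allowlist logic a MIME-type IndexingFilter would apply. */
public class MimeTypeFilterSketch {
    private final Set<String> allowedTypes; // lowercase MIME types to keep

    public MimeTypeFilterSketch(Set<String> allowedTypes) {
        this.allowedTypes = allowedTypes;
    }

    /**
     * Returns true if a document with this Content-Type should be indexed.
     * In a real IndexingFilter, returning null from filter() drops the document.
     */
    public boolean shouldIndex(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Strip any parameters, e.g. "text/html; charset=UTF-8" -> "text/html"
        int semicolon = contentType.indexOf(';');
        String mimeType =
            (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType).trim();
        return allowedTypes.contains(mimeType.toLowerCase());
    }

    public static void main(String[] args) {
        MimeTypeFilterSketch filter =
            new MimeTypeFilterSketch(Set.of("application/pdf", "text/html"));
        System.out.println(filter.shouldIndex("application/pdf"));          // true
        System.out.println(filter.shouldIndex("text/html; charset=UTF-8")); // true
        System.out.println(filter.shouldIndex("image/png"));                // false
    }
}
```

Stripping the parameter part of the Content-Type header before comparing is the important detail; a naive equality check would reject "text/html; charset=UTF-8".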

Re: Changing nutch for update documents instead of add new ones

2014-07-07 Thread Jorge Luis Betancourt Gonzalez
, 2014 at 9:44 AM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Some time ago, for a very particular use case, we abstracted this responsibility into a custom Solr plugin for a few stored fields. It would handle this case (not just updating a date field, but also keeping

Re: Feasibility questions regarding my new project

2014-07-02 Thread Jorge Luis Betancourt Gonzalez
I’m responding here on some areas where I’ve done something similar to what you need. On Jul 2, 2014, at 12:34 PM, Daniel Sachse m...@wombatsoftware.de wrote: Hey guys, I am working on a new SAAS product regarding website metrics. There are some basic things I need to achieve, and I ask

Re: Changing nutch for update documents instead of add new ones

2014-07-01 Thread Jorge Luis Betancourt Gonzalez
Some time ago, for a very particular use case, we abstracted this responsibility into a custom Solr plugin for a few stored fields. It would handle this case (not just updating a date field, but also keeping a counter of how many times a URL is indexed). Of course you need stored fields for

Re: Please share your experience of using Nutch in production

2014-06-23 Thread Jorge Luis Betancourt Gonzalez
Why are you assuming that the webmasters are actually going to block you? In my experience this is the least probable scenario. On Jun 22, 2014, at 4:14 PM, Meraj A. Khan mera...@gmail.com wrote: Gora, Thanks for sharing your admin perspective; rest assured I am not trying to

Re: Identifying Video Links in Pages

2014-05-27 Thread Jorge Luis Betancourt Gonzalez
I’ve done something similar, not with iframes but with other custom elements we needed, and the same logic applies. Implement a custom HtmlParseFilter and an IndexingFilter; this way you can control how you want the data to be indexed. But you’re on the right track, perhaps not overriding parse-html,

Re: Nutch survey

2014-05-22 Thread Jorge Luis Betancourt Gonzalez
Done! On May 21, 2014, at 11:56 PM, Bayu Widyasanyata bwidyasany...@gmail.com wrote: Done! Great Julien! On Wed, May 21, 2014 at 10:58 PM, Markus Jelsma markus.jel...@openindex.iowrote: Great! Done! :-)Julien Nioche lists.digitalpeb...@gmail.com schreef:Hi everyone! I had written

Re: Email and blogs crawling

2014-01-28 Thread Jorge Luis Betancourt Gonzalez
For parsing, Nutch also uses Tika, which supports a lot of formats [1], including several mail formats. [1] tika.apache.org/0.9/formats.html On Jan 29, 2014, at 12:09 AM, Tejas Patil tejas.patil...@gmail.com wrote: Nutch has these protocols implemented: http, https, ftp, file. As long as you

Re: Manipulating Nutch 2.2.1 scoring system

2013-12-07 Thread Ing. Jorge Luis Betancourt Gonzalez
How can I send the LinkRank or OPIC scoring into Solr/HBase? - Original message - From: Tejas Patil tejas.patil...@gmail.com To: user@nutch.apache.org Sent: Saturday, December 7, 2013 12:44:16 Subject: Re: Manipulating Nutch 2.2.1 scoring system Hi Vangelis, You can write your own

Re: Incomplete HTML content of a crawled Page in ParseFilter ?

2013-06-17 Thread Ing. Jorge Luis Betancourt Gonzalez
I've experienced a similar issue on my development station running Mac OS X 10.8, but the same code worked perfectly on my server VM running Ubuntu, so no JIRA issue was created in the end. Also, my case was fetching image files, not HTML content, and the files were hosted locally, so no connection

Re: Problem compiling FeedParser plugin with Nutch 2.1 source

2013-03-02 Thread Jorge Luis Betancourt Gonzalez
How do the subdocuments get indexed into Solr? I thought that 1-to-N wasn't possible with Nutch 1.x. - Original message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org Sent: Saturday, March 2, 2013 3:27:35 Subject: Re: Problem compiling

Re: Problem compiling FeedParser plugin with Nutch 2.1 source

2013-03-02 Thread Jorge Luis Betancourt Gonzalez
Subject: Re: Problem compiling FeedParser plugin with Nutch 2.1 source Hi Jorge, AFAIK it isn't. We're talking about w.x here. On Saturday, March 2, 2013, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: How do the subdocuments get indexed into Solr? I thought that 1-to-N wasn't

Re: Nutch 2.1 - Image / Video Search

2013-02-25 Thread Jorge Luis Betancourt Gonzalez
Hi: Like Raja said, it's possible; the thing is that, out of the box, Nutch is only able to index the metadata of the file. You can always write some plugins to implement any logic you desire. - Original message - From: Raja Kulasekaran cull...@gmail.com To: user@nutch.apache.org

Re: Deploy nutch on existing Hadoop cluster

2013-02-21 Thread Jorge Luis Betancourt Gonzalez
Perhaps this could help: http://www.rui-yang.com/develop/build-nutch-1-4-cluster-with-hadoop/ - Original message - From: Amit Sela am...@infolinks.com To: user@nutch.apache.org Sent: Thursday, February 21, 2013 5:00:29 Subject: Deploy nutch on existing Hadoop cluster Anyone have a

Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Jorge Luis Betancourt Gonzalez
What could be a good way of specifying which password goes with which PDF file? By full URI, by filename, or something else? - Original message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org, John Dhabolt myco...@yahoo.com Sent: Wednesday, February 13, 2013

Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Jorge Luis Betancourt Gonzalez
to Tika from Nutch for encrypted PDFs? There can be PDF files with the same name at different hosts, so using the URL would be better than the name. All this info can be in an XML file which will be read by the PDF plugin. Thanks, Tejas Patil On Wed, Feb 13, 2013 at 10:35 AM, Jorge Luis

Re: How do I pass a password to Tika from Nutch for encrypted PDFs?

2013-02-13 Thread Jorge Luis Betancourt Gonzalez
every URL of that host. Thanks, Tejas Patil On Wed, Feb 13, 2013 at 12:57 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: I got this, but it's really tedious work to list passwords for each PDF file that will be crawled, don't you think? - Original message - From: Tejas Patil

Re: How to get page content of crawled pages

2013-01-29 Thread Jorge Luis Betancourt Gonzalez
I suppose you can write a custom indexer to store the data in MongoDB instead of Solr; I think there is an open repo on GitHub about this. - Original message - From: peterbarretto peterbarrett...@gmail.com To: user@nutch.apache.org Sent: Tuesday, January 29, 2013 8:46:04 Subject: Re:

Solr dynamic fields

2013-01-28 Thread Jorge Luis Betancourt Gonzalez
Hi: I'm currently working on a platform to crawl a large amount of PDF files. Using Nutch (and Tika) I'm able to extract and store the textual content of the files in Solr, but now we want to be able to extract the content of the PDFs by page; this means that we want to store several

Re: Solr dynamic fields

2013-01-28 Thread Jorge Luis Betancourt Gonzalez
original - From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user@nutch.apache.org Sent: Monday, January 28, 2013 10:53:04 Subject: Solr dynamic fields Hi: I'm currently working on a platform to crawl a large amount of PDF files. Using Nutch (and Tika) I'm able to extract

Re: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread Jorge Luis Betancourt Gonzalez
me to disclose details of the plugin at this time. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 6:20 pm Subject: Re: Access crawled content or parsed data of previous crawled url Hi Alex

Re: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread Jorge Luis Betancourt Gonzalez
of the plugin at this time. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 6:20 pm Subject: Re: Access crawled content or parsed data of previous crawled url Hi Alex: What you've done

Re: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread Jorge Luis Betancourt Gonzalez
employer does not want me to disclose details of the plugin at this time. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 6:20 pm Subject: Re: Access crawled content

Access crawled content or parsed data of previous crawled url

2012-11-28 Thread Jorge Luis Betancourt Gonzalez
Hi: From what I've seen, in Nutch plugins there is a philosophy of one NutchDocument per URL, but I was wondering if there is any way of accessing the parsed/crawled content of a previously fetched/parsed URL. Let's say, for instance, that I have an HTML page with an image embedded: so the start point will

Re: Access crawled content or parsed data of previous crawled url

2012-11-28 Thread Jorge Luis Betancourt Gonzalez
tags. We retrieve img tag data while parsing the HTML page and keep it in metadata, and when parsing the img URL itself we create the thumbnail. HTH. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 2

Working with images

2012-11-23 Thread Jorge Luis Betancourt Gonzalez
Hi all: I'm trying to write a plugin to detect the surrounding text around images inside HTML (img tags). Of course I wrote this plugin implementing HtmlParseFilter, and when I get an HTML page I walk through the DocumentFragment detecting the img tags and then detecting the text in the closest
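The DOM walk described above can be sketched without any Nutch dependencies. This is a simplified stand-in — it parses a well-formed XHTML string with the JDK's DOM parser and takes the text content of each img element's parent as the "surrounding text", which is a rougher heuristic than walking to the closest text nodes (class name and heuristic are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Sketch: collect the text around each img element in an XHTML fragment. */
public class ImgContextSketch {
    public static List<String> surroundingText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> contexts = new ArrayList<>();
            NodeList imgs = doc.getElementsByTagName("img");
            for (int i = 0; i < imgs.getLength(); i++) {
                Node parent = imgs.item(i).getParentNode();
                // getTextContent() concatenates all descendant text of the parent,
                // approximating "the text closest to the image".
                contexts.add(parent.getTextContent().trim());
            }
            return contexts;
        } catch (Exception e) {
            throw new RuntimeException("failed to parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>A red fox <img src=\"fox.png\"/> jumping.</p></body></html>";
        System.out.println(surroundingText(page));
    }
}
```

Inside a real HtmlParseFilter the input would be the already-parsed DocumentFragment rather than a string, and real crawled HTML is rarely well-formed XML, which is why Nutch hands plugins the DOM produced by its own HTML parser.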

Re: Get full content in a plugin extending HTMLParseFilter

2012-11-21 Thread Jorge Luis Betancourt Gonzalez
) I'm just building the plugin on this machine, and testing on Ubuntu GNU/Linux 12.04 works just fine. Would this be worth starting an issue? It seems to be just my particular environment. Greetings On Nov 20, 2012, at 7:16 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi: I'm

Get full content in a plugin extending HTMLParseFilter

2012-11-20 Thread Jorge Luis Betancourt Gonzalez
Hi: I'm adapting a plugin written against the Parser interface to generate thumbnails of images (like Julien suggested); the thing is that, after extending HtmlParseFilter, I'll delegate metadata detection to Tika and only worry about the thumbnail generation. The funny thing

Integrating Nutch and RabbitMQ

2012-11-13 Thread Jorge Luis Betancourt Gonzalez
Hi people: I'm thinking (for now it's just a thought) about a possible integration between Nutch and some message queuing service (like RabbitMQ); the idea is to do some offline processing of data crawled with Nutch (and indexed into Solr). Let's take an example: I want to categorize the pages

Re: Integrating Nutch and RabbitMQ

2012-11-13 Thread Jorge Luis Betancourt Gonzalez
Hi, thank you for taking the time to reply to my email; I really appreciate it. I'm thinking (for now it's just a thought) about a possible integration between Nutch and some message queuing service (like RabbitMQ); the idea is to do some offline processing of data crawled with Nutch (and indexed

How to restrict a plugin to some specific mimetype

2012-11-12 Thread Jorge Luis Betancourt Gonzalez
Hi all: I have a plugin to generate a thumbnail from images and store it in Solr. In a previous thread Julien recommended that this plugin be rewritten as an HtmlParseFilter; with this, Tika could extract the usual metadata from the image, and my custom plugin would generate the

Re: How to restrict a plugin to some specific mimetype

2012-11-12 Thread Jorge Luis Betancourt Gonzalez
Hi Julien: Thanks for your reply. I thought there would be a better way to implement this, but it's working! Right now I put my code inside an if block; this is what I use to detect the MIME type: if (content.getContentType().contains("image")) {...} Is this a bulletproof way of accomplishing
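As an aside for future readers: a substring match like contains("image") also fires on types that merely mention "image" somewhere in the string, so checking the major-type prefix is safer. A small sketch of the idea (the helper class is hypothetical, not part of the Nutch API):

```java
/** Sketch: prefix check on the major MIME type instead of a substring match. */
public class ImageTypeCheck {
    public static boolean isImage(String contentType) {
        if (contentType == null) {
            return false;
        }
        // "image/png", "image/jpeg; q=0.9", ... all start with the major type.
        return contentType.trim().toLowerCase().startsWith("image/");
    }

    public static void main(String[] args) {
        System.out.println(isImage("image/png"));  // true
        // Contains "image" but is not an image type, so a substring match would be wrong:
        System.out.println(isImage("application/vnd.oasis.opendocument.image"));  // false
    }
}
```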

Re: please how to parse image documents with multiple parser ?

2012-11-05 Thread Jorge Luis Betancourt Gonzalez
So, summarizing: if I rewrite the plugin as an HtmlParseFilter, the document (without any concern about the MIME type) would pass into Tika and then through all the other parsers, right? Greetings! On Nov 1, 2012, at 7:39 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Sorry for my bad

Re: Image search engine based on nutch/solr

2012-10-21 Thread Jorge Luis Betancourt Gonzalez
Hi, as Lewis said before, if you are going to use Nutch for image retrieval and indexing in Solr, you'll need to invest some time writing tools, depending on your needs. I've been working on a search engine using Nutch for the crawling process and Solr as an indexing server; the typical

Re: Image search engine based on nutch/solr

2012-10-21 Thread Jorge Luis Betancourt Gonzalez
Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Sun, Oct 21, 2012 7:26 pm Subject: Re: Image search engine based on nutch/solr Hi, As Lewis say before, if you are going to use nutch for image retrieval and indexing in solr

Re: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread Jorge Luis Betancourt Gonzalez
If I want to keep a cache of the crawled websites, something similar to Google's cached view, would going with HBase be the best option, or storing them in a filesystem? - Original message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org Sent:

Image processing with nutch and metadata detection with Tika

2012-10-05 Thread Jorge Luis Betancourt Gonzalez
Hi: I'm trying to crawl some binary files with Nutch; right now it crawls just fine, but I want to extract all the metadata. The main question I'm facing is how do I tell Nutch to send some metadata detected with Tika to a specific Solr field? I'm guessing it's something in

Heuristic methods for image annotation

2012-09-17 Thread Jorge Luis Betancourt Gonzalez
Hi all: I'm working on an image search engine using the combination of Nutch and Solr. With Nutch and Tika I get some metadata extracted from the images; so far so good. But I'm trying to improve the accuracy of the results using the surrounding text of the images. I know that there are

Re: Problem with NullPointerException on custom Parser

2012-06-28 Thread Jorge Luis Betancourt Gonzalez
to add a contentType to your implementation in plugin.xml: <implementation id="ImageThumbnailParser" class="...ImageThumbnailParser"><parameter name="contentType" value="image/png"/></implementation> Good luck! Sent from my iPhone, Mathijs Homminga On Jun 28, 2012, at 0:12, Jorge Luis Betancourt Gonzalez

Re: Problem with NullPointerException on custom Parser

2012-06-28 Thread Jorge Luis Betancourt Gonzalez
index-metadata) HTH Julien On 27 June 2012 22:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi, I agree with you, and it is a genius idea to rely on Tika to parse the files, but in this particular case, when all I want to do is encode the content into base64, should I write a custom

Problem with NullPointerException on custom Parser

2012-06-27 Thread Jorge Luis Betancourt Gonzalez
Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with Nutch 1.4. I'm doing this because the thumbnails will be converted into a base64-encoded string and stored in a Solr backend. So I basically wrote a custom parser (to which I send all PNG images, for
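The scale-then-encode step this thread revolves around can be done with the plain JDK (ImageIO, Graphics2D, Base64). A self-contained sketch under that assumption — in the real plugin the source image would come from the fetched Content bytes, and the class name here is made up:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

/** Sketch: scale an image and encode the PNG thumbnail as a base64 string. */
public class ThumbnailSketch {
    public static String thumbnailBase64(BufferedImage source, int width, int height) {
        BufferedImage thumb = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = thumb.createGraphics();
        g.drawImage(source, 0, 0, width, height, null); // scale into the target size
        g.dispose();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(thumb, "png", out); // serialize the thumbnail as PNG bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    public static void main(String[] args) {
        // Fabricate an image so the sketch runs standalone; the plugin would
        // instead decode the crawled bytes with ImageIO.read(...).
        BufferedImage img = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        String b64 = thumbnailBase64(img, 32, 16);
        System.out.println(b64.length());
    }
}
```

The resulting string can then be put on a stored Solr field; note that base64 inflates the thumbnail bytes by about a third, which is one reason to keep thumbnails small.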

Re: Problem with NullPointerException on custom Parser

2012-06-27 Thread Jorge Luis Betancourt Gonzalez
, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi all: I'm working on a custom parser plugin to generate thumbnails from images fetched with Nutch 1.4. I'm doing this because the thumbnails will be converted into a base64-encoded string and stored in a Solr backend. So I basically

Re: Problem with NullPointerException on custom Parser

2012-06-27 Thread Jorge Luis Betancourt Gonzalez
is relied upon by external parsing libraries. Basically we need to define a parser to do the parsing, using Tika as a wrapper for mimeType detection and subsequent parsing saves us a bit of overhead. Lewis On Wed, Jun 27, 2012 at 9:44 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi

Re: Problem with NullPointerException on custom Parser

2012-06-27 Thread Jorge Luis Betancourt Gonzalez
Parser No need for Tika. Can you send your plugin.xml? Mathijs Homminga On Jun 27, 2012, at 23:07, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote: Hi, I agree with you, and it is a genius idea to rely on Tika to parse the files, but in this particular case, when all I want to do

Re: Problem with NullPointerException on custom Parser

2012-06-27 Thread Jorge Luis Betancourt Gonzalez
with NullPointerException on custom Parser Hmmm looking at the ParserFactory code, there can actually be several causes for a NullPointerException... Can you also send the parse-plugins.xml? Mathijs Homminga On Jun 27, 2012, at 23:23, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote