Indexing HTML Metatags Nutch - SOLR

2020-01-18 Thread kra...@gds2.de
Hello, I have been trying this for several days without success. (nutch 1.16 - solr 7.3.1) I have followed this description: https://cwiki.apache.org/confluence/display/nutch/IndexMetatags Below I put my file nutch-site.xml I have created the core following this description: https://cwiki.apache

Indexing HTML table into SOLR

2015-11-11 Thread Sznajder ForMailingList
Hi, Does one of you have some pointers (articles, papers, etc...) or experience to share about the right way for indexing the html tables content into Solr Documents? Thanks! Benjamin

Question about indexing html file

2015-07-17 Thread Huiying Ma
Hi everyone, I'm a new user for solr and I need to index some html files based on the tags and the classes and then complete a web interface to fulfill the search document search function. Now I have some question about how to index those html files using my own rules. I have checked the documents

Re: SolrCell and indexing HTML

2014-03-21 Thread Jack Krupansky
y for the various mapping parameters. -- Jack Krupansky -Original Message- From: Liz Sommers Sent: Friday, March 21, 2014 12:56 PM To: solr-user Subject: SolrCell and indexing HTML I am trying to write a POC about indexing URL's with Solr using solrJ and solrCell. (The code is wr

Re: SolrCell and indexing HTML

2014-03-21 Thread Greg Walters
I've never tried indexing via groovy or using solrCell but I think you might be working a bit too low level in solrj if you're just adding documents. You might try checking out https://wiki.apache.org/solr/Solrj#Adding_Data_to_Solr and I might be way off base :) Thanks, Greg On Mar 21, 2014, a

SolrCell and indexing HTML

2014-03-21 Thread Liz Sommers
I am trying to write a POC about indexing URL's with Solr using solrJ and solrCell. (The code is written in groovy). The relevant code is here ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract"); req.setParam("literal.id",p.id.toString()) req.setPa

Solr indexing HTML metatags from Nutch

2012-05-10 Thread ML mail
Hello, I am using Nutch 1.4 with Solr 3.6.0 and would like to get the HTML keywords and description metatags indexed into Solr. On the Nutch side I have followed the http://wiki.apache.org/nutch/IndexMetatags to get nutch parsing and extracting the metatags (using index-metatags and parse-metat

Re: Solr: extracting/indexing HTML via cURL

2012-05-02 Thread Lance Norskog
processing chain, but >> that may be too much effort compared to the HTML strip filter. >> >> -- Jack Krupansky >> >> -Original Message- From: okayndc >> Sent: Monday, April 30, 2012 10:07 AM >> To: solr-user@lucene.apache.org >> Subject: Solr: e

Re: extracting/indexing HTML via cURL

2012-05-01 Thread okayndc
ll be stripped of the tags > during analysis and be searchable just like a normal text field. Then, > search will not see "". > > > -- Jack Krupansky > > -Original Message- From: okayndc > Sent: Tuesday, May 01, 2012 10:08 AM > To: solr-user@lucene

Re: extracting/indexing HTML via cURL

2012-05-01 Thread Jack Krupansky
will not see "". -- Jack Krupansky -Original Message- From: okayndc Sent: Tuesday, May 01, 2012 10:08 AM To: solr-user@lucene.apache.org Subject: Re: extracting/indexing HTML via cURL Thank you Jack. So, it's not doable/possible to search and highlight keywords with

Re: extracting/indexing HTML via cURL

2012-05-01 Thread okayndc
-- Jack Krupansky > > -Original Message- From: okayndc > Sent: Monday, April 30, 2012 5:06 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr: extracting/indexing HTML via cURL > > Great, thank you for the input. My understanding of HTMLStripCharFilter is > that it stri

Re: extracting/indexing HTML via cURL

2012-04-30 Thread Jack Krupansky
Sent: Monday, April 30, 2012 5:06 PM To: solr-user@lucene.apache.org Subject: Re: Solr: extracting/indexing HTML via cURL Great, thank you for the input. My understanding of HTMLStripCharFilter is that it strips HTML tags, which is not what I want ~ is this correct? I want to keep the HTML tags i

Re: Solr: extracting/indexing HTML via cURL

2012-04-30 Thread okayndc
iginal Message- From: okayndc > Sent: Monday, April 30, 2012 10:07 AM > To: solr-user@lucene.apache.org > Subject: Solr: extracting/indexing HTML via cURL > > > Hello, > > Over the weekend I experimented with extracting HTML content via cURL and > just > wondering why the e

Re: Solr: extracting/indexing HTML via cURL

2012-04-30 Thread Jack Krupansky
nday, April 30, 2012 10:07 AM To: solr-user@lucene.apache.org Subject: Solr: extracting/indexing HTML via cURL Hello, Over the weekend I experimented with extracting HTML content via cURL and just wondering why the extraction/indexing process does not include the HTML tags. It seems as though

Solr: extracting/indexing HTML via cURL

2012-04-30 Thread okayndc
Hello, Over the weekend I experimented with extracting HTML content via cURL and just wondering why the extraction/indexing process does not include the HTML tags. It seems as though the HTML tags either being ignored or stripped somewhere in the pipeline. If this is the case, is it possible to in

Re: Indexing HTML files in SOLR

2010-06-20 Thread seesiddharth
Thank you so much for your help... I will try it... -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-files-in-SOLR-tp896530p910555.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing HTML files in SOLR

2010-06-19 Thread Lance Norskog
s. > It will be great if u answer my question : > Is there any better approach to achieve the same functionality ? > > Regards, > Siddharth > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Indexing-HTML-files-in-SOLR-tp896530p902644.html >

Re: Indexing HTML files in SOLR

2010-06-17 Thread seesiddharth
nts. It will be great if u answer my question : Is there any better approach to achieve the same functionality ? Regards, Siddharth -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-files-in-SOLR-tp896530p902644.html Sent from the Solr - User mailing li

Re: Indexing HTML files in SOLR

2010-06-16 Thread Lance Norskog
wrote: > > Hi, >            I am using SOLR with Apache Tomcat. I have some .html > files(contains the articles) stored at XYZ location. How can I index these > .html files in SOLR? > > Regards, > Siddharth > -- > View this message in context: > http://lucene.472066.

Indexing HTML files in SOLR

2010-06-15 Thread seesiddharth
Hi, I am using SOLR with Apache Tomcat. I have some .html files(contains the articles) stored at XYZ location. How can I index these .html files in SOLR? Regards, Siddharth -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-files-in-SOLR

Re: Indexing HTML

2010-06-10 Thread Lance Norskog
I use the > HTMLStripCharFilterFactory? > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p885797.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Lance Norskog goks...@gmail.com

Re: Indexing HTML

2010-06-10 Thread Blargy
Do I even need to tidy/clean up the html if I use the HTMLStripCharFilterFactory? -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p885797.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing HTML

2010-06-09 Thread Ken Krugler
On Jun 9, 2010, at 8:38pm, Blargy wrote: What is the preferred way to index html using DIH (my html is stored in a blob field in our database)? I know there is the built in HTMLStripTransformer but that doesn't seem to work well with malformed/incomplete HTML. I've created a custom tra

Re: Indexing HTML

2010-06-09 Thread Blargy
Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at index time? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884592.html Sent from

Re: Indexing HTML

2010-06-09 Thread Blargy
context: http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884579.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing HTML

2010-06-09 Thread Lance Norskog
te  html? Thanks > > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Lance Norskog goks...@gmail.com

Indexing HTML

2010-06-09 Thread Blargy
with malformed/incomplete html? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing HTML document

2010-03-03 Thread György Frivolt
Thank you! That's even more I wanted to know. ;) Georg On Tue, Mar 2, 2010 at 10:05 PM, Walter Underwood wrote: > You are in luck, because Avi Rappoport has just written a tutorial about > how to do this. It is available from Lucid Imagination: > > > http://www.lucidimagination.com/solutions/wh

Re: Indexing HTML document

2010-03-02 Thread Walter Underwood
You are in luck, because Avi Rappoport has just written a tutorial about how to do this. It is available from Lucid Imagination: http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr I've just started reviewing it, but knowing Avi, I expect it to be very he

Re: Indexing HTML document

2010-03-02 Thread Siddhant Goel
There is an HTML filter documented here, which might be of some help - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Control characters can be eliminated using code like this - http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-44
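
The control-character advice in this reply can be illustrated with a minimal Python sketch. It is not the pysolr code linked above; the function name `strip_control_chars` is hypothetical, and the character ranges are an assumption based on the control characters that XML 1.0 does not allow (everything below U+0020 except tab, newline, and carriage return).

```python
import re

# Control characters that are not legal in XML 1.0 text.
# Tab (\x09), newline (\x0a) and carriage return (\x0d) are allowed, so they are kept.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text: str) -> str:
    """Return text with XML-illegal control characters removed (hypothetical helper)."""
    return _CONTROL_CHARS.sub("", text)
```

Running the text through a helper like this before building the update message should avoid parser errors caused by stray control bytes; HTML tags themselves are left for the analyzer (for example HTMLStripCharFilterFactory) to handle.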

Indexing HTML document

2010-03-02 Thread György Frivolt
Hi, How to index properly HTML documents? All the documents are HTML, some containing characters encoded like ží ... Is there a character filter for filtering these codes? Is there a way to strip the HTML tags out? Does solr weight the terms in the document based on where they appear?.. words in hea

Re: solr application for website crawling and indexing html, pdf, word, ... files

2010-01-25 Thread Markus Jelsma
Hello Frank, Answers are inline: Frank van Lingen said: > I recently started working with solr and find it easy to setup and > tinker with. > > I now want to scale up my setup and was wondering if there is an > application/component that can do the following (I was not able to find > documentatio

Re: solr application for website crawling and indexing html, pdf, word, ... files

2010-01-25 Thread mike anderson
I think you might be looking for Apache Tika. On Mon, Jan 25, 2010 at 3:55 PM, Frank van Lingen wrote: > I recently started working with solr and find it easy to setup and tinker > with. > > I now want to scale up my setup and was wondering if there is an > application/component that can do the

solr application for website crawling and indexing html, pdf, word, ... files

2010-01-25 Thread Frank van Lingen
I recently started working with solr and find it easy to setup and tinker with. I now want to scale up my setup and was wondering if there is an application/component that can do the following (I was not able to find documentation on this on the solr site): -Can I send solr an xml document with a

Re: Trouble Indexing HTML Files

2009-09-11 Thread Noble Paul നോബിള്‍ नोब्ळ्
hey XpathEntityprocessor does not work with wildcard xpath like '//a...@class' if you just wish to index html use a PlaintextEntityProcessor with HTMLStripTransformer On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen wrote: > *HI there-** > * > *I'm trying to get the dataimporthandler working to rec

Re: Trouble Indexing HTML Files

2009-09-11 Thread Shalin Shekhar Mangar
On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen < daniel.michael.co...@gmail.com> wrote: > *HI there-** > * > *I'm trying to get the dataimporthandler working to recursively parse the > content of a root directory, which contain several other directories > beneath > it... The indexing seems to encou

RE: Indexing HTML Content

2008-05-22 Thread Lance Norskog
amount of string processing it does, the fact that it is a Reader probably does not affect its performance. Cheers, Lance -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, May 22, 2008 10:14 AM To: solr-user@lucene.apache.org Subject: Re: Indexing HTML

Re: Indexing HTML Content

2008-05-22 Thread Otis Gospodnetic
/HTMLStripWhitespaceTokenizerFactory.java Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "McBride, John" <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Thursday, May 22, 2008 4:44:23 AM > Subject: Inde

Re: Indexing HTML Content

2008-05-22 Thread David Arpad Geller
Actually, it's very easy: http://us2.php.net/strip_tags I also store the data in a separate field with the html intact for display. In that case, I use urlencode on the string. David McBride, John wrote: Hello, In my application I wish to index articles which are stored in HTML format. Up
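
The approach in this reply (strip the tags on the client before posting, and keep the original HTML in a separate field for display) can be sketched in Python as a rough analogue of PHP's strip_tags. This is only an illustration under stated assumptions: the class `_TextExtractor` and the function `strip_tags` are hypothetical names, and unlike a full cleaner this sketch also keeps the text inside script and style elements.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of html with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Concatenate without a separator, as strip_tags does, then normalize whitespace.
    return " ".join("".join(parser.parts).split())
```

The stripped text would go into the searchable field, while the untouched HTML could be sent to a second, stored-only field. Because text nodes are joined without a separator, adjacent block elements such as two paragraphs run together; inserting a space on block-level end tags is one possible refinement.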

Re: Indexing HTML Content

2008-05-22 Thread solr
Hi, Maybe this one? http://htmlparser.sourceforge.net/ /Jimi Quoting "McBride, John" <[EMAIL PROTECTED]>: Hello, In my application I wish to index articles which are stored in HTML format. Upon indexing these the html gets stored along with the content of the article, which is undesirable.

Indexing HTML Content

2008-05-22 Thread McBride, John
Hello, In my application I wish to index articles which are stored in HTML format. Upon indexing these the html gets stored along with the content of the article, which is undesirable. Do you know of any common way of parsing the text content from HTML before adding to SOLR? I understand SOLR 1

Re: Fields, Facets and Indexing html document

2008-03-25 Thread Vinci
m: Vinci <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Tuesday, March 25, 2008 4:25:10 PM > Subject: Fields, Facets and Indexing html document > > > Hi all, > > I want to Solr to index my html document collection. After I read number > of > tutorial and goog

Re: Fields, Facets and Indexing html document

2008-03-25 Thread Otis Gospodnetic
From: Vinci <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Tuesday, March 25, 2008 4:25:10 PM Subject: Fields, Facets and Indexing html document Hi all, I want to Solr to index my html document collection. After I read number of tutorial and google search, I have some questions.

Fields, Facets and Indexing html document

2008-03-25 Thread Vinci
Hi all, I want to Solr to index my html document collection. After I read number of tutorial and google search, I have some questions... 1. Can I index html document directly? 2. what should I do on the default schema.xml for indexing html documents? 3. Can fields to be defined by a combination

Re: Indexing HTML

2007-10-04 Thread Mike Klaas
while searching (anchors/titles etc). Why is there no documentation about indexing HTML specifically using solr. How does nutch do it? does it strip out html in the snippets it returns? Solr isn't a web search engine, and doesn't do any special processing of html (although you can ask it t

Re: Indexing HTML

2007-10-03 Thread Ravish Bhagdev
the resulting html in a webpage. Is it possible to strip out all HTML tags completely in result set? Would you recommend sending stripped out text to solr instead? But doesn't Solr use HTML features while searching (anchors/titles etc). Why is there no documentation about indexing HTML specific

Re: Indexing HTML

2007-08-27 Thread Erik Hatcher
On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote: What's odd about this is that the error seems to indicate that I did. Actually the error message looks like you escaped too much. You should _not_ escape , only the contents of it. Erik The full text (minus the stack trace)

Re: Indexing HTML

2007-08-27 Thread Michael Kimsal
What's odd about this is that the error seems to indicate that I did. The full text (minus the stack trace) was org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...... @4:37) Or is that just a by

Re: Indexing HTML

2007-08-27 Thread Erik Hatcher
Michael, I think the issue is that you're not escaping the values. Send something like this to Solr instead: linktext Erik On Aug 27, 2007, at 9:29 AM, Michael Kimsal wrote: Hello I'm trying to index individual lines of an HTML file, a

Re: Indexing HTML

2007-08-27 Thread Thierry Collogne
I think you can use the HTMLStripWhitespaceTokenizerFactory. Look here : http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e I hope this helps On 27/08/07, Michael Kimsal <[EMAIL PROTECTED]> wrote: > > Hello > > I'm trying to index individu

Indexing HTML

2007-08-27 Thread Michael Kimsal
Hello I'm trying to index individual lines of an HTML file, and I'm hitting this error: TEXT must be immediately followed by END_TAG and not START_TAG I've got something that looks like 4 linktext Actually, that sample code above, as its own data file POSTed to SOLR, throws parser must be

Re: Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Ravish Bhagdev
Thanks Jérôme! It seems to work now. I just hope the provided HTMLStripWhitespaceTokenizerFactory will strip the right tags now. I use Java and used HtmlEncoder provided in http://itext.ugent.be/library/api/ for encoding with success. (just in case someone happens to search this thread) Ravi

Re: Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Jérôme Etévé
You need to encode your html content so it can be included as a normal 'string' value in your xml element. As far as I remember, the only unsafe characters you have to encode as entities are: < -> &lt; > -> &gt; " -> &quot; & -> &amp; (google xml entities to be sure). I don't know what language you use , but fo
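
The encoding step described in this reply can be sketched in Python with the standard library. The function name `solr_add_xml` and the field names `id` and `content` are assumptions made for illustration; a real schema will define its own fields.

```python
from xml.sax.saxutils import escape


def solr_add_xml(doc_id: str, html: str) -> str:
    """Build a Solr <add> message with the HTML escaped as a plain string value."""
    # escape() encodes &, < and >; the extra mapping also encodes double quotes.
    body = escape(html, {'"': "&quot;"})
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{body}</field>'
        "</doc></add>"
    )
```

With the markup escaped this way, the XML parser sees a single text node inside the field element, and the original HTML comes back unchanged when the message is parsed. An HTML-stripping analyzer on the field can then remove the tags at index time.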

Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Ravish Bhagdev
Hello, Sorry for stupid question. I'm trying to index html file as one of the fields in Solr, I've setup appropriate analyzer in schema but I'm not sure how to add html content to Solr. Encapsulating HTML content within field tag is obviously not valid. How do I add html content? Hope the query

Re: Indexing HTML and other doc types

2007-07-06 Thread Otis Gospodnetic
he.org Sent: Friday, July 6, 2007 2:19:21 AM Subject: Re: Indexing HTML and other doc types I guess I misread your original question. I believe Nutch would be the choice for crawling, however I do not know about its abilities for indexing other document types. If you needed to index multiple do

RE: Indexing HTML and other doc types

2007-07-06 Thread Teruhiko Kurosaka
Peter, I was playing with Nutch for quite some time before Solr, so I know Nutch better than Solr. Nutch has a plugin mechanism so that you can add a parser for a document type. It comes with parser plugins for most popular doc types (with varying degrees of international text support). My que

Re: Indexing HTML and other doc types

2007-07-05 Thread Peter Manis
I guess I misread your original question. I believe Nutch would be the choice for crawling, however I do not know about its abilities for indexing other document types. If you needed to index multiple document types such as PDF, DOC, etc and Nutch does not provide functionality to do so you woul

RE: Indexing HTML and other doc types

2007-07-05 Thread Teruhiko Kurosaka
Thank you, Otis and Peter, for your replies. > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > doc of some type -> parse content into various fields -> post to Solr I understand this part, but the question is who should do this. I was under assumption that it's Solr client's job to crawl the

Re: Indexing HTML and other doc types

2007-07-04 Thread Peter Manis
A coworker of mine posted the code that we used for adding pdf, doc, xls, etc documents into solr. You can find the files at the following location. https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel Just apply the patch and put the

Re: Indexing HTML and other doc types

2007-07-03 Thread Otis Gospodnetic
Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Teruhiko Kurosaka <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Tuesday, July 3, 2007 8:56:23 PM Subject: Indexing HTML and other doc types

Indexing HTML and other doc types

2007-07-03 Thread Teruhiko Kurosaka
Solr looks very good for indexing and searching structured data. But I noticed there is no tool in the Solr distribution with which documents of other doc types can be indexed. Are there other side projects that develop Solr clients for indexing documents of other doc types? Or is the generic f