Hello,
I have been trying this for several days without success. (nutch 1.16 - solr
7.3.1)
I have followed this description:
https://cwiki.apache.org/confluence/display/nutch/IndexMetatags
Below I put my file nutch-site.xml
I have created the core following this description:
https://cwiki.apache
Hi,
Does one of you have some pointers (articles, papers, etc...) or experience
to share about the right way for indexing the html tables content into Solr
Documents?
Thanks!
Benjamin
Hi everyone,
I'm a new Solr user and I need to index some HTML files based on their
tags and classes, and then build a web interface that provides the
document search function. Now I have some questions about how to
index those HTML files using my own rules. I have checked the documents
y for
the various mapping parameters.
-- Jack Krupansky
-Original Message-
From: Liz Sommers
Sent: Friday, March 21, 2014 12:56 PM
To: solr-user
Subject: SolrCell and indexing HTML
I am trying to write a POC about indexing URLs with Solr using solrJ and
solrCell. (The code is wr
I've never tried indexing via groovy or using solrCell but I think you might be
working a bit too low level in solrj if you're just adding documents. You might
try checking out https://wiki.apache.org/solr/Solrj#Adding_Data_to_Solr and I
might be way off base :)
Thanks,
Greg
On Mar 21, 2014, a
I am trying to write a POC about indexing URLs with Solr using solrJ and
solrCell. (The code is written in groovy).
The relevant code is here
ContentStreamUpdateRequest req = new
ContentStreamUpdateRequest("/update/extract");
req.setParam("literal.id",p.id.toString())
req.setPa
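For readers who are not working in Groovy, the same kind of request can be sketched in Python as a plain URL. The `/update/extract` handler and the `literal.id` parameter come from the snippet above; the base URL, the core name, and the helper name `extract_url` are assumptions made for illustration, and `commit=true` is added only so that a test document becomes searchable right away.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, commit=True):
    """Build the URL for posting one file to Solr's extracting handler.

    base_url is a hypothetical core URL such as
    http://localhost:8983/solr/mycore (an assumption, not from the thread).
    """
    # literal.<field> sets a field value directly on the extracted document.
    params = {"literal.id": str(doc_id)}
    if commit:
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"
```

The file itself would then be sent as the body of a POST to that URL, for example with `curl --data-binary` or any HTTP client.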
Hello,
I am using Nutch 1.4 with Solr 3.6.0 and would like to get the HTML keywords
and description metatags indexed into Solr. On the Nutch side I have followed
the http://wiki.apache.org/nutch/IndexMetatags to get nutch parsing the
extracting the metatags (using index-metatags and parse-metat
processing chain, but
>> that may be too much effort compared to the HTML strip filter.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: okayndc
>> Sent: Monday, April 30, 2012 10:07 AM
>> To: solr-user@lucene.apache.org
>> Subject: Solr: e
ll be stripped of the tags
> during analysis and be searchable just like a normal text field. Then,
> search will not see "".
>
>
> -- Jack Krupansky
>
> -Original Message- From: okayndc
> Sent: Tuesday, May 01, 2012 10:08 AM
> To: solr-user@lucene
will
not see "".
-- Jack Krupansky
-Original Message-
From: okayndc
Sent: Tuesday, May 01, 2012 10:08 AM
To: solr-user@lucene.apache.org
Subject: Re: extracting/indexing HTML via cURL
Thank you Jack.
So, it's not doable/possible to search and highlight keywords with
-- Jack Krupansky
>
> -Original Message- From: okayndc
> Sent: Monday, April 30, 2012 5:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr: extracting/indexing HTML via cURL
>
> Great, thank you for the input. My understanding of HTMLStripCharFilter is
> that it stri
Sent: Monday, April 30, 2012 5:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr: extracting/indexing HTML via cURL
Great, thank you for the input. My understanding of HTMLStripCharFilter is
that it strips HTML tags, which is not what I want ~ is this correct? I
want to keep the HTML tags i
iginal Message- From: okayndc
> Sent: Monday, April 30, 2012 10:07 AM
> To: solr-user@lucene.apache.org
> Subject: Solr: extracting/indexing HTML via cURL
>
>
> Hello,
>
> Over the weekend I experimented with extracting HTML content via cURL and
> just
> wondering why the e
nday, April 30, 2012 10:07 AM
To: solr-user@lucene.apache.org
Subject: Solr: extracting/indexing HTML via cURL
Hello,
Over the weekend I experimented with extracting HTML content via cURL and
just
wondering why the extraction/indexing process does not include the HTML
tags.
It seems as though
Hello,
Over the weekend I experimented with extracting HTML content via cURL and
just
wondering why the extraction/indexing process does not include the HTML
tags.
It seems as though the HTML tags either being ignored or stripped somewhere
in the pipeline.
If this is the case, is it possible to in
Thank you so much for your help... I will try it...
--
View this message in context:
http://lucene.472066.n3.nabble.com/Indexing-HTML-files-in-SOLR-tp896530p910555.html
Sent from the Solr - User mailing list archive at Nabble.com.
s.
> It will be great if u answer my question :
> Is there any better approach to achieve the same functionality ?
>
> Regards,
> Siddharth
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-HTML-files-in-SOLR-tp896530p902644.html
>
nts.
It will be great if u answer my question :
Is there any better approach to achieve the same functionality ?
Regards,
Siddharth
wrote:
>
> Hi,
> I am using SOLR with Apache Tomcat. I have some .html
> files(contains the articles) stored at XYZ location. How can I index these
> .html files in SOLR?
>
> Regards,
> Siddharth
> --
> View this message in context:
> http://lucene.472066.
Hi,
I am using SOLR with Apache Tomcat. I have some .html
files(contains the articles) stored at XYZ location. How can I index these
.html files in SOLR?
Regards,
Siddharth
--
View this message in context:
http://lucene.472066.n3.nabble.com/Indexing-HTML-files-in-SOLR
I use the
> HTMLStripCharFilterFactory?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p885797.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
--
Lance Norskog
goks...@gmail.com
Do I even need to tidy/clean up the html if I use the
HTMLStripCharFilterFactory?
On Jun 9, 2010, at 8:38pm, Blargy wrote:
What is the preferred way to index html using DIH (my html is stored
in a
blob field in our database)?
I know there is the built in HTMLStripTransformer but that doesn't
seem to
work well with malformed/incomplete HTML. I've created a custom
tra
Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at
index time?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
--
View this message in context:
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884592.html
Sent from
te html? Thanks
>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
--
Lance Norskog
goks...@gmail.com
with malformed/incomplete html? Thanks
Thank you! That's even more than I wanted to know. ;)
Georg
On Tue, Mar 2, 2010 at 10:05 PM, Walter Underwood wrote:
> You are in luck, because Avi Rappoport has just written a tutorial about
> how to do this. It is available from Lucid Imagination:
>
>
> http://www.lucidimagination.com/solutions/wh
You are in luck, because Avi Rappoport has just written a tutorial about how to
do this. It is available from Lucid Imagination:
http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr
I've just started reviewing it, but knowing Avi, I expect it to be very he
There is an HTML filter documented here, which might be of some help -
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
Control characters can be eliminated using code like this -
http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-44
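As a rough illustration of the control-character step (this is a sketch, not the code from the linked pysolr file), a single regular expression can drop the characters that XML 1.0 does not allow. Tab, line feed and carriage return are the only characters below U+0020 that XML permits, so they are kept. The function name `strip_control_chars` is made up for this example.

```python
import re

# C0 control characters other than tab (\x09), LF (\x0a) and CR (\x0d),
# which are the only ones XML 1.0 accepts in character data.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Remove characters that would make a Solr XML update malformed."""
    return _CONTROL_CHARS.sub("", text)
```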
Hi, how do I properly index HTML documents? All the documents are HTML, some
containing characters encoded like ží ... Is there a character
filter for filtering these codes? Is there a way to strip the HTML tags out?
Does solr weight the terms in the document based on where they appear?..
words in hea
Hello Frank,
Answers are inline:
Frank van Lingen said:
> I recently started working with solr and find it easy to setup and
> tinker with.
>
> I now want to scale up my setup and was wondering if there is an
> application/component that can do the following (I was not able to find
> documentatio
I think you might be looking for Apache Tika.
On Mon, Jan 25, 2010 at 3:55 PM, Frank van Lingen wrote:
> I recently started working with solr and find it easy to setup and tinker
> with.
>
> I now want to scale up my setup and was wondering if there is an
> application/component that can do the
I recently started working with solr and find it easy to setup and tinker with.
I now want to scale up my setup and was wondering if there is an
application/component that can do the following (I was not able to
find documentation on this on the solr site):
-Can I send solr an xml document with a
hey, XPathEntityProcessor does not work with wildcard xpath like '//a...@class'
if you just wish to index html, use a PlainTextEntityProcessor with
HTMLStripTransformer
On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen
wrote:
> *HI there-**
> *
> *I'm trying to get the dataimporthandler working to rec
On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen <
daniel.michael.co...@gmail.com> wrote:
> *HI there-**
> *
> *I'm trying to get the dataimporthandler working to recursively parse the
> content of a root directory, which contain several other directories
> beneath
> it... The indexing seems to encou
amount of
string processing it does, the fact that it is a Reader probably does not
affect its performance.
Cheers,
Lance
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML
/HTMLStripWhitespaceTokenizerFactory.java
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Inde
Actually, it's very easy: http://us2.php.net/strip_tags
I also store the data in a separate field with the html intact for
display. In that case, I use urlencode on the string.
David
McBride, John wrote:
Hello,
In my application I wish to index articles which are stored in HTML
format.
Up
Hi,
Maybe this one?
http://htmlparser.sourceforge.net/
/Jimi
Quoting "McBride, John" <[EMAIL PROTECTED]>:
Hello,
In my application I wish to index articles which are stored in HTML
format.
Upon indexing these the html gets stored along with the content of the
article, which is undesirable.
Hello,
In my application I wish to index articles which are stored in HTML
format.
Upon indexing these the html gets stored along with the content of the
article, which is undesirable.
Do you know of any common way of parsing the text content from HTML
before adding to SOLR? I understand SOLR 1
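One client-side way to do this, sketched here with only the Python standard library, is to pull the text out of the markup before the document is sent to Solr. This is an assumption about how one might approach it, not a method taken from the thread; the names `TextExtractor` and `html_to_text` are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc. for us
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space, so a tag inside a word splits it;
        # runs of whitespace are then collapsed to single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

The original HTML can still be kept in a separate stored field for display, as suggested elsewhere in this thread.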
m: Vinci <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, March 25, 2008 4:25:10 PM
> Subject: Fields, Facets and Indexing html document
>
>
> Hi all,
>
> I want to Solr to index my html document collection. After I read number
> of
> tutorial and goog
From: Vinci <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 4:25:10 PM
Subject: Fields, Facets and Indexing html document
Hi all,
I want Solr to index my HTML document collection. After reading a number of
tutorials and Google searches, I have some questions.
Hi all,
I want Solr to index my HTML document collection. After reading a number of
tutorials and Google searches, I have some questions...
1. Can I index HTML documents directly?
2. What should I change in the default schema.xml for indexing HTML documents?
3. Can fields to be defined by a combination
while searching (anchors/titles etc).
Why is there no documentation about indexing HTML specifically using
solr. How does nutch do it? does it strip out html in the snippets
it returns?
Solr isn't a web search engine, and doesn't do any special processing
of html (although you can ask it t
the resulting html in a webpage. Is
it possible to strip out all HTML tags completely in result set?
Would you recommend sending stripped out text to solr instead? But
doesn't Solr use HTML features while searching (anchors/titles etc).
Why is there no documentation about indexing HTML specific
On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote:
What's odd about this is that the error seems to indicate that I did.
Actually the error message looks like you escaped too much. You
should _not_ escape , only the contents of it.
Erik
The full text (minus the stack trace)
What's odd about this is that the error seems to indicate that I did.
The full text (minus the stack trace) was
org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT
to read text (position: START_TAG seen ...... @4:37)
Or is that just a by
Michael,
I think the issue is that you're not escaping the values.
Send something like this to Solr instead:
linktext
a>
Erik
On Aug 27, 2007, at 9:29 AM, Michael Kimsal wrote:
Hello
I'm trying to index individual lines of an HTML file, a
I think you can use the HTMLStripWhitespaceTokenizerFactory.
Look here :
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e
I hope this helps
On 27/08/07, Michael Kimsal <[EMAIL PROTECTED]> wrote:
>
> Hello
>
> I'm trying to index individu
Hello
I'm trying to index individual lines of an HTML file, and I'm hitting this
error:
TEXT must be immediately followed by END_TAG and not START_TAG
I've got something that looks like
4
linktext
Actually, that sample code above, as its own data file POSTed to SOLR,
throws
parser must be
Thanks Jérôme!
It seems to work now. I just hope the provided
HTMLStripWhitespaceTokenizerFactory will strip the right tags now.
I use Java and used HtmlEncoder provided in
http://itext.ugent.be/library/api/ for encoding with success. (just
in case someone happens to search this thread)
Ravi
You need to encode your html content so it can be included as a normal
'string' value in your xml element.
As far as I remember, the only unsafe characters you have to encode as
entities are:
< -> &lt;
> -> &gt;
" -> &quot;
& -> &amp;
(google xml entities to be sure).
I don't know what language you use, but fo
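In Python, for instance, the encoding described above could look roughly like the sketch below. `xml.sax.saxutils.escape` is a standard-library function that replaces `&`, `<` and `>`; the field names `id` and `content` and the helper name `solr_add_xml` are placeholders, not names from the poster's schema.

```python
from xml.sax.saxutils import escape


def solr_add_xml(doc_id, html_body):
    """Wrap one document in a Solr <add> message, escaping the HTML payload."""
    # escape() handles &, < and >. Double quotes only need escaping inside
    # attribute values, so they can stay as-is in element content.
    return (
        "<add><doc>"
        f'<field name="id">{escape(str(doc_id))}</field>'
        f'<field name="content">{escape(html_body)}</field>'
        "</doc></add>"
    )
```

An XML parser on the Solr side turns the entities back into the original characters, so the field receives the HTML unchanged.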
Hello,
Sorry for the stupid question. I'm trying to index an html file as one of
the fields in Solr. I've set up an appropriate analyzer in the schema but I'm
not sure how to add html content to Solr. Encapsulating HTML content
within a field tag is obviously not valid. How do I add html content?
Hope the query
he.org
Sent: Friday, July 6, 2007 2:19:21 AM
Subject: Re: Indexing HTML and other doc types
I guess I misread your original question. I believe Nutch would be the
choice for crawling, however I do not know about its abilities for indexing
other document types. If you needed to index multiple do
Peter,
I was playing with Nutch for quite some time before Solr, so
I know Nutch better than Solr. Nutch has a plugin mechanism
so that you can add a parser for a document type. It comes with
parser plugins for most popular doc types (with varying degrees of
international text support).
My que
I guess I misread your original question. I believe Nutch would be the
choice for crawling, however I do not know about its abilities for indexing
other document types. If you needed to index multiple document types such
as PDF, DOC, etc and Nutch does not provide functionality to do so you woul
Thank you, Otis and Peter, for your replies.
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> doc of some type -> parse content into various fields -> post to Solr
I understand this part, but the question is who should do this.
I was under assumption that it's Solr client's job to crawl the
A coworker of mine posted the code that we used for adding pdf, doc, xls,
etc documents into solr. You can find the files at the following location.
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
Just apply the patch and put the
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, July 3, 2007 8:56:23 PM
Subject: Indexing HTML and other doc types
Solr looks very good for indexing and searching structured data.
But I noticed there is no tool in the Solr distribution with which documents
of other doc types can be indexed. Are there other side projects that
develop Solr clients for indexing documents of other doc types?
Or is the generic f