Re: exception while feeding converted text from pdf

2008-05-14 Thread Brian Carmalt
Hello Cam,

Are you writing your XML by hand, as in no XML writer? That can cause
problems. In your exception it says "latitude 59&"; the bare & should have
been converted to '&amp;' (I think). If you can use Java 6, there is an
XMLStreamWriter in javax.xml.stream that does automatic special-character
escaping. This can simplify writing simple XML.

Unfortunately, the stream writer does not filter out invalid XML
characters, so I will point you to a helpful website:
http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html
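
For illustration, here is a minimal, untested sketch (my own example, not
your code) that writes an escaped Solr add document with XMLStreamWriter and
strips characters that are illegal in XML 1.0, along the lines of that blog
post:

import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class EscapedSolrDoc {
    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("add");
        w.writeStartElement("doc");
        w.writeStartElement("field");
        w.writeAttribute("name", "body");
        // writeCharacters() escapes '&' to '&amp;' and '<' to '&lt;' for us.
        w.writeCharacters(stripInvalidXmlChars("latitude 59& ..."));
        w.writeEndElement(); // field
        w.writeEndElement(); // doc
        w.writeEndElement(); // add
        w.flush();
        System.out.println(out);
    }

    // Drops characters outside the XML 1.0 character range (surrogate
    // pairs ignored for brevity); see the blog post above for a fuller
    // treatment.
    static String stripInvalidXmlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}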


Hope this helps.

Brian

On Wednesday, 14.05.2008, at 19:23 +0300, Cam Bazz wrote:
> Hello,
> 
> I made a simple Java program to convert my PDFs to text, and then to an XML
> file.
> I am getting a strange exception. I think the converted files have some
> errors. Should I encode the text string that I extract from the PDFs in a
> special way?
> 
> Best,
> -C.B.
> 
> SEVERE: org.xmlpull.v1.XmlPullParserException: entity reference names can not
> start with character ' ' (position: START_TAG seen
> ...ay\n  latitude 59& ...
> @80:64)
> at org.xmlpull.mxp1.MXParser.parseEntityRef(MXParser.java:2212)
> at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1275)
> at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
> at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
> at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> at org.mortbay.jetty.Server.handle(Server.java:285)
> at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> at
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)



Fwd: [Solr Wiki] Update of "DataImportHandler" by YukiDog

2008-05-14 Thread Shalin Shekhar Mangar
Hello,

If you find a problem with DataImportHandler or the wiki documentation, then
please do report it back in the mailing list so that we may have a chance to
verify your problem and propose solutions. It may help us improve the tool
as well as the documentation.

In the wiki edit below, the changes were actually incorrect, so I've reverted
them. The variables used do not need to include any entity's name except for
the entity to which they belong. If you observe something different, it may
be a bug or the usage may be incorrect. For example:

This does not work -- "select description as cat from category where id =
'${item.item_category.CATEGORY_ID}'"
This works correctly -- "select description as cat from category where id =
'${item_category.CATEGORY_ID}'"
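
For illustration, here is a minimal sketch of how such nested entities might
look in a DataImportHandler data-config (the surrounding table and column
names are made up; only the category query above is from the wiki):

<document>
  <entity name="item" query="select * from item">
    <entity name="item_category"
            query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
      <!-- reference only the immediate parent entity, not the full path -->
      <entity name="category"
              query="select description as cat from category where id='${item_category.CATEGORY_ID}'"/>
    </entity>
  </entity>
</document>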

Thanks!

-- Forwarded message --
From: Apache Wiki <[EMAIL PROTECTED]>
Date: Thu, May 15, 2008 at 11:02 AM
Subject: [Solr Wiki] Update of "DataImportHandler" by YukiDog
To: [EMAIL PROTECTED]


Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for
change notification.

The following page has been changed by YukiDog:
http://wiki.apache.org/solr/DataImportHandler

--

@@ -179, +179 @@

  }}}
- The ''item_id'' foreign key in feature table is joined together with ''id'' primary key in ''item'' to retrieve rows for each row in ''item''. In a similar fashion, we join ''item'' and ''category'' (which is a many-to-many relationship). Notice how we join these two tables using the intermediate table ''item_category'' again using templated SQL.
+ The ''item_id'' foreign key in feature table is joined together with ''id'' primary key in ''item'' to retrieve rows for each row in ''item''. In a similar fashion, we join ''item'' and ''category'' (which is a many-to-many relationship). Notice how we join these two tables using the intermediate table ''item_category'' again using templated SQL.  Also notice how the variable used in the ''category'' entity must include the name of each parent entity up to the root.
  {{{

@@ -198, +198 @@

@@ -229, +229 @@

+ query="select description as cat from category where id = '${item.item_category.CATEGORY_ID}'">

-- 
Regards,
Shalin Shekhar Mangar.


Re: Field Grouping

2008-05-14 Thread oleg_gnatovskiy

Yes, that is the patch I am trying to get to work. It doesn't have a feature
for distributed search.


ryantxu wrote:
> 
> You may want to check "field collapsing"
> https://issues.apache.org/jira/browse/SOLR-236
> 
> There is a patch that works against 1.2, but the one for trunk needs  
> some work before it can work...
> 
> ryan
> 
> 
> On May 13, 2008, at 2:46 PM, oleg_gnatovskiy wrote:
>>
>> There is an XSLT example here:
>> http://wiki.apache.org/solr/XsltResponseWriter
>> , but it doesn't seem like that would work either... This example would
>> only do a group by for the current page. If I use Solr for pagination,
>> this would not work for me.
>>
>>
>> oleg_gnatovskiy wrote:
>>>
>>> But I don't want the search results to be ranked based on that field. I
>>> only want all the documents with the same value grouped together... The
>>> way my system is set up, most documents will have that field empty. Thus,
>>> if I sort by it, those documents that have a value will bubble to the
>>> top...
>>>
>>>
>>>
>>> Yonik Seeley wrote:

 On Mon, May 12, 2008 at 9:58 PM, oleg_gnatovskiy
 <[EMAIL PROTECTED]> wrote:
> Hello. I was wondering if there is a way to get solr to return fields with
> the same value for a particular field together. For example I might want to
> have all the documents with exactly the same name field all returned next
> to each other. Is this possible? Thanks!

 Sort by that field.  Since you can only sort by fields with a single
 term at most (this rules out full-text fields), you might want to do a
 copyField of the "name" field to something like a "name_s" field which
 is of type string (which can be sorted on).

 -Yonik


>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Field-Grouping-tp17199592p17215641.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Field-Grouping-tp17199592p17244589.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Chinese Language + Solr

2008-05-14 Thread j . L
I don't know the cost.

I know the bigger Chinese search engines use it.

Many Chinese people who study and use full-text search think it is the best
Chinese analyzer you can buy.

Baidu (www.baidu.com) is the biggest Chinese search engine, and Google China
is No. 2.

Baidu does not use it (http://www.hylanda.com/);
they use their own Chinese analyzer.




On Thu, May 15, 2008 at 8:45 AM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> Out of curiosity, what's the cost (the site is in Chinese, so I can't tell
> :( )?
> BasisTech are the main people for this type of stuff.  Expensive, though, I
> believe.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>


-- 
regards
j.L


Re: Chinese Language + Solr

2008-05-14 Thread Otis Gospodnetic
Out of curiosity, what's the cost (the site is in Chinese, so I can't tell :( )?
BasisTech are the main people for this type of stuff.  Expensive, though, I 
believe.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: j.L <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, May 14, 2008 8:38:22 PM
> Subject: Re: Chinese Language + Solr
> 
> If you want a commercial analyzer, I recommend http://www.hylanda.com/ (it is
> the best analyzer for Chinese)
> 
> On Thu, May 15, 2008 at 8:32 AM, j. L wrote:
> 
> > You can try je-analyzer; I am building a 17M-document search site with Solr
> > and je-analyzer.
> >
> >
> > On Thu, May 15, 2008 at 6:44 AM, Walter Underwood 
> > wrote:
> >
> >> N-gram works pretty well for Chinese, there are even studies to
> >> back that up.
> >>
> >> Do not use the N-gram matches for highlighting. They look really
> >> stupid to native speakers.
> >>
> >> wunder
> >>
> >> On 5/14/08 2:03 PM, "Otis Gospodnetic" 
> >> wrote:
> >>
> >> > There are no free morphological analyzers for Chinese (are there for any
> >> > language?) that I know.  People tend to use one of the n-gram analyzers
> >> from
> >> > Lucene contrib.  I've used them before and they do OK.
> >> >
> >> >
> >> > Otis
> >> > --
> >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >> >
> >> >
> >> > - Original Message 
> >> >> From: Francisco Sanmartin 
> >> >> To: solr-user@lucene.apache.org
> >> >> Sent: Wednesday, May 14, 2008 4:54:05 PM
> >> >> Subject: Chinese Language + Solr
> >> >>
> >> >> I have had successful experiences using Sorl with an English website,
> >> >> and now I am going to deploy Solr in a chinese site. I've been looking
> >> >> in the mailing list and there are some useful information in the old
> >> posts.
> >> >> But, we would like some kind of feedback of the people who already have
> >> >> deployed Solr in any CJK Language.
> >> >>
> >> >> Is there any free and good analyzer? (Preferible morphological)
> >> >> Among all the commercial analyzers, what would you recommend? Is there
> >> >> any of them that works ok out-of-the-box with Solr?
> >> >>
> >> >> Thanks in advance.
> >> >>
> >> >> Pako
> >> >
> >>
> >>
> >
> >
> > --
> > regards
> > j.L
> 
> 
> 
> 
> -- 
> regards
> j.L



Re: result count query

2008-05-14 Thread Otis Gospodnetic
There is no way to know without doing the search.  Using rows=0 you are really 
just avoiding getting the actual hits in the response.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: solr_user <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, May 14, 2008 8:33:18 PM
> Subject: Re: result count query
> 
> 
> Thanks Otis,
> 
>   Actually what I really want to do is just check whether the query is going
> to return any results or not.  I tried the rows=0 thing and that works quite
> efficiently.  Just wondering if there is anything even more efficient then
> that that will answer whether the query has any hits or not.
> 
> Solr-user
> 
> 
> Otis Gospodnetic wrote:
> > 
> > I think specifying rows=0 in the URL gets you that number without giving
> > you the actual results.
> > 
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > - Original Message 
> >> From: solr_user 
> >> To: solr-user@lucene.apache.org
> >> Sent: Wednesday, May 14, 2008 4:53:05 PM
> >> Subject: result count query
> >> 
> >> 
> >> Hi,
> >> 
> >>   Is there an efficient way to just get the result count of a query
> >> issued
> >> to Solr?
> >> 
> >> Solr-user
> >> -- 
> >> View this message in context: 
> >> http://www.nabble.com/result-count-query-tp17240159p17240159.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> > 
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/Re%3A-result-count-query-tp17240818p17243737.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Chinese Language + Solr

2008-05-14 Thread j . L
If you can read Chinese and want to write your own Chinese analyzer, you can
see http://www.googlechinablog.com/2006/04/blog-post_10.html



2008/5/15 j. L <[EMAIL PROTECTED]>:

> If you want a commercial analyzer, I recommend http://www.hylanda.com/
> (it is the best analyzer for Chinese)
>
>
> On Thu, May 15, 2008 at 8:32 AM, j. L <[EMAIL PROTECTED]> wrote:
>
>> You can try je-analyzer; I am building a 17M-document search site with Solr
>> and je-analyzer.
>>
>>
>> On Thu, May 15, 2008 at 6:44 AM, Walter Underwood <[EMAIL PROTECTED]>
>> wrote:
>>
>>> N-gram works pretty well for Chinese, there are even studies to
>>> back that up.
>>>
>>> Do not use the N-gram matches for highlighting. They look really
>>> stupid to native speakers.
>>>
>>> wunder
>>>
>>> On 5/14/08 2:03 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>> > There are no free morphological analyzers for Chinese (are there for
>>> any
>>> > language?) that I know.  People tend to use one of the n-gram analyzers
>>> from
>>> > Lucene contrib.  I've used them before and they do OK.
>>> >
>>> >
>>> > Otis
>>> > --
>>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >
>>> >
>>> > - Original Message 
>>> >> From: Francisco Sanmartin <[EMAIL PROTECTED]>
>>> >> To: solr-user@lucene.apache.org
>>> >> Sent: Wednesday, May 14, 2008 4:54:05 PM
>>> >> Subject: Chinese Language + Solr
>>> >>
>>> >> I have had successful experiences using Sorl with an English website,
>>> >> and now I am going to deploy Solr in a chinese site. I've been looking
>>> >> in the mailing list and there are some useful information in the old
>>> posts.
>>> >> But, we would like some kind of feedback of the people who already
>>> have
>>> >> deployed Solr in any CJK Language.
>>> >>
>>> >> Is there any free and good analyzer? (Preferible morphological)
>>> >> Among all the commercial analyzers, what would you recommend? Is there
>>> >> any of them that works ok out-of-the-box with Solr?
>>> >>
>>> >> Thanks in advance.
>>> >>
>>> >> Pako
>>> >
>>>
>>>
>>
>>
>> --
>> regards
>> j.L
>
>
>
>
> --
> regards
> j.L




-- 
regards
j.L


Re: Chinese Language + Solr

2008-05-14 Thread j . L
If you want a commercial analyzer, I recommend http://www.hylanda.com/ (it is
the best analyzer for Chinese).

On Thu, May 15, 2008 at 8:32 AM, j. L <[EMAIL PROTECTED]> wrote:

> You can try je-analyzer; I am building a 17M-document search site with Solr
> and je-analyzer.
>
>
> On Thu, May 15, 2008 at 6:44 AM, Walter Underwood <[EMAIL PROTECTED]>
> wrote:
>
>> N-gram works pretty well for Chinese, there are even studies to
>> back that up.
>>
>> Do not use the N-gram matches for highlighting. They look really
>> stupid to native speakers.
>>
>> wunder
>>
>> On 5/14/08 2:03 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]>
>> wrote:
>>
>> > There are no free morphological analyzers for Chinese (are there for any
>> > language?) that I know.  People tend to use one of the n-gram analyzers
>> from
>> > Lucene contrib.  I've used them before and they do OK.
>> >
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> > - Original Message 
>> >> From: Francisco Sanmartin <[EMAIL PROTECTED]>
>> >> To: solr-user@lucene.apache.org
>> >> Sent: Wednesday, May 14, 2008 4:54:05 PM
>> >> Subject: Chinese Language + Solr
>> >>
>> >> I have had successful experiences using Sorl with an English website,
>> >> and now I am going to deploy Solr in a chinese site. I've been looking
>> >> in the mailing list and there are some useful information in the old
>> posts.
>> >> But, we would like some kind of feedback of the people who already have
>> >> deployed Solr in any CJK Language.
>> >>
>> >> Is there any free and good analyzer? (Preferible morphological)
>> >> Among all the commercial analyzers, what would you recommend? Is there
>> >> any of them that works ok out-of-the-box with Solr?
>> >>
>> >> Thanks in advance.
>> >>
>> >> Pako
>> >
>>
>>
>
>
> --
> regards
> j.L




-- 
regards
j.L


Re: result count query

2008-05-14 Thread solr_user

Thanks Otis,

  Actually what I really want to do is just check whether the query is going
to return any results or not.  I tried the rows=0 thing and that works quite
efficiently.  Just wondering if there is anything even more efficient then
that that will answer whether the query has any hits or not.

Solr-user


Otis Gospodnetic wrote:
> 
> I think specifying rows=0 in the URL gets you that number without giving
> you the actual results.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
>> From: solr_user <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, May 14, 2008 4:53:05 PM
>> Subject: result count query
>> 
>> 
>> Hi,
>> 
>>   Is there an efficient way to just get the result count of a query
>> issued
>> to Solr?
>> 
>> Solr-user
>> -- 
>> View this message in context: 
>> http://www.nabble.com/result-count-query-tp17240159p17240159.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Re%3A-result-count-query-tp17240818p17243737.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Chinese Language + Solr

2008-05-14 Thread j . L
You can try je-analyzer; I am building a 17M-document search site with Solr
and je-analyzer.

On Thu, May 15, 2008 at 6:44 AM, Walter Underwood <[EMAIL PROTECTED]>
wrote:

> N-gram works pretty well for Chinese, there are even studies to
> back that up.
>
> Do not use the N-gram matches for highlighting. They look really
> stupid to native speakers.
>
> wunder
>
> On 5/14/08 2:03 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
>
> > There are no free morphological analyzers for Chinese (are there for any
> > language?) that I know.  People tend to use one of the n-gram analyzers
> from
> > Lucene contrib.  I've used them before and they do OK.
> >
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > - Original Message 
> >> From: Francisco Sanmartin <[EMAIL PROTECTED]>
> >> To: solr-user@lucene.apache.org
> >> Sent: Wednesday, May 14, 2008 4:54:05 PM
> >> Subject: Chinese Language + Solr
> >>
> >> I have had successful experiences using Sorl with an English website,
> >> and now I am going to deploy Solr in a chinese site. I've been looking
> >> in the mailing list and there are some useful information in the old
> posts.
> >> But, we would like some kind of feedback of the people who already have
> >> deployed Solr in any CJK Language.
> >>
> >> Is there any free and good analyzer? (Preferible morphological)
> >> Among all the commercial analyzers, what would you recommend? Is there
> >> any of them that works ok out-of-the-box with Solr?
> >>
> >> Thanks in advance.
> >>
> >> Pako
> >
>
>


-- 
regards
j.L


Re: Stop words and exact phrase

2008-05-14 Thread Walter Underwood
Sorry, I was hurrying before class (training to get a service dog).
I use the DisMax handler, which can expand a query to go against
multiple fields. The per-field analysis applies at both index and
query time, so the exact field does not have stopwords removed.
Very helpful for queries like "Being There" or "To be and to have".
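
As an illustration only (field, type, and handler names here are made up,
not my actual config), the schema side might look like:

<field name="title" type="text" indexed="true" stored="true"/>
<!-- text_exact would be a fieldType with no StopFilterFactory or stemmer -->
<field name="title_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

and the dismax side like:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- weight the exact (unstopped, unstemmed) field above the stemmed one -->
    <str name="qf">title^1.0 title_exact^2.0</str>
    <str name="pf">title_exact^3.0</str>
  </lst>
</requestHandler>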

wunder

On 5/14/08 10:54 AM, "cricdigs" <[EMAIL PROTECTED]> wrote:

> 
> Hi wunder,
> 
> Thanks for your response. I am still a little confused. Solr's analysis page
> shows that the stop word is removed from the query - it's got nothing to do
> with the indexing, imo.
> 
> If indexing has removed the stop words then I should not get any results
> right? But I get the results with the stop word removed.
> 
> How do I tell Solr to send phrase queries to a field other than default?
> Will I have to code that or is it just a config setting?
> 
> Thanks.
> 
> 
> Walter Underwood wrote:
>> 
>> Try creating a separate field that does not remove stopwords,
>> populating that with a copyField and configuring the phrase
>> queries to go against that field instead.
>> 
>> I do something similar. For both regular and phrase queries,
>> we have a stemmed and stopped field and another field with
>> neither. The "exact" field has a higher boost. This helps
>> with movies like "Saw" and "Ran", which should not show
>> "see" or "run" as the top match.
>> 
>> wunder
>> 
>> On 5/14/08 8:09 AM, "cricdigs" <[EMAIL PROTECTED]> wrote:
>> 
>>> 
>>> Hi all,
>>> 
>>> Is there a config setting that I could use to not remove stop words when
>>> doing an exact phrase match. For example when searching for "the world"
>>> (in
>>> quotes) I would like to look for just that and not get results for just
>>> "world". When I look at the analysis, I see that word "the" is removed by
>>> the StopFilter even if it is in quotes. So is there a work-around to
>>> solve
>>> this?
>>> 
>>> Thanks!
>> 
>> 
>> 



Re: Chinese Language + Solr

2008-05-14 Thread Walter Underwood
N-gram works pretty well for Chinese, there are even studies to
back that up.

Do not use the N-gram matches for highlighting. They look really
stupid to native speakers.

wunder

On 5/14/08 2:03 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> There are no free morphological analyzers for Chinese (are there for any
> language?) that I know.  People tend to use one of the n-gram analyzers from
> Lucene contrib.  I've used them before and they do OK.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
>> From: Francisco Sanmartin <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, May 14, 2008 4:54:05 PM
>> Subject: Chinese Language + Solr
>> 
>> I have had successful experiences using Sorl with an English website,
>> and now I am going to deploy Solr in a chinese site. I've been looking
>> in the mailing list and there are some useful information in the old posts.
>> But, we would like some kind of feedback of the people who already have
>> deployed Solr in any CJK Language.
>> 
>> Is there any free and good analyzer? (Preferible morphological)
>> Among all the commercial analyzers, what would you recommend? Is there
>> any of them that works ok out-of-the-box with Solr?
>> 
>> Thanks in advance.
>> 
>> Pako
> 



RE: solr highlighting

2008-05-14 Thread Kevin Xiao
Yes. I did all that. Maybe my custom analyzer conflicts with highlighting. 
Thanks for the tips.

- Kevin

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 14, 2008 12:03 PM
To: solr-user@lucene.apache.org
Subject: Re: solr highlighting

The minimum "stuff" needed to highlight term X in field F is:

field F must be 'stored'
field F must have an analyzer defined
a query with term X is sent (e.g., q=X)
with parameters hl=true (or 'on'), hl.fl=F

Try it on the example:
1. get the example running
2. cd example/exampledocs
3. ./post.sh *.xml
4. execute a query:

http://localhost:8983/solr/select?indent=on&version=2.2&q=solr&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=features

-Mike

On 14-May-08, at 9:39 AM, Kevin Xiao wrote:

> Thanks Christian. I did try many options indicated in wiki, didn't
> work. So I want to see if the basics work, i.e. only define hl=true
> and a field for hl.fl. Do I need to include something global to make
> hl settings work?
>
> Thanks,
> - Kevin
>
> -Original Message-
> From: Christian Vogler [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 14, 2008 5:55 AM
> To: solr-user@lucene.apache.org
> Subject: Re: solr highlighting
>
> On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:
>> Hi there,
>>
>> I am new to solr. I want search term to be highlighted on the
>> results. I
>> thought it is pretty simple, but could not make it work. I read a
>> lot of
>> solr documents and mail archives (I wish there is a search function
>> for
>> this, we are talking about solr, aren’t we? ☺).
>
> Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as
> per
> http://wiki.apache.org/solr/HighlightingParameters.
>
> In particular, setting hl.fragsize to 0 might be what you want if I
> understand
> your question correctly.
>
> Best regards
> - Christian
> --
> Christian Vogler, Ph.D.
> Institute for Language and Speech Processing, Athens, Greece
> http://gri.gallaudet.edu/~cvogler/
> [EMAIL PROTECTED]



Re: Fwd: Grouping products

2008-05-14 Thread Tricia Williams
Perhaps the Synonym Filter would work for this.  
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters will tell 
you more.
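
For example, a hedged sketch of what that might look like: a synonyms.txt
line mapping the variant names to one made-up canonical token,

CANON IP1300, PRINTER CANON IP 1300, IP1300 CANON PRINTER BLACK => canon_ip1300

wired into the field's analyzer with

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="false"/>

You would still have to discover the variant names yourself, though.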


Tricia

Otis Gospodnetic wrote:

Hi Vender,

Solr can't do the grouping for you.  Solr can do the searching/finding for you, 
but it won't be able to recognize different model names and figure out which 
ones represent the same product.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
  

From: Vender Livre <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, May 14, 2008 5:01:01 PM
Subject: Fwd: Grouping products

-- Forwarded message --
From: Vender Livre 
Date: Wed, May 14, 2008 at 5:59 PM

Subject: Grouping products
To: [EMAIL PROTECTED]


Hi, I'm working in a software that must group similar products.

For example:

CANON IP1300

PRINTER CANON IP 1300

IP1300 CANON PRINTER BLACK

the app should group these three names, because they are the same product.
Someone told me SOLR should solve my problem. Is this true? Where could I
learn more about it?

Thanks




  




Re: Fwd: Grouping products

2008-05-14 Thread Vender Livre
Thanks.

I will study more about it.

Cheers.

On Wed, May 14, 2008 at 6:29 PM, Daniel Papasian <
[EMAIL PROTECTED]> wrote:

> Vender Livre wrote:
>
>> But it can find the most probable product, can't it?
>>
>> Is there a library or tool that do something like that?
>>
>> Someone told me SOLR would solve this problem.
>>
>
> I wouldn't say solr would solve this problem... sounds like someone sold
> you snake oil!
>
> If you wanted to use solr, I think your best bet is to use a nightly and
> run a MoreLikeThis query - http://wiki.apache.org/solr/MoreLikeThis - but
> whether that's going to work well for you with so few terms, I have no idea.
>  Good luck!
>
> Daniel
>


Re: Fwd: Grouping products

2008-05-14 Thread Daniel Papasian

Vender Livre wrote:

But it can find the most probable product, can't it?

Is there a library or tool that do something like that?

Someone told me SOLR would solve this problem.


I wouldn't say solr would solve this problem... sounds like someone sold 
you snake oil!


If you wanted to use solr, I think your best bet is to use a nightly and 
run a MoreLikeThis query - http://wiki.apache.org/solr/MoreLikeThis - 
but whether that's going to work well for you with so few terms, I have 
no idea.  Good luck!
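
For reference, such a request might look something like this (host, port,
and field name are just examples):

http://localhost:8983/solr/select?q=name:%22CANON+IP1300%22&mlt=true&mlt.fl=name&mlt.mintf=1&mlt.mindf=1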


Daniel


Re: Fwd: Grouping products

2008-05-14 Thread Vender Livre
Would it be easier to implement this idea with SOLR than with Lucene?

I'm a bit confused. Thanks for the help.

On Wed, May 14, 2008 at 6:21 PM, Vender Livre <[EMAIL PROTECTED]> wrote:

> But it can find the most probable product, can't it?
>
> Is there a library or tool that do something like that?
>
> Someone told me SOLR would solve this problem.
>
> The idea I had was to take a product name and match it against other names,
> and then find the best-scoring one. Then I would group the product with this
> match.
>
>
> On Wed, May 14, 2008 at 6:13 PM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
>
>> Hi Vender,
>>
>> Solr can't do the grouping for you.  Solr can do the searching/finding for
>> you, but it won't be able to recognize different model names and figure out
>> which ones represent the same product.
>>
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>> - Original Message 
>> > From: Vender Livre <[EMAIL PROTECTED]>
>> > To: solr-user@lucene.apache.org
>> > Sent: Wednesday, May 14, 2008 5:01:01 PM
>> > Subject: Fwd: Grouping products
>> >
>> > -- Forwarded message --
>> > From: Vender Livre
>> > Date: Wed, May 14, 2008 at 5:59 PM
>> > Subject: Grouping products
>> > To: [EMAIL PROTECTED]
>> >
>> >
>> > Hi, I'm working in a software that must group similar products.
>> >
>> > For example:
>> >
>> > CANON IP1300
>> >
>> > PRINTER CANON IP 1300
>> >
>> > IP1300 CANON PRINTER BLACK
>> >
>> > the app should group these three names, because they are the same
>> product.
>> > Someone told me SOLR should solve my problem. Is this true? Where could
>> I
>> > learn more about it?
>> >
>> > Thanks
>>
>>
>


-- 
.:: Rafael Barbolo Lopes ::.
http://barbolo.polinvencao.com/


Re: Fwd: Grouping products

2008-05-14 Thread Vender Livre
But it can find the most probable product, can't it?

Is there a library or tool that do something like that?

Someone told me SOLR would solve this problem.

The idea I had was to take a product name and match it against other names,
and then find the best-scoring one. Then I would group the product with this
match.

On Wed, May 14, 2008 at 6:13 PM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> Hi Vender,
>
> Solr can't do the grouping for you.  Solr can do the searching/finding for
> you, but it won't be able to recognize different model names and figure out
> which ones represent the same product.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> - Original Message 
> > From: Vender Livre <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Wednesday, May 14, 2008 5:01:01 PM
> > Subject: Fwd: Grouping products
> >
> > -- Forwarded message --
> > From: Vender Livre
> > Date: Wed, May 14, 2008 at 5:59 PM
> > Subject: Grouping products
> > To: [EMAIL PROTECTED]
> >
> >
> > Hi, I'm working in a software that must group similar products.
> >
> > For example:
> >
> > CANON IP1300
> >
> > PRINTER CANON IP 1300
> >
> > IP1300 CANON PRINTER BLACK
> >
> > the app should group these three names, because they are the same
> product.
> > Someone told me SOLR should solve my problem. Is this true? Where could I
> > learn more about it?
> >
> > Thanks
>
>


Re: Duplicates results when using a non optimized index

2008-05-14 Thread Otis Gospodnetic
Tim,

Hm, not sure what caused this.  1.2 is now quite old (yes, I know it's the last 
stable release), so if I were you I would consider moving to 1.3-dev.  It 
sounds like the index is already "polluted" with duplicate documents, so you'll 
want to rebuild the index whether you decide to stay with 1.2 or move to 
1.3-dev.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Tim Mahy <[EMAIL PROTECTED]>
> To: "solr-user@lucene.apache.org" 
> Sent: Wednesday, May 14, 2008 3:59:23 AM
> Subject: RE: Duplicates results when using a non optimized index
> 
> Hi,
> 
> thanks for the answer,
> 
> - do duplicates go away after optimization is done?
> --> no, if we search the index even after it is optimized, we still get the 
> duplicate results, even if we search on one of the slave servers which have 
> the same index through synchronization ...
> btw this is the first time we noticed this; the only thing we have had was the 
> known problem with the "too many open files", which we fixed using ulimit 
> and rebooting the tomcat server 
> 
> - are the duplicate IDs that you are seeing IDs of previously deleted documents?
> --> it is possible that these documents were uploaded earlier and have been 
> replaced...
> 
> - which Solr version are you using and can you try a recent nightly?
> --> we use the 1.2 stable build
> 
> greetings,
> Tim
> 
> From: Otis Gospodnetic [EMAIL PROTECTED]
> Sent: Wednesday, 14 May 2008 6:11
> To: solr-user@lucene.apache.org
> Subject: Re: Duplicates results when using a non optimized index
> 
> Hm, not sure why that is happening, but here is some info regarding other 
> stuff 
> from your email
> 
> - there should be no duplicates even if you are searching an index that is 
> being 
> optimized
> - why are you searching an index that is being optimized?  It's doable, but 
> people typically perform index-modifying operations on a Solr master and 
> read-only operations on Solr query slave(s)
> - do duplicates go away after optimization is done?
> - are the duplicate IDs that you are seeing IDs of previously deleted documents?
> - which Solr version are you using and can you try a recent nightly?
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
> > From: Tim Mahy 
> > To: "solr-user@lucene.apache.org" 
> > Sent: Tuesday, May 13, 2008 5:59:28 AM
> > Subject: Duplicates results when using a non optimized index
> >
> > Hi all,
> >
> > is this expected behavior when having an index like this :
> >
> > numDocs : 9479963
> > maxDoc : 12622942
> > readerImpl : MultiReader
> >
> > which is in the process of optimizing, that when we search through the index
> > we get this :
> >
> >
> > 15257559
> >
> >
> > 15257559
> >
> >
> > 17177888
> >
> >
> > 11825631
> >
> >
> > 11825631
> >
> >
> > The id field is declared like this :
> >
> >
> > and is set as the unique identity like this in the schema xml :
> >   <uniqueKey>id</uniqueKey>
> >
> > so the question: is this expected behavior, and if so, is there a way to let
> > Solr only return unique documents?
> >
> > greetings and thanx in advance,
> > Tim
> >
> >
> >
> >
> > Please see our disclaimer, http://www.infosupport.be/Pages/Disclaimer.aspx
> 
> 
> 
> 
> 
> Please see our disclaimer, http://www.infosupport.be/Pages/Disclaimer.aspx



Re: Fwd: Grouping products

2008-05-14 Thread Otis Gospodnetic
Hi Vender,

Solr can't do the grouping for you.  Solr can do the searching/finding for you, 
but it won't be able to recognize different model names and figure out which 
ones represent the same product.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Vender Livre <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, May 14, 2008 5:01:01 PM
> Subject: Fwd: Grouping products
> 
> -- Forwarded message --
> From: Vender Livre 
> Date: Wed, May 14, 2008 at 5:59 PM
> Subject: Grouping products
> To: [EMAIL PROTECTED]
> 
> 
> Hi, I'm working in a software that must group similar products.
> 
> For example:
> 
> CANON IP1300
> 
> PRINTER CANON IP 1300
> 
> IP1300 CANON PRINTER BLACK
> 
> the app should group these three names, because they are the same product.
> Someone told me SOLR should solve my problem. Is this true? Where could I
> learn more about it?
> 
> Thanks



Re: Stop words and exact phrase

2008-05-14 Thread Otis Gospodnetic
You can use field:query syntax to specify the field to search.  For 
example:

title:"who moved my cheese"

There is nothing in Solr that would let you instruct it to send phrase queries 
to one field, and other queries to other field(s).  However, you can add that 
logic to your application and alter the query by prepending the appropriate 
field name before sending the query to Solr.
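
For instance, a rough sketch of that application-side logic (the field name
title_exact is hypothetical):

// If the user typed a quoted phrase, route it to the exact-match field.
String q = userQuery.trim();
boolean isPhrase = q.startsWith("\"") && q.endsWith("\"") && q.length() > 1;
String solrQuery = isPhrase ? "title_exact:" + q : q;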


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: cricdigs <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, May 14, 2008 1:54:15 PM
> Subject: Re: Stop words and exact phrase
> 
> 
> Hi wunder,
> 
> Thanks for your response. I am still a little confused. Solr's analysis page
> shows that the stop word is removed from the query - it's got nothing to do
> with the indexing, imo.
> 
> If indexing has removed the stop words then I should not get any results
> right? But I get the results with the stop word removed. 
> 
> How do I tell Solr to send phrase queries to a field other than default?
> Will I have to code that or is it just a config setting?
> 
> Thanks.
> 
> 
> Walter Underwood wrote:
> > 
> > Try creating a separate field that does not remove stopwords,
> > populating that with a copyField and configuring the phrase
> > queries to go against that field instead.
> > 
> > I do something similar. For both regular and phrase queries,
> > we have a stemmed and stopped field and another field with
> > neither. The "exact" field has a higher boost. This helps
> > with movies like "Saw" and "Ran", which should not show
> > "see" or "run" as the top match.
> > 
> > wunder
> > 
> > On 5/14/08 8:09 AM, "cricdigs" wrote:
> > 
> >> 
> >> Hi all,
> >> 
> >> Is there a config setting that I could use to not remove stop words when
> >> doing an exact phrase match. For example when searching for "the world"
> >> (in
> >> quotes) I would like to look for just that and not get results for just
> >> "world". When I look at the analysis, I see that word "the" is removed by
> >> the StopFilter even if it is in quotes. So is there a work-around to
> >> solve
> >> this?
> >> 
> >> Thanks!
> > 
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/Stop-words-and-exact-phrase-tp17233404p17237198.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Help with Solr + KStem

2008-05-14 Thread Otis Gospodnetic
Hung,

You included the KStem jar itself, and that is good, but class 
KStemFilterFactory does not exist anywhere in Solr.
You need to get it from here:
https://issues.apache.org/jira/browse/SOLR-379

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Hung Huynh <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, May 14, 2008 3:57:29 PM
> Subject: Help with Solr + KStem
> 
> 
> I have KStem.jar in solr/lib and solr/example/lib and made a change to
> schema.xml to include the KStem line (removed the Porter line):
> 
> <filter class="solr.KStemFilterFactory"/>
> 
> This is what I get when I try to hit the Solr Admin page. How can I go about
> resolving this error?
> 
> Thanks,
> 
> HH
> 
> ---
> 
> 
> HTTP ERROR: 500
> Severe errors in solr configuration.
> 
> Check your log files for more detailed infomation on what may be wrong.
> 
> If you want solr to continue after configuration errors, change: 
> 
> false
> 
> in solrconfig.xml
> 
> -
> org.apache.solr.core.SolrException: Error loading class
> 'solr.KStemFilterFactory'
> at org.apache.solr.core.Config.findClass(Config.java:220)
> at org.apache.solr.core.Config.newInstance(Config.java:225)
> at
> org.apache.solr.schema.IndexSchema.readTokenFilterFactory(IndexSchema.java:6
> 29)
> at
> org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:607)
> at
> org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:331)
> at org.apache.solr.schema.IndexSchema.(IndexSchema.java:71)
> at org.apache.solr.core.SolrCore.(SolrCore.java:196)
> at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:177)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
> at
> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
> at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
> at
> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
> at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
> at
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
> at
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
> at
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
> at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
> at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:1
> 47)
> at
> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCol
> lection.java:161)
> at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
> at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:1
> 47)
> at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
> at
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
> at org.mortbay.jetty.Server.doStart(Server.java:210)
> at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
> at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.mortbay.start.Main.invokeMain(Main.java:183)
> at org.mortbay.start.Main.start(Main.java:497)
> at org.mortbay.start.Main.main(Main.java:115)
> Caused by: java.lang.ClassNotFoundException: solr.KStemFilterFactory
> at java.net.URLClassLoader$1.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(Unknown Source)
> at java.lang.ClassLoader.loadClass(Unknown Source)
> at java.lang.ClassLoader.loadClass(Unknown Source)
> at
> org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:
> 375)
> at
> org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:
> 337)
> at java.lang.ClassLoader.loadClassInternal(Unknown Source)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Unknown Source)
> at org.apache.solr.core.Config.findClass(Config.java:204)
> ... 32 more
> 
> RequestURI=/solr/admin
> 
> Powered by Jetty://



Re: result count query

2008-05-14 Thread Otis Gospodnetic
I think specifying rows=0 in the URL gets you that number without giving you 
the actual results.
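
For example (host/port from the example setup; the numFound value is made up):

http://localhost:8983/solr/select?q=foo&rows=0

The response still reports the total, e.g.:

<result name="response" numFound="1234" start="0"/>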


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: solr_user <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, May 14, 2008 4:53:05 PM
> Subject: result count query
> 
> 
> Hi,
> 
>   Is there an efficient way to just get the result count of a query issued
> to Solr?
> 
> Solr-user
> -- 
> View this message in context: 
> http://www.nabble.com/result-count-query-tp17240159p17240159.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Chinese Language + Solr

2008-05-14 Thread Otis Gospodnetic
There are no free morphological analyzers for Chinese (are there for any 
language?) that I know.  People tend to use one of the n-gram analyzers from 
Lucene contrib.  I've used them before and they do OK.
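
For instance, a minimal sketch of a schema.xml fieldType wired to the CJK
analyzer from Lucene contrib (the type name is made up, and the contrib jar
must be on Solr's classpath):

<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>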


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Francisco Sanmartin <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, May 14, 2008 4:54:05 PM
> Subject: Chinese Language + Solr
> 
> I have had successful experiences using Solr with an English website, 
> and now I am going to deploy Solr on a Chinese site. I've been looking 
> in the mailing list and there is some useful information in the old posts.
> But, we would like some kind of feedback of the people who already have 
> deployed Solr in any CJK Language.
> 
> Is there any free and good analyzer? (Preferably morphological)
> Among all the commercial analyzers, what would you recommend? Is there 
> any of them that works ok out-of-the-box with Solr?
> 
> Thanks in advance.
> 
> Pako



Fwd: Grouping products

2008-05-14 Thread Vender Livre
-- Forwarded message --
From: Vender Livre <[EMAIL PROTECTED]>
Date: Wed, May 14, 2008 at 5:59 PM
Subject: Grouping products
To: [EMAIL PROTECTED]


Hi, I'm working in a software that must group similar products.

For example:

CANON IP1300

PRINTER CANON IP 1300

IP1300 CANON PRINTER BLACK

the app should group these three names, because they are the same product.
Someone told me SOLR should solve my problem. Is this true? Where could I
learn more about it?

Thanks


Chinese Language + Solr

2008-05-14 Thread Francisco Sanmartin
I have had successful experiences using Solr with an English website, 
and now I am going to deploy Solr on a Chinese site. I've been looking 
in the mailing list and there is some useful information in the old posts.
But we would like some kind of feedback from people who have already 
deployed Solr in any CJK language.


Is there any free and good analyzer? (Preferably morphological)
Among all the commercial analyzers, what would you recommend? Is there 
any of them that works ok out-of-the-box with Solr?


Thanks in advance.

Pako


result count query

2008-05-14 Thread solr_user

Hi,

  Is there an efficient way to just get the result count of a query issued
to Solr?

Solr-user
-- 
View this message in context: 
http://www.nabble.com/result-count-query-tp17240159p17240159.html
Sent from the Solr - User mailing list archive at Nabble.com.



Help with Solr + KStem

2008-05-14 Thread Hung Huynh

I have KStem.jar in solr/lib and solr/example/lib and made a change to
schema.xml to include the KStem line (removed the Porter line):

<filter class="solr.KStemFilterFactory"/>

This is what I get when I try to hit the Solr Admin page. How can I go about
resolving this error?

Thanks,

HH

---


HTTP ERROR: 500
Severe errors in solr configuration.

Check your log files for more detailed infomation on what may be wrong.

If you want solr to continue after configuration errors, change: 

false

in solrconfig.xml

-
org.apache.solr.core.SolrException: Error loading class
'solr.KStemFilterFactory'
at org.apache.solr.core.Config.findClass(Config.java:220)
at org.apache.solr.core.Config.newInstance(Config.java:225)
at
org.apache.solr.schema.IndexSchema.readTokenFilterFactory(IndexSchema.java:6
29)
at
org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:607)
at
org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:331)
at org.apache.solr.schema.IndexSchema.(IndexSchema.java:71)
at org.apache.solr.core.SolrCore.(SolrCore.java:196)
at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:177)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
at
org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:1
47)
at
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCol
lection.java:161)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:1
47)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassNotFoundException: solr.KStemFilterFactory
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at
org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:
375)
at
org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:
337)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.solr.core.Config.findClass(Config.java:204)
... 32 more

RequestURI=/solr/admin

Powered by Jetty://



Re: solr highlighting

2008-05-14 Thread Mike Klaas

The minimum "stuff" needed to highlight term X in field F is:

field F must be 'stored'
field F must have an analyzer defined
a query with term X is sent (e.g., q=X)
with parameters hl=true (or 'on'), hl.fl=F

Try it on the example:
1. get the example running
2. cd example/exampledocs
3. ./post.sh *.xml
4. execute a query:

http://localhost:8983/solr/select?indent=on&version=2.2&q=solr&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=features

-Mike

On 14-May-08, at 9:39 AM, Kevin Xiao wrote:

Thanks Christian. I did try many options indicated in wiki, didn't  
work. So I want to see if the basics work, i.e. only define hl=true  
and a field for hl.fl. Do I need to include something global to make  
hl settings work?


Thanks,
- Kevin

-Original Message-
From: Christian Vogler [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 14, 2008 5:55 AM
To: solr-user@lucene.apache.org
Subject: Re: solr highlighting

On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:

Hi there,

I am new to solr. I want search term to be highlighted on the results. I
thought it is pretty simple, but could not make it work. I read a lot of
solr documents and mail archives (I wish there is a search function for
this, we are talking about solr, aren’t we? ☺).


Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as per
http://wiki.apache.org/solr/HighlightingParameters.

In particular, setting hl.fragsize to 0 might be what you want if I
understand your question correctly.

Best regards
- Christian
--
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece
http://gri.gallaudet.edu/~cvogler/
[EMAIL PROTECTED]




Re: Release date of SOLR 1.3

2008-05-14 Thread Matthew Runo
There isn't a specific date so far, but I'd like to say that only once  
in the year or so I've been working with the SVN head build of Solr  
have I noticed a bug get committed. And it was fixed very quickly once  
it was found.. I think if you need to have development features you're  
probably safe to use the SVN head, but remember that it is dev, and  
you should *always* test new builds before actually using them =p


Thanks!

Matthew Runo
Software Developer
Zappos.com
702.943.7833

On May 14, 2008, at 9:08 AM, Umar Shah wrote:

Hi,

I'm using the latest trunk code from SOLR.
I am basically using function queries (sum, product, scale) for my project,
which are not present in 1.2.
I wanted to know if there is some decided date for the release of Solr 1.3.
If the date is far off or not decided, what should be the best practice to
adopt the above-mentioned feature while not compromising on the stability of
the system.

thanks
-umar




Re: Stop words and exact phrase

2008-05-14 Thread cricdigs

Hi wunder,

Thanks for your response. I am still a little confused. Solr's analysis page
shows that the stop word is removed from the query - it's got nothing to do
with the indexing, imo.

If indexing has removed the stop words then I should not get any results
right? But I get the results with the stop word removed. 

How do I tell Solr to send phrase queries to a field other than default?
Will I have to code that or is it just a config setting?

Thanks.


Walter Underwood wrote:
> 
> Try creating a separate field that does not remove stopwords,
> populating that with a copyField and configuring the phrase
> queries to go against that field instead.
> 
> I do something similar. For both regular and phrase queries,
> we have a stemmed and stopped field and another field with
> neither. The "exact" field has a higher boost. This helps
> with movies like "Saw" and "Ran", which should not show
> "see" or "run" as the top match.
> 
> wunder
> 
> On 5/14/08 8:09 AM, "cricdigs" <[EMAIL PROTECTED]> wrote:
> 
>> 
>> Hi all,
>> 
>> Is there a config setting that I could use to not remove stop words when
>> doing an exact phrase match. For example when searching for "the world"
>> (in
>> quotes) I would like to look for just that and not get results for just
>> "world". When I look at the analysis, I see that word "the" is removed by
>> the StopFilter even if it is in quotes. So is there a work-around to
>> solve
>> this?
>> 
>> Thanks!
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Stop-words-and-exact-phrase-tp17233404p17237198.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: solr highlighting

2008-05-14 Thread Kevin Xiao
Thanks Christian. I did try many options indicated in wiki, didn't work. So I 
want to see if the basics work, i.e. only define hl=true and a field for hl.fl. 
Do I need to include something global to make hl settings work?

Thanks,
- Kevin

-Original Message-
From: Christian Vogler [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 14, 2008 5:55 AM
To: solr-user@lucene.apache.org
Subject: Re: solr highlighting

On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:
> Hi there,
>
> I am new to solr. I want search term to be highlighted on the results. I
> thought it is pretty simple, but could not make it work. I read a lot of
> solr documents and mail archives (I wish there is a search function for
> this, we are talking about solr, aren’t we? ☺).

Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as per
http://wiki.apache.org/solr/HighlightingParameters.

In particular, setting hl.fragsize to 0 might be what you want if I understand
your question correctly.

Best regards
- Christian
--
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece
http://gri.gallaudet.edu/~cvogler/
[EMAIL PROTECTED]


Re: exception while feeding converted text from pdf

2008-05-14 Thread Shalin Shekhar Mangar
Yes, you need to XML-encode your text. If you use SolrJ to add documents to
Solr, it will take care of the encoding for you.
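
For example, a minimal SolrJ sketch (the URL and field names are
placeholders, and this fragment belongs inside a method that throws
Exception; imports are org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
and org.apache.solr.common.SolrInputDocument):

CommonsHttpSolrServer server =
    new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "pdf-1");
doc.addField("text", extractedPdfText); // SolrJ escapes '&', '<', etc. for you
server.add(doc);
server.commit();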

On Wed, May 14, 2008 at 9:53 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I made a simple Java program to convert my PDFs to text, and then to an XML
> file.
> I am getting a strange exception. I think the converted files have some
> errors. Should I encode the text string that I extract from the PDFs in a
> special way?
>
> Best,
> -C.B.
>
> SEVERE: org.xmlpull.v1.XmlPullParserException: entity reference names can
> not
> start with character ' ' (position: START_TAG seen
> ...ay\n  latitude 59& ...
> @80:64)
>at org.xmlpull.mxp1.MXParser.parseEntityRef(MXParser.java:2212)
>at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1275)
>at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
>at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
>at
>
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
>at
>
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
>at
>
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
>at
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
>at
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
>at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
>at
>
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>at
>
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>at
>
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>at org.mortbay.jetty.Server.handle(Server.java:285)
>at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>at
>
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>at
>
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>at
>
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>



-- 
Regards,
Shalin Shekhar Mangar.


exception while feeding converted text from pdf

2008-05-14 Thread Cam Bazz
Hello,

I made a simple java program to convert my pdfs to text, and then to xml
file.
I am getting a strange exception. I think the converted files have some
errors. should I encode the txt string that I extract from the pdfs in a
special way?

Best,
-C.B.

SEVERE: org.xmlpull.v1.XmlPullParserException: entity reference names can not
start with character ' ' (position: START_TAG seen
...ay\n  latitude 59& ...
@80:64)
at org.xmlpull.mxp1.MXParser.parseEntityRef(MXParser.java:2212)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1275)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
at
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
at
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


Release date of SOLR 1.3

2008-05-14 Thread Umar Shah
Hi,

I'm using the latest trunk code from SOLR.
I am basically using function queries (sum, product, scale) for my project,
which are not present in 1.2.
I wanted to know if there is a decided date for the release of Solr 1.3.
If the date is far off or not decided, what would be the best practice for
adopting the above-mentioned features while not compromising the stability
of the system?

thanks
-umar
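
For readers wanting to try the same functions, they can be exercised from a
query string via the _val_ hook; a hypothetical example (the popularity and
addDate fields are illustrative, not part of any stock schema):

    q=_val_:"scale(product(sum(popularity,1),recip(rord(addDate),1,1000,1000)),0,10)"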


Re: Stop words and exact phrase

2008-05-14 Thread Walter Underwood
Try creating a separate field that does not remove stopwords,
populating that with a copyField and configuring the phrase
queries to go against that field instead.

I do something similar. For both regular and phrase queries,
we have a stemmed and stopped field and another field with
neither. The "exact" field has a higher boost. This helps
with movies like "Saw" and "Ran", which should not show
"see" or "run" as the top match.

wunder

On 5/14/08 8:09 AM, "cricdigs" <[EMAIL PROTECTED]> wrote:

> 
> Hi all,
> 
> Is there a config setting that I could use to not remove stop words when
> doing an exact phrase match. For example when searching for "the world" (in
> quotes) I would like to look for just that and not get results for just
> "world". When I look at the analysis, I see that word "the" is removed by
> the StopFilter even if it is in quotes. So is there a work-around to solve
> this?
> 
> Thanks!



Stop words and exact phrase

2008-05-14 Thread cricdigs

Hi all,

Is there a config setting that I could use to not remove stop words when
doing an exact phrase match. For example when searching for "the world" (in
quotes) I would like to look for just that and not get results for just
"world". When I look at the analysis, I see that word "the" is removed by
the StopFilter even if it is in quotes. So is there a work-around to solve
this?

Thanks!
-- 
View this message in context: 
http://www.nabble.com/Stop-words-and-exact-phrase-tp17233404p17233404.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr highlighting

2008-05-14 Thread Christian Vogler
On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:
> Hi there,
>
> I am new to solr. I want search terms to be highlighted in the results. I
> thought it was pretty simple, but could not make it work. I read a lot of
> solr documents and mail archives (I wish there were a search function for
> this, we are talking about solr, aren’t we? ☺).

Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as per 
http://wiki.apache.org/solr/HighlightingParameters.

In particular, setting hl.fragsize to 0 might be what you want if I understand 
your question correctly.
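
Spelled out as a request, with an illustrative field name, that would be
something like:

    http://localhost:8983/solr/select?q=world&hl=true&hl.fl=body&hl.fragsize=0&hl.snippets=2&hl.mergeContiguous=true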

Best regards
- Christian
-- 
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece
http://gri.gallaudet.edu/~cvogler/
[EMAIL PROTECTED]


Re: help for preprocessing the query

2008-05-14 Thread Umar Shah
On Tue, May 13, 2008 at 5:04 PM, Umar Shah <[EMAIL PROTECTED]> wrote:

>
>
>
>
> On Tue, May 13, 2008 at 4:39 PM, Shalin Shekhar Mangar <
> [EMAIL PROTECTED]> wrote:
>
>> Did you put a filter-mapping in web.xml?
>
>
> no,
> I just did that and it seems to be working...
>

thanks for all the help folks, this community really ROCKS!!
I just implemented my filter successfully... and in doing so also got
introduced to the servlet world.


P.S.: I wouldn't have asked unrelated questions here had I been able to
discern the difference.

thanks again.
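
For the archive, a compact sketch of the filter this thread converged on: it
wraps the request so the q parameter is rewritten before Solr's own dispatch
filter reads it. The class name matches the thread; the transform itself is a
placeholder:

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletRequestWrapper;

    public class CustomFilter implements Filter {

        public void init(FilterConfig config) throws ServletException {}

        public void destroy() {}

        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            HttpServletRequest wrapped =
                    new HttpServletRequestWrapper((HttpServletRequest) request) {
                @Override
                public String getParameter(String name) {
                    String v = super.getParameter(name);
                    return "q".equals(name) && v != null ? transform(v) : v;
                }

                @Override
                public String[] getParameterValues(String name) {
                    // Solr may read parameters via getParameterValues or
                    // getParameterMap, so a complete filter rewrites those too.
                    String[] vs = super.getParameterValues(name);
                    if ("q".equals(name) && vs != null) {
                        String[] copy = vs.clone();
                        for (int i = 0; i < copy.length; i++) {
                            copy[i] = transform(copy[i]);
                        }
                        return copy;
                    }
                    return vs;
                }
            };
            chain.doFilter(wrapped, response);
        }

        private String transform(String q) {
            return q.trim(); // placeholder for the application-specific rewrite
        }
    }

The web.xml side needs both the <filter> declaration quoted below and a
matching <filter-mapping> (url-pattern /*) placed before Solr's own filter
mapping, so the custom filter runs first.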




>
> what is filter-mapping required for?
>
>
>>
>> On Tue, May 13, 2008 at 4:20 PM, Umar Shah <[EMAIL PROTECTED]> wrote:
>>
>> > On Mon, May 12, 2008 at 10:30 PM, Shalin Shekhar Mangar <
>> > [EMAIL PROTECTED]> wrote:
>> >
>> > > You'll *not* write a servlet. You'll implement the Filter
>> > interface
>> > >
>> http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/Filter.html
>> > >
>> > > In the doFilter method, you'll create a ServletRequestWrapper which
>> > > changes
>> > > the incoming param. Then you'll call chain.doFilter with the new
>> request
>> > > object. You'll need to add this filter before the SolrRequestFilter in
>> > > Solr's web.xml
>> >
>> > I created a CustomFilter that would dump the request contents to a file,
>> > I created the jar and added it to the solr.war in WEB_INF/lib folder
>> > I edited the web.xml in the same folder to include the following lines:
>> > <filter>
>> >   <filter-name>CustomFilter</filter-name>
>> >   <filter-class>(packagename).CustomFilter</filter-class>
>> > </filter>
>> >
>> > where CustomFilter is the name of the class implementing
>> > javax.servlet.Filter.
>> >
>> > I don't see anything in the contents of the file..
>> >
>> > thanks for your help
>> > -umar
>> >
>> >
>> > >
>> > > Look at
>> > >
>> > >
>> >
>> http://www.onjava.com/pub/a/onjava/2001/05/10/servlet_filters.html?page=1
>> > > for more details.
>> > >
>> > > On Mon, May 12, 2008 at 8:51 PM, Umar Shah <[EMAIL PROTECTED]>
>> wrote:
>> > >
>> > > > On Mon, May 12, 2008 at 8:42 PM, Shalin Shekhar Mangar <
>> > > > [EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > ServletRequest and ServletRequestWrapper are part of the Java
>> > > > servlet-api
>> > > > > (not Solr). Basically, Koji is hinting at writing a ServletFilter
>> > > > > implementation (again using servlet-api) and creating a wrapper
>> > > > > ServletRequest which modifies the underlying request params which
>> > can
>> > > > then
>> > > > > be used by Solr.
>> > > > >
>> > > >
>> > > > sorry for the silly question, basically i am new to servlets.
>> > > > Now If my understanding is right , I will need to create a
>> > > servlet/wrapper
>> > > > that would listen the user facing queries and then pass the
>> processed
>> > > text
>> > > > to solr request handler and I need to pack this servlet class file
>> > into
>> > > > Solr
>> > > > war file.
>> > > >
>> > > > But How would I ensure that my servlet is called instead of solr
>> > request
>> > > > handler?
>> > > >
>> > > >
>> > > > > On Mon, May 12, 2008 at 8:36 PM, Umar Shah <[EMAIL PROTECTED]>
>> > wrote:
>> > > > >
>> > > > > > On Mon, May 12, 2008 at 2:50 PM, Koji Sekiguchi <
>> > [EMAIL PROTECTED]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Umar,
>> > > > > > >
>> > > > > > > You may be able to preprocess your request parameter in your
>> > > > > > > servlet filter. In the doFilter() method, you do:
>> > > > > > >
>> > > > > > > ServletRequest myRequest = new MyServletRequestWrapper(
>> request
>> > );
>> > > > > >
>> > > > > >
>> > > > > > Thanks for your response,
>> > > > > >
>> > > > > > Where is the ServletRequest class , I am using Solr 1.3 trunk
>> code
>> > > > > > found SolrServlet, but it is deprecated, which class can I use
>> > > instead
>> > > > > of
>> > > > > > SolrRequest in 1.3 codebase?
>> > > > > >
>> > > > > >
>> > > > > > I also tried overloading the standard request handler. How do I
>> > > > > > rewrite query params there?
>> > > > > >
>> > > > > > Can you point me to some documentation?
>> > > > > >
>> > > > > >
>> > > > > > >   :
>> > > > > > > chain.doFilter( myRequest, response );
>> > > > > > >
>> > > > > > > And you have MyServletRequestWrapper that extends
>> > > > > ServletRequestWrapper.
>> > > > > > > Then you can get|set q* parameters through getParameter()
>> > method.
>> > > > > > >
>> > > > > > > Hope this helps,
>> > > > > > >
>> > > > > > > Koji
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > Umar Shah wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > Due some requirement I need to transform the user queries
>> > before
>> > > > > > passing
>> > > > > > > > it
>> > > > > > > > to the standard handler in Solr,  can anyone suggest me the
>> > best
>> > > > way
>> > > > > > to
>> > > > > > > > do
>> > > > > > > > this.
>> > > > > > > >
>> > > > > > > > I will need to use a transfomation class that would provide
>> > > > > functions
>> > > > > > to
>> > > > > > > > process the input query 'qIn' and transform it to t

RE: How Special Character '&' used in indexing

2008-05-14 Thread Steven A Rowe
Hi Ricky,

Mike and wunder, neither of whom are newbies, were trying to educate you about 
this mailing list's etiquette (generally accepted rules of conduct).  While you 
may *think* that two regular correspondents on this list are just hassling you, 
I expect that you will find it difficult to continue obtaining assistance here 
should you persist in ignoring list members' advice about this matter.

BTW, I completely agree with Mike and wunder that these two carry the same
annoying weight:

  Please kindly reply ASAP.
  Reply ASAP.

Saying "please kindly" in front of something that is not considered acceptable 
does not make it more acceptable.

Steve

On 05/14/2008 at 4:07 AM, Ricky wrote:
> I dont know whats your problem when the people who had
> answered my question
> had no issues. Please dont Spam anymore !
> 
> Thanks,
> Ricky.
> 
> On Tue, May 13, 2008 at 11:09 AM, Walter Underwood
> <[EMAIL PROTECTED]> wrote:
> 
> > "ASAP" means "As Soon As Possible", not "As Soon As Convenient".
> > Please don't say that if you don't mean it. --wunder
> > 
> > On 5/12/08 6:48 AM, "Ricky" <[EMAIL PROTECTED]> wrote:
> > 
> > > Hi Mike,
> > > 
> > > Thanx for your reply. I have got the answer to the question posted.
> > > 
> > > I know people are donating time here. ASAP doesnt mean that am
> > > demanding them to reply fast. Please read the lines before you comment
> > > something(*Please kindly* reply ASAP). Am a newbie and with curiosity
> > > i have requested to answer. I dont know if it has hurt you(Am sorry
> > > for that)
> > > 
> > > Thanks,
> > > Ricky.
> > > 
> > > 
> > > On Fri, May 9, 2008 at 3:30 PM, Mike Klaas <[EMAIL PROTECTED]>
> > > wrote:
> > > 
> > > > 
> > > > On 9-May-08, at 6:26 AM, Ricky wrote:
> > > > 
> > > > > I have tried sending the '&amp;' instead of '&' like the following,
> > > > > <field name="company">A &amp; K Inc.</field>
> > > >
> > > > > But i still get the same error "entity reference name can not
> > > > > contain character ' <field name="company">A & ..."
> > > > > 
> > > > 
> > > > Please use a library for doing xml encoding--there is
> absolutely no
> > reason
> > > > to do this yourself.
> > > > 
> > > >  Please kindly reply ASAP.
> > > >  
> > > > 
> > > > Please also realize that people responding here are donating their
> > > > time and that it is inappropriate to ask for an expedited response.
> > > > 
> > > > -Mike
> > > > 
> > > > 
> > 
> > 
>

 



Re: Loading performance slowdown at ~ 400K documents

2008-05-14 Thread David Pratt
Hi Tracy. I appreciate your taking the time to provide this. Overall, it 
is helpful to see comparative information for boosting performance. Many 
thanks.


Regards
David

Tracy Flynn wrote:

David

The main content organization I index is some number of articles 
existing under a common title.


I have three SOLR instances containing:

- Instance 1 - All 'live' articles ~ 750K articles - 3-4KB each
- Instance 2 - All 'live' titles' - ~ 95K titles - < 1 KB each
- Instance 3 - All articles and titles ~ 1.2mm articles + titles

I create Instance 1 and Instance 2 to provide fast response for heavy 
query usage on 'live'  articles and 'live' titles.  I use Instance 3 for 
all low-volume, complex queries.


(All above as preamble)

My current JVM settings are

- Instance 1  - -Xms256m -Xmx2000m
- Instance 2 - -Xms256m -Xmx1000m
- Instance 3 - -Xms256m -Xmx2000m

I'm in the middle of tuning the application. These values reflect 
optimization for document indexing.  Haven't looked at the query side yet.


Notes

I'm using 'top' to look at process sizes (Redhat 4.x, 4 GB Xeon Dual core)

For instance 1, I could probably  get away with  -Xmx1000m - but I think 
it's just a matter of (a short) time until I need to increase that limit.
For instance 2, it currently runs in steady state at 1.2 - 1.4 GB max, 
so I boosted to 2 GB max.


Regards,

Tracy

On May 11, 2008, at 8:31 AM, David Pratt wrote:

Hi Tracy. Can you advise the sort of difference in max heap space that 
resulted in the improvement, that is, your before and after max heap 
space. Many thanks.


Regards,
David

Tracy Flynn wrote:

Thanks for the replies.
For a completely different reason, I happened to look at the memory 
stats for all processes including the SOLR instances. Noticed that 
the SLOW Solr instance was maxing out with more virtual memory than 
allocated. After boosting the maximum heap space and restarting, 
everything started to run at 4x-5x the speed before the fix - and at 
the rate I reasonably thought it should.

Tracy
On May 9, 2008, at 8:02 AM, Tracy Flynn wrote:

Hi,

I'm starting to see significant slowdown in loading performance 
after I have loaded about 400K documents.  I go from a load rate of 
near 40 docs/sec to 20-25 docs a second.


Am I correct in assuming that, during indexing operations,
Lucene/SOLR tries to hold as much of the indexes in memory as
possible? If so, does the slowdown indicate a need to increase JVM
heap space?


Any ideas / help would be appreciated

Regards,

Tracy

- 



Details

Documents loaded as XML via POST command in batches of 1000, commit 
after each batch


Total current documents ~ 450,000
Avg document size: 4KB
One indexed text field contains 3KB or so. (body field below - 
standard type 'text')


Dual XEON 3 GHZ 4 GB memory

SOLR JVM Startup options

java -Xms256m -Xmx1000m  -jar start.jar


Relevant portion of the schema follows


 stored="true" required="true"/>
 required="false"/>
 required="false"/>
 
 stored="true" required="false" default="0"/>
 stored="true" required="true"/>
 required="false"/>
 required="false" compressed="true"/>
 required="false"/>
 stored="true" required="false" default="0"/>
 required="false"/>
 required="false" default="0"/>
 stored="true" required="false" default="0"/>
 required="false" default="0"/>
 required="false"/>
 required="false"/>
 stored="true" required="false" multiValued="true"/>
 required="false" default="0"/>
 stored="true" required="false" default="0"/>
 stored="true" required="false"/>
 stored="true" required="false" multiValued="true"/>
 stored="true" required="false"/>
 required="false" default="0"/>
 stored="true" required="false"/>
 required="false" default="0"/>
 stored="true" required="false"/>
 stored="true" required="false"/>
 indexed="true" stored="true" required="false"/>

  
 required="false" />












Re: indexing pdf documents

2008-05-14 Thread Brian Carmalt
Hello Cam,

The wiki for RichDocuments explains how you can add meta data to the
RDUpdater.  
http://wiki.apache.org/solr/UpdateRichDocuments

I have used the patch to index docs and their metadata, but it was not
exactly what we needed. 

Brian. 

Am Mittwoch, den 14.05.2008, 12:38 +0300 schrieb Cam Bazz:
> Hello Elizabeth;
> 
> Yes, I have PDF files, and metadata about them already extracted.
> 
> so I need something like:
>
> <add><doc>
>   <field name="...">someone</field>
>   <field name="...">content of my pdf file</field>
> </doc></add>
> 
> it seems that the updaterichdocument patch can only accept pdfs in raw form
> - so it is not possible to feed metadata.
> 
> Have you found a solution other than to manually convert pdf into txt then
> forming xmls?
> 
> Best Regards,
> -C.B.
> 
> On Tue, May 13, 2008 at 4:15 PM, Bess Sadler <[EMAIL PROTECTED]> wrote:
> 
> > C.B., are you saying you have metadata about your PDF files (i.e., title,
> > author, etc) separate from the PDF file itself, or are you saying you want
> > to extract that information from the PDF file? The first of these is pretty
> > easy, the second of these can be difficult or impossible, depending on how
> > your PDF file was generated and how consistent your files are.
> >
> > It's a bit of a hack, but I've had great success in the past with using
> > XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files, and
> > then pointing solr at the resulting lucene index.  It's worth checking to
> > see if this would do the trick for you.
> >
> > Bess
> >
> > Elizabeth (Bess) Sadler
> > Research and Development Librarian
> > Digital Scholarship Services
> > Box 400129
> > Alderman Library
> > University of Virginia
> > Charlottesville, VA 22904
> >
> >
> > On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
> >
> > > yes, I have seen the documentation on RichDocumentRequestHandler at the
> > > http://wiki.apache.org/solr/UpdateRichDocuments page.
> > > However, from what I understand this just feeds documents to solr. How
> > > can I
> > > construct something like: document_id, document_name, document_text and
> > > feed
> > > it in. (i.e. my documents have labels)
> > >
> > > Best.
> > > -C.B.
> > >
> > > On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
> > >
> > >  Solr does not have this support built in, but there's a patch for it:
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-284
> > > >
> > > > On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > >  Before making a little program to extract the txt from my pdfs and
> > > > > feed
> > > > >
> > > > it
> > > >
> > > > >  into solr with xml, I just wanted to check if solr has capability
> > > > > to
> > > > >
> > > > digest
> > > >
> > > > >  pdf files apart from xml?
> > > > >
> > > > >  Best Regards,
> > > > >  -C.B.
> > > > >
> > > > >
> > > >
> >
> >
> >



Re: Differences between nightly builds

2008-05-14 Thread Lucas F. A. Teixeira

Thanks Otis!

[]s,

Lucas


Lucas Frare A. Teixeira
[EMAIL PROTECTED] 
Tel: +55 11 3660.1622 - R3018



Otis Gospodnetic wrote:

Lucas,

Look at the solr svn repository's root and you will see a file called
CHANGES.txt.  That contains all major Solr changes back to January 2006.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
  

From: Lucas F. A. Teixeira <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, May 13, 2008 6:59:55 AM
Subject: Differences between nightly builds

Hello,

Here we use a nightly build from Aug '07. It's what we need, with some
bugs that we've worked on.
I want to change this to a newer nightly build, but as this one is 'stable',
people are afraid of changing to an 'unknown' build.


Is there some place where I can find all the changes between some date (my
Aug '07) and nowadays? Maybe with this I can change their minds!


Thank you.

[]s,


--
Lucas Frare A. Teixeira
[EMAIL PROTECTED] 
Tel: +55 11 3660.1622 - R3018
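
As a footnote to the CHANGES.txt pointer above, subversion itself can list
every commit between two dates; a sketch, assuming the Solr trunk URL of the
time:

    svn log -r {2007-08-01}:HEAD http://svn.apache.org/repos/asf/lucene/solr/trunk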





  


Re: Loading performance slowdown at ~ 400K documents

2008-05-14 Thread Tracy Flynn

David

The main content organization I index is some number of articles  
existing under a common title.


I have three SOLR instances containing:

- Instance 1 - All 'live' articles ~ 750K articles - 3-4KB each
- Instance 2 - All 'live' titles' - ~ 95K titles - < 1 KB each
- Instance 3 - All articles and titles ~ 1.2mm articles + titles

I create Instance 1 and Instance 2 to provide fast response for heavy  
query usage on 'live'  articles and 'live' titles.  I use Instance 3  
for all low-volume, complex queries.


(All above as preamble)

My current JVM settings are

- Instance 1  - -Xms256m -Xmx2000m
- Instance 2 - -Xms256m -Xmx1000m
- Instance 3 - -Xms256m -Xmx2000m

I'm in the middle of tuning the application. These values reflect  
optimization for document indexing.  Haven't looked at the query side  
yet.


Notes

I'm using 'top' to look at process sizes (Redhat 4.x, 4 GB Xeon Dual  
core)


For instance 1, I could probably  get away with  -Xmx1000m - but I  
think it's just a matter of (a short) time until I need to increase  
that limit.
For instance 2, it currently runs in steady state at 1.2 - 1.4 GB max,  
so I boosted to 2 GB max.


Regards,

Tracy

On May 11, 2008, at 8:31 AM, David Pratt wrote:

Hi Tracy. Can you advise the sort of difference in max heap space  
that resulted in the improvement, that is, your before and after max  
heap space. Many thanks.


Regards,
David

Tracy Flynn wrote:

Thanks for the replies.
For a completely different reason, I happened to look at the memory  
stats for all processes including the SOLR instances. Noticed that  
the SLOW Solr instance was maxing out with more virtual memory than  
allocated. After boosting the maximum heap space and restarting,  
everything started to run at 4x-5x the speed before the fix - and  
at the rate I reasonably thought it should.

Tracy
On May 9, 2008, at 8:02 AM, Tracy Flynn wrote:

Hi,

I'm starting to see significant slowdown in loading performance  
after I have loaded about 400K documents.  I go from a load rate  
of near 40 docs/sec to 20-25 docs a second.


Am I correct in assuming that, during indexing operations, Lucene/
SOLR tries to hold as much of the indexes in memory as possible?
If so, does the slowdown indicate a need to increase JVM heap space?


Any ideas / help would be appreciated

Regards,

Tracy

-

Details

Documents loaded as XML via POST command in batches of 1000,  
commit after each batch


Total current documents ~ 450,000
Avg document size: 4KB
One indexed text field contains 3KB or so. (body field below -  
standard type 'text')


Dual XEON 3 GHZ 4 GB memory

SOLR JVM Startup options

java -Xms256m -Xmx1000m  -jar start.jar


Relevant portion of the schema follows


 stored="true" required="true"/>
 required="false"/>
 stored="true" required="false"/>
 
 stored="true" required="false" default="0"/>
 stored="true" required="true"/>
 required="false"/>
 required="false" compressed="true"/>
 required="false"/>
 stored="true" required="false" default="0"/>
 required="false"/>
 required="false" default="0"/>
 stored="true" required="false" default="0"/>
 required="false" default="0"/>
 required="false"/>
 required="false"/>
 stored="true" required="false" multiValued="true"/>
 stored="true" required="false" default="0"/>
 stored="true" required="false" default="0"/>
 stored="true" required="false"/>
 stored="true" required="false" multiValued="true"/>
 stored="true" required="false"/>
 required="false" default="0"/>
 stored="true" required="false"/>
 stored="true" required="false" default="0"/>
 indexed="false" stored="true" required="false"/>
 indexed="true" stored="true" required="false"/>
 indexed="true" stored="true" required="false"/>

  
 stored="true" required="false" />









Re: indexing pdf documents

2008-05-14 Thread Cam Bazz
Hello Elizabeth;

Yes, I have PDF files, and metadata about them already extracted.

so I need something like:

<add><doc>
  <field name="...">someone</field>
  <field name="...">content of my pdf file</field>
</doc></add>

it seems that the updaterichdocument patch can only accept pdfs in raw form
- so it is not possible to feed metadata.

Have you found a solution other than to manually convert pdf into txt then
forming xmls?

Best Regards,
-C.B.
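
As a follow-up for anyone in the same spot: one way to avoid hand-built XML
entirely is to extract the text and post it through SolrJ, which escapes the
field values itself. A rough sketch, assuming PDFBox 0.7.x package names and
the SolrJ client from trunk; the field names are illustrative:

    import java.io.File;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfFeeder {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            PDDocument pdf = PDDocument.load(new File(args[0]));
            try {
                // Pull the plain text out of the PDF.
                String text = new PDFTextStripper().getText(pdf);

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", args[0]);        // document_id
                doc.addField("author", "someone");  // metadata you already have
                doc.addField("text", text);         // extracted body text
                solr.add(doc);
                solr.commit();
            } finally {
                pdf.close();
            }
        }
    }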

On Tue, May 13, 2008 at 4:15 PM, Bess Sadler <[EMAIL PROTECTED]> wrote:

> C.B., are you saying you have metadata about your PDF files (i.e., title,
> author, etc) separate from the PDF file itself, or are you saying you want
> to extract that information from the PDF file? The first of these is pretty
> easy, the second of these can be difficult or impossible, depending on how
> your PDF file was generated and how consistent your files are.
>
> It's a bit of a hack, but I've had great success in the past with using
> XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files, and
> then pointing solr at the resulting lucene index.  It's worth checking to
> see if this would do the trick for you.
>
> Bess
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
>
> On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
>
> > yes, I have seen the documentation on RichDocumentRequestHandler at the
> > http://wiki.apache.org/solr/UpdateRichDocuments page.
> > However, from what I understand this just feeds documents to solr. How
> > can I
> > construct something like: document_id, document_name, document_text and
> > feed
> > it in. (i.e. my documents have labels)
> >
> > Best.
> > -C.B.
> >
> > On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
> >
> >  Solr does not have this support built in, but there's a patch for it:
> > >
> > > https://issues.apache.org/jira/browse/SOLR-284
> > >
> > > On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello,
> > > >
> > > >  Before making a little program to extract the txt from my pdfs and
> > > > feed
> > > >
> > > it
> > >
> > > >  into solr with xml, I just wanted to check if solr has capability
> > > > to
> > > >
> > > digest
> > >
> > > >  pdf files apart from xml?
> > > >
> > > >  Best Regards,
> > > >  -C.B.
> > > >
> > > >
> > >
>
>
>


Re[2]: the time factor

2008-05-14 Thread JLIST
Hello Otis,

Got it. I'll take a look. Thanks for spending so much
time helping others out!

Jack

Tuesday, May 13, 2008, 9:06:18 PM, you wrote:

> Jack,

> The answer is: function queries! :)
> You can easily use function queries with DisMaxRequestHandler. 
> For example, this is what you can add to the dismax config section
> in solrconfig.xml:

>  <str name="bf">
>   recip(rord(addDate),1,1000,1000)^2.5
>  </str>

> Assuming you have an addDate field, this will give fresher
> document some boost.  Look for this on the Wiki, it's all there.

> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


> - Original Message 
>> From: JLIST <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, May 13, 2008 5:42:38 AM
>> Subject: the time factor
>> 
>> Hi,
>> 
>> I'm indexing news articles from a few news feeds.
>> With news, there's the factor of relevance and also the
>> factor of freshness. Relevance-only results are not satisfactory.
>> Sorting on feed update time is not satisfactory, either,
>> because one source may update more frequently than the
>> others and it tends to occupy the first rows most of
>> the time. I wonder what is the best way of combining the
>> time factor in news search?
>> 
>> Thanks,
>> Jack
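
A short worked note on the boost quoted above: recip(x,m,a,b) computes
a/(m*x + b), and rord(addDate) is 1 for the newest document. The freshest
document therefore contributes about 1000/1001 ≈ 1.0, the 1,000th-freshest
1000/2000 = 0.5, and the 10,000th about 1000/11000 ≈ 0.09, so recency decays
smoothly instead of hard-sorting by date.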





Re: How Special Character '&' used in indexing

2008-05-14 Thread Ricky
I dont know whats your problem when the people who had answered my question
had no issues. Please dont Spam anymore !

Thanks,
Ricky.

On Tue, May 13, 2008 at 11:09 AM, Walter Underwood <[EMAIL PROTECTED]>
wrote:

> "ASAP" means "As Soon As Possible", not "As Soon As Convenient".
> Please don't say that if you don't mean it. --wunder
>
> On 5/12/08 6:48 AM, "Ricky" <[EMAIL PROTECTED]> wrote:
>
> > Hi Mike,
> >
> > Thanx for your reply. I have got the answer to the question posted.
> >
> > I know people are donating time here. ASAP doesnt mean that am demanding
> > them to reply fast. Please read the lines before you comment
> something(*Please
> > kindly* reply ASAP). Am a newbie and with curiosity i have requested to
> > answer. I dont know if it has hurt you(Am sorry for that)
> >
> > Thanks,
> > Ricky.
> >
> >
> > On Fri, May 9, 2008 at 3:30 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> On 9-May-08, at 6:26 AM, Ricky wrote:
> >>
> >>  I have tried sending the '&amp;' instead of '&' like the following,
> >>> <field name="company">A &amp; K Inc.</field>
> >>>
> >>> But i still get the same error "entity reference name can not contain
> >>> character ' <field name="company">A & ..."
> >>>
> >>
> >> Please use a library for doing xml encoding--there is absolutely no
> reason
> >> to do this yourself.
> >>
> >>  Please kindly reply ASAP.
> >>>
> >>
> >> Please also realize that people responding here are donating their time
> >> and that it is inappropriate to ask for an expedited response.
> >>
> >> -Mike
> >>
> >>
>
>


RE: Duplicates results when using a non optimized index

2008-05-14 Thread Tim Mahy
Hi,

thanks for the answer,

- do duplicates go away after optimization is done?
--> no, even after the index is optimized we still get the duplicate results,
and the same happens if we search on one of the slave servers, which
have the same index through synchronization ...
btw this is the first time we have noticed this; the only thing we have had was
the known "too many open files" problem, which we fixed using ulimit
and a reboot of the tomcat server

- are the duplicate IDs that you are seeing IDs of previously deleted documents?
--> it is possible that these documents were uploaded earlier and have been
replaced...

- which Solr version are you using and can you try a recent nightly?
--> we use the 1.2 stable build

greetings,
Tim
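
For reference, the open-files fix mentioned above is typically a shell-level
change made before starting the container; a sketch with an illustrative
limit:

    ulimit -n 8192   # raise the per-process file-descriptor limit, then restart tomcat from this shell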

From: Otis Gospodnetic [EMAIL PROTECTED]
Sent: Wednesday, May 14, 2008 6:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Duplicates results when using a non optimized index

Hm, not sure why that is happening, but here is some info regarding other stuff 
from your email

- there should be no duplicates even if you are searching an index that is 
being optimized
- why are you searching an index that is being optimized?  It's doable, but 
people typically perform index-modifying operations on a Solr master and 
read-only operations on Solr query slave(s)
- do duplicates go away after optimization is done?
- are the duplicate IDs that you are seeing IDs of previously deleted documents?
- which Solr version are you using and can you try a recent nightly?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Tim Mahy <[EMAIL PROTECTED]>
> To: "solr-user@lucene.apache.org" 
> Sent: Tuesday, May 13, 2008 5:59:28 AM
> Subject: Duplicates results when using a non optimized index
>
> Hi all,
>
> is this expected behavior when having an index like this :
>
> numDocs : 9479963
> maxDoc : 12622942
> readerImpl : MultiReader
>
> which is in the process of being optimized, that when we search through the
> index we get this :
>
>
>   id: 15257559
>   id: 15257559
>   id: 17177888
>   id: 11825631
>   id: 11825631
>
> The id field is declared in the schema (its declaration markup was stripped
> by the archive), and it is set as the unique key like this in the schema.xml :
>
>   <uniqueKey>id</uniqueKey>
>
> so the question : is this expected behavior and if so is there a way to let 
> Solr
> only return unique documents ?
>
> greetings and thanx in advance,
> Tim
>
>
>
>
> Please see our disclaimer, http://www.infosupport.be/Pages/Disclaimer.aspx





Please see our disclaimer, http://www.infosupport.be/Pages/Disclaimer.aspx