Getting the offset of search keyword in a document

2010-07-24 Thread Ryan Chan
Hello,

I am new to Solr/Lucene and I am evaluating whether they suit my needs and
could replace our in-house system.


Our requirements:

1. I have many documents (1M).
2. Each document contains text ranging from a few KB to a few MB.
3. I want to search for a keyword across all of these documents and
get back the matching document(s), AND ALSO the offset of that
'keyword' inside each document.

Is requirement 3 possible?
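
To make requirement 3 concrete, here is a minimal Python sketch of the output I am after. It is a brute-force scan over in-memory strings, not Solr or Lucene code, and the function name and sample documents are made up for illustration.

```python
import re

def find_keyword_offsets(text, keyword):
    """Return the character offset of every case-insensitive match of keyword in text."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

# Hypothetical sample data standing in for the real document collection.
documents = {
    "doc1": "Solr is built on Lucene. Lucene is a search library.",
    "doc2": "This document does not mention the keyword.",
}

# Map each matching document id to the offsets of the keyword inside it.
hits = {}
for doc_id, text in documents.items():
    offsets = find_keyword_offsets(text, "lucene")
    if offsets:
        hits[doc_id] = offsets
```

So the result I want is the matching document ids together with a list of offsets per document, rather than the document ids alone.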


Re: Tree Faceting in Solr 1.4

2010-07-24 Thread SR
Hi Geert-Jan,

What did you mean by this: 

 Also, just a suggestion, consider using id's instead of names for filtering;

Thanks,
-S

Re: a bug of solr distributed search

2010-07-24 Thread MitchK

Okay, but then LiLi did something wrong, right?

I mean, if the document exists only at one shard, it should get the same
score whenever one requests it, no?
Of course, this only applies if nothing gets changed between the requests.
The only remaining problem here would be that you need distributed IDF
(as in the mentioned JIRA issue) to normalize your results' scoring.

But the problem mentioned in this mailing-list posting has nothing to do
with that...

Regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p991907.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud in production?

2010-07-24 Thread Andrew Clegg

Is anyone using ZooKeeper-based Solr Cloud in production yet? Any war
stories? Any problematic missing features?

Thanks,

Andrew.



Re: Tree Faceting in Solr 1.4

2010-07-24 Thread Geert-Jan Brits
Perhaps completely unnecessary when you have a controlled domain, but I
meant to use ids for places instead of names, because names will quickly
become ambiguous, e.g. there are numerous different places around the world
called Washington, etc.



RE: Novice seeking help to change filters to search without diacritics

2010-07-24 Thread Steven A Rowe
Hi HSingh,

Usually people set up two fields, one with diacritics and one without.  Then 
searches are against both fields.  If you think a match against the field with 
diacritics is more valuable, you can give that field a boost.
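
As a rough illustration of the idea, here is a minimal Python sketch. It is only a client-side stand-in: in Solr the folding would normally happen in the field's analysis chain rather than in application code. The field names `name` and `name_folded` and the boost value are assumptions made up for this example; `qf` is the dismax parameter that lists the fields to search and their boosts.

```python
import unicodedata

def fold_diacritics(text):
    """Strip combining marks after NFKD decomposition, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_document(doc_id, name):
    # 'name' keeps the diacritics; 'name_folded' is the copy without them.
    return {"id": doc_id, "name": name, "name_folded": fold_diacritics(name)}

def build_query_params(user_query):
    # Search both fields; the field with diacritics gets the higher boost.
    return {"defType": "dismax", "q": user_query, "qf": "name^2.0 name_folded"}

doc = build_document("1", "Škoda Citigo café")
params = build_query_params("cafe")
```

A query typed without diacritics can then still match through the folded field, while an exact match on the original field scores higher.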

Steve

 -Original Message-
 From: HSingh [mailto:hsin...@gmail.com]
 Sent: Friday, July 23, 2010 5:20 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Novice seeking help to change filters to search without
 diacritics
 
 
 Hi Steve,  This is extremely helpful!  What is the best way to also
 preserve/append the diacritics in the index in case someone searches using
 them?  I deeply appreciate your help!


Re: Performance issues when querying on large documents

2010-07-24 Thread dc tech
Are you storing the full 1,000 pages in the index? If so, that is
probably not helping either.

On 7/23/10, ahammad ahmed.ham...@gmail.com wrote:

 Hello,

 I have an index with lots of different types of documents. One of those
 types basically contains extracts of PDF docs. Some of those PDFs can have
 1000+ pages, so there would be a lot of stuff to search through.

 I am experiencing really terrible performance when querying. My whole index
 has about 270k documents, but fewer than 1000 of those are the PDF extracts.
 The slow querying occurs when I search only on those PDF extracts (by
 specifying filters), and return 100 results. Returning 100 results definitely adds
 to the issue, but even cutting that down can be slow.

 Is there a way to improve querying with such large results? To give an idea,
 querying for a single word can take a little over a minute, which isn't
 really viable for an application that revolves around searching. For now, I
 have limited the results to 20, which makes the query execute in roughly
 10-15 seconds. However, I would like to have the option of returning 100
 results.

 Thanks a lot.




RE: Tree Faceting in Solr 1.4

2010-07-24 Thread Jonathan Rochkind
 Perhaps completely unnecessary when you have a controlled domain, but I
 meant to use ids for places instead of names, because names will quickly
 become ambiguous, e.g. there are numerous different places around the world
 called Washington, etc.

This is related to something I've been thinking about. Okay, say you use IDs
instead of names. Now, you've got to translate those IDs to names before you
display them, of course.

One way to do that would be to keep the id-to-name lookup in some non-Solr
store (an RDBMS or a NoSQL store).

Is that what you'd do? Is there any non-crazy way to do that without an
external store, just with Solr?  Any way to do it with term payloads? Anything
else?

Jonathan

Re: Tree Faceting in Solr 1.4

2010-07-24 Thread Stefan Moises

Hi Jonathan,

I too am using IDs instead of names, one reason being that URLs are
easier to read and safer, because special characters in names
could break the URLs, etc.
I am keeping the id-to-name lookups in Solr though: I just use some
lookup fields where I put the id and name into one field, separated by
some fixed delimiter, e.g.

134982__Some name I am going to lookup later
The separator here would be two underscores (__).
So I can query for that lookup field, extract the id and name, and store them
in an array or something to look them up in my (PHP) frontend.
If you don't have too many different values you could also map
id-to-name in a simple text file (as suggested in the Solr book, e.g.).
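
The splitting step can be sketched roughly as follows. This is Python rather than the PHP of my actual frontend, and the function and variable names are made up for illustration. It assumes the delimiter never occurs inside the id part.

```python
DELIMITER = "__"

def split_lookup_value(value, delimiter=DELIMITER):
    """Split 'id__name' into (id, name) at the first delimiter only."""
    item_id, _, name = value.partition(delimiter)
    return item_id, name

# Values as they would come back from the lookup field.
lookup_values = ["134982__Some name I am going to lookup later"]

# Build the id-to-name map used when rendering results.
id_to_name = dict(split_lookup_value(v) for v in lookup_values)
```

Splitting at the first delimiter only means a name that itself contains two underscores is kept intact.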


Cheers,
Stefan





--
***
Stefan Moises
Senior Softwareentwickler

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***



RE: Tree Faceting in Solr 1.4

2010-07-24 Thread Jonathan Rochkind
 I am keeping the id-to-name lookups in Solr though: I just use some
 lookup fields where I put the id and name into one field, separated by
 some fixed delimiter, e.g.
 134982__Some name I am going to lookup later
 The separator here would be two underscores (__).
 So I can query for that lookup field, extract the id and name, and store them
 in an array or something to look them up in my (PHP) frontend.

Interesting, thanks. Do you use a prefix query, then, to find that value?

I'm still confused about how this would work. Does each of your documents have
only one ID?

In my case, it's more like the geographic hierarchical stuff this thread began
with. The ID is not the document's ID; it's the ID of what is essentially a facet
value, which can be multi-valued (or maybe even hierarchical).

  Document X:
   * United States
   * China
  Document Y:
   * China
   * Russia

If we turn those actual values into ID__label strings, it gets confusing
how to query for them, ESPECIALLY if we try to introduce the hierarchy into it.

Document X:
* 1234__United States/677__Michigan/987__Detroit

I think actually trying to store things like that would break either of the
techniques on the wiki page about hierarchical faceting.

Maybe an external store is really the only way to go that doesn't turn into a 
mess. 
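
A minimal Python sketch of what the external-store approach might look like: only ID paths go into the index, and labels are resolved at display time. The dict stands in for whatever RDBMS or key-value store is used, the function names are hypothetical, and expanding a path into its ancestor prefixes is just one assumed way of indexing a hierarchy, not necessarily what the wiki page describes.

```python
# Hypothetical external id-to-label store; a dict stands in for a real database.
LABELS = {"1234": "United States", "677": "Michigan", "987": "Detroit"}

def path_prefixes(id_path, separator="/"):
    """Expand '1234/677/987' into every ancestor path, one token per hierarchy level."""
    parts = id_path.split(separator)
    return [separator.join(parts[:i]) for i in range(1, len(parts) + 1)]

def display_label(id_path, labels=LABELS, separator="/"):
    """Translate an ID path into a readable label using the external store."""
    return " > ".join(labels[part] for part in id_path.split(separator))

# Tokens that would be indexed for Document X, containing IDs only.
tokens = path_prefixes("1234/677/987")

# Label shown to the user, resolved outside the index.
label = display_label("1234/677/987")
```

The index then never contains labels at all, so queries and facets deal with IDs only.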
   

Re: Tree Faceting in Solr 1.4

2010-07-24 Thread Geert-Jan Brits
I believe we use an in-process WeakHashMap to store the id-name
relationship. It's not as if we're talking about billions of values here.
For anything more memory-intensive we use NoSQL (Tokyo Tyrant through
the memcached protocol at the moment).



Re: SolrCloud in production?

2010-07-24 Thread Dennis Gearon
Boy, if it does what it says it does, it's really a powerful tool. 

How is such a thing hosted, I wonder? 

Dennis Gearon


 


Which is a good XPath generator?

2010-07-24 Thread Savannah Beckett
Hi,
  I am looking for an XPath generator that can generate an XPath expression when I pick a
specific tag inside an HTML document.  Do you know a good XPath generator?  If possible,
a free XPath generator would be great.
Thanks.
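
To show the kind of output I mean, here is a minimal Python sketch using only the standard library. It assumes well-formed markup (real-world HTML would need an HTML parser first), and the function name and sample page are made up for illustration.

```python
import xml.etree.ElementTree as ET

def xpath_for(root, target):
    """Return an absolute XPath such as /html/body/div[2]/p for target under root."""
    # ElementTree has no parent pointers, so build a child-to-parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        step = node.tag
        if parent is not None:
            # Add a 1-based position only when siblings share the same tag.
            same_tag = [c for c in parent if c.tag == node.tag]
            if len(same_tag) > 1:
                step += "[%d]" % (same_tag.index(node) + 1)
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))

page = ET.fromstring(
    "<html><body><div><p>a</p></div><div><p>b</p><p>c</p></div></body></html>"
)
picked = page.findall(".//p")[2]  # the element the user "picked"
path = xpath_for(page, picked)
```

A tool that does this interactively, by clicking on an element in a rendered page, is what I am after.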


  

Re: Performance issues when querying on large documents

2010-07-24 Thread Erick Erickson
What are you returning? I'd be quite surprised if the search itself were the slow part, so
first I'd look elsewhere. In particular, are you returning all 1,000 pages?
What happens if you specify returning only a small field (the fl= parameter)?

Also, look at the debug output of the query; it breaks down the various
phases of the query processing, and that might give you a hint.

If none of that does the trick, please post the query and the relevant parts
of your schema as well as debug output...
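
For example, a request along these lines would return only a small field and include the debug section. This is a minimal Python sketch that just builds the URL; the host, the query, the `type:pdf` filter, and the `id` field name are assumptions to be adapted to your schema.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, filter_query, rows=100):
    """Build a Solr select URL that returns only the id field plus debug output."""
    params = [
        ("q", query),
        ("fq", filter_query),   # restrict to the PDF extracts
        ("fl", "id"),           # return a small field instead of the full text
        ("rows", rows),
        ("debugQuery", "on"),   # include timing and scoring details
    ]
    return base_url + "/select?" + urlencode(params)

url = build_select_url("http://localhost:8983/solr", "keyword", "type:pdf")
```

If this comes back quickly, the time is going into fetching and sending the large stored fields rather than into the search itself.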

Best
Erick




RE: Novice seeking help to change filters to search without diacritics

2010-07-24 Thread HSingh


: Usually people set up two fields, one with diacritics and one without.
: Then searches are against both fields.  If you think a match against the field
: with diacritics is more valuable, you can give that field a boost.

Hi Steve, where can one set up these two fields?  Thank you for your kind
assistance!