Re: Structured Lucene documents

2007-10-12 Thread pgwillia

Hi All,

The Structured (or Multi-Page, Multi-Part) document problem is a problem
I've been thinking about for a while.  A couple of years ago when the
project I was working on was using Lucene only (no Solr), we solved this
problem in several steps.  At the point of ingestion we created a custom
analyzer and surrounding Java code that created a mapping for positions to
which page it is on (recall that analyzers tokenize the terms in a given
field and mark the position of the token).  This mapping was stored outside
of the Lucene index.  At query time, we used home built Java to pull the
position hits matching the query from the index and augmented the results
generated by Lucene.  At presentation time the results were molded into xml
and then transformed by several xsl sheets, one of which translated the
position hits to the page they were on using the information gleamed from
the ingestion stage. 

When we moved to Solr, we created a custom QueryResponseWriter in order to
get the position locations into the xml results and kept the same
transformation to obtain the page level hits.  The ingestion stage stays the
same -- so really we're using Lucene to build the index, but Solr sits on
top of it to serve results.

I admit this is an awkward hack.  Peter Binkley ([EMAIL PROTECTED])
who I worked with on the project made this suggested improvement:



> 
> "Paged-Text" FieldType for Solr
> 
> A chance to dig into the guts of Solr. The problem: If we index a
> monograph in Solr, there's no way to convert search results into
> page-level hits. The solution: have a "paged-text" fieldtype which keeps
> track of page divisions as it indexes, and reports page-level hits in the
> search results.
> 
> The input would contain page milestones: . As Solr
> processed the tokens (using its standard tokenizers and filters), it would
> concurrently build a structural map of the item, indicating which term
> position marked the beginning of which page:  firstterm="14324"/>. This map would be stored in an unindexed field in
> some efficient format.
> 
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page
> ids for each term position. The results would imitate the results for
> highlighting, something like:
> 
> 
> 
> 234
> 236
> 
> 
> 19
> 
> 
> 
> 
> 
> 14325
> 
> 
> ...
> 
> 
> We have some code that does something like this in a Lucene context, which
> could form the basis for a Solr fieldtype; but it would probably be just
> as easy to start fresh.
> 
> 

My current project would like to have some meta data about each sub-part of
the document also included.  For example: each page would have a url, and/or
a title associated with the content.  This becomes  meaningful when we index
things like newspapers and monographs which may have page, chapter, or
section level content.So a solution would ideally have taken this into
consideration.
 
Does anyone with more experience know if this is a reasonable approach? 
Does an issue exist for this feature request?  Other comments or questions?

Thanks,
Tricia


Pierre-Yves LANDRON wrote:
> 
> Hello,Is it possible to structure lucene documents via Solr, so one
> document coud fit into another one ?What I would like to do, for example
> :I want to retrieve full text articles, that fit on several pages for each
> of them. Results must take in account both the pages and the article from
> wich the search terms are from. I can create a lucene document for each
> pages of the article AND the article itself, and do two requests to get my
> results, but it would duplicate the full text in the index, and will not
> be too efficient. Ideally, what I would like to do is to create a document
> for indexing the text of each pages of the article, and group these
> documents in one document that describe the article : this way, when
> Lucene retrieve a requested term, i'll get the article and the page that
> contains the term.I wonder if there's a way to emulate elegantly this
> behavior with Solr ?Kind Regards,Pierre-Yves Landron
> 

-- 
View this message in context: 
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a13185053
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Will turning off the stored setting on a field remove it from the index?

2007-10-12 Thread Mike Klaas


On 12-Oct-07, at 4:39 PM, BrendanD wrote:

We have some fields that we're currently storing in the index (for  
example
product_name, short_description, etc). We'd like to stop storing  
them in the
index as we're going to start faulting them in from the database  
instead so

that the content is fresh.

If we change our config to stop storing them, when will they get  
removed
from the index? After the next commit? After an optimize? Or will  
we have to

rebuild the entire index from scratch?


The latter, I'm afraid.

Solr never modifies or implicitly changes existing documents due to  
config changes.


-Mike


Will turning off the stored setting on a field remove it from the index?

2007-10-12 Thread BrendanD

Hi,

We have some fields that we're currently storing in the index (for example
product_name, short_description, etc). We'd like to stop storing them in the
index as we're going to start faulting them in from the database instead so
that the content is fresh.

If we change our config to stop storing them, when will they get removed
from the index? After the next commit? After an optimize? Or will we have to
rebuild the entire index from scratch?

Thanks,

Brendan
-- 
View this message in context: 
http://www.nabble.com/Will-turning-off-the-stored-setting-on-a-field-remove-it-from-the-index--tf4616636.html#a13184863
Sent from the Solr - User mailing list archive at Nabble.com.



Re: dismax downweighting

2007-10-12 Thread Matthew Runo
would a dismax boost that's negative work?   ie.. name^-1  and   
description^-1 ?


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 12, 2007, at 1:13 PM, Brian Whitman wrote:

i have a dismax query where I want to boost appearance of the query  
terms in certain fields but "downboost" appearance in others.


The practical use is a field containing a lot of descriptive text  
and then a product name field where products might be named after a  
descriptive word. Consider an electric toothbrush called "The Fast  
And Thorough Toothbrush" -- if a user searches for fast toothbrush  
I'd like to down-weight that particular model's advantage. The name  
of the product might also be in the descriptive text.


I tried

 
-name description
 

but solr didn't like that.

Any better ideas?


--
http://variogr.am/







dismax downweighting

2007-10-12 Thread Brian Whitman
i have a dismax query where I want to boost appearance of the query  
terms in certain fields but "downboost" appearance in others.


The practical use is a field containing a lot of descriptive text and  
then a product name field where products might be named after a  
descriptive word. Consider an electric toothbrush called "The Fast  
And Thorough Toothbrush" -- if a user searches for fast toothbrush  
I'd like to down-weight that particular model's advantage. The name  
of the product might also be in the descriptive text.


I tried

 
-name description
 

but solr didn't like that.

Any better ideas?


--
http://variogr.am/





Re: solr not finding all results

2007-10-12 Thread Kevin Lewandowski
Sorry, I've figured out my own problem. There is a problem with the
way I create the xml document for indexing that was causing some of
the "comments" fields to not be listed correctly in the default search
field, "content".

On 10/12/07, Kevin Lewandowski <[EMAIL PROTECTED]> wrote:
> I've found an odd situation where solr is not returning all of the
> documents that I think it should. A search for "Geckoplp4-M" returns 3
> documents but I know that there are at least 100 documents with that
> string.
>
> Here is an example query for that phrase and the result set:
> http://localhost:9020/solr/select/?q=Geckoplp4-M&version=2.2&start=0&rows=10&indent=on&fl=comments,id
> 
> 
> 
>  0
>  0
>  
>   10
>   0
>   on
>   comments,id
>   Geckoplp4-M
>   2.2
>  
> 
> 
>  
>   Geckoplp4-M
>   m2816500
>  
>  
>   toptrax recordings. Same tracks.
> Geckoplp4-M
>   m2816544
>  
>  
>   Geckoplp4-M
>   m2815903
>  
> 
> 
>
> Now here's an example of a search for two documents that I know have
> that string, but were not returned in the previous search:
> http://localhost:9020/solr/select/?q=id%3Am2816615+OR+id%3Am2816611&version=2.2&start=0&rows=10&indent=on&fl=id,comments
> 
> 
> 
>  0
>  1
>  
>   10
>   0
>   on
>   id,comments
>   id:m2816615 OR id:m2816611
>   2.2
>  
> 
> 
>  
>   Geckoplp4-M
>   m2816611
>  
>  
>   Geckoplp4-M
>   m2816615
>  
> 
> 
>
> Here is the definition for the "comments" field:
> 
>
> And here is the definition for a "text" field:
> 
>   
>   
>   
>   
>generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>   
>   
>   
>   
>   
>   
>   
>synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>   
>generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
>   
>   
>   
>   
>   
> 
>
> Any ideas? Am I doing something wrong?
>
> thanks,
> Kevin
>


solr not finding all results

2007-10-12 Thread Kevin Lewandowski
I've found an odd situation where solr is not returning all of the
documents that I think it should. A search for "Geckoplp4-M" returns 3
documents but I know that there are at least 100 documents with that
string.

Here is an example query for that phrase and the result set:
http://localhost:9020/solr/select/?q=Geckoplp4-M&version=2.2&start=0&rows=10&indent=on&fl=comments,id



 0
 0
 
  10
  0
  on
  comments,id
  Geckoplp4-M
  2.2
 


 
  Geckoplp4-M
  m2816500
 
 
  toptrax recordings. Same tracks.
Geckoplp4-M
  m2816544
 
 
  Geckoplp4-M
  m2815903
 



Now here's an example of a search for two documents that I know have
that string, but were not returned in the previous search:
http://localhost:9020/solr/select/?q=id%3Am2816615+OR+id%3Am2816611&version=2.2&start=0&rows=10&indent=on&fl=id,comments



 0
 1
 
  10
  0
  on
  id,comments
  id:m2816615 OR id:m2816611
  2.2
 


 
  Geckoplp4-M
  m2816611
 
 
  Geckoplp4-M
  m2816615
 



Here is the definition for the "comments" field:


And here is the definition for a "text" field:

  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  


Any ideas? Am I doing something wrong?

thanks,
Kevin


Solr, operating systems and globalization

2007-10-12 Thread Jeff Rodenburg
We discovered and verified an issue in SolrSharp whereby indexing and
searching can be disrupted without taking Windows globalization & culture
settings into consideration.  For example, European cultures affect numeric
and date values differently from US/English cultures.  The resolution for
this type of issue is to specifically control the culture settings to allow
for index data formatting to work.

However, SolrSharp culture settings should be reflective and consistent with
the solr server instance's culture.  This leads to my question: does Solr
control its culture & language settings through the various language
components that can be incorporated, or does the underlying OS have a say in
how that data is treated?

Some education on this would be greatly appreciated.

cheers,
jeff r.


Re: Opensearch XSLT

2007-10-12 Thread Bill Fowler
There is a file ${SOLR_HOME}/conf/xslt/example_rss.xsl which is easily
modified to transform Solr's output to OpenSearch.  Works great, though
fixing the date format is a hassle.  The supported, searchable Solr date
format is not the OpensSearch standard.



On 10/12/07, Robert Young <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Does anyone know of an XSLT out there for transforming Solr's default
> output to Opensearch format? Our current frontend system uses
> opensearch so we would like to integrate it like this.
>
> Cheers
> Rob
>


Re: Opensearch XSLT

2007-10-12 Thread Walter Underwood
There is a request handler in 1.2 for Atom. That might be close.

OpenSearch was a pretty poor design and is dead now, so I wouldn't
expect any new implementations. Google's GData (based on Atom)
reuses the few useful OpenSearch elements needed for things
like number of hits. Solr's Atom support really should include
those.

http://code.google.com/apis/gdata/reference.html

wunder

On 10/12/07 4:59 AM, "Robert Young" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Does anyone know of an XSLT out there for transforming Solr's default
> output to Opensearch format? Our current frontend system uses
> opensearch so we would like to integrate it like this.
> 
> Cheers
> Rob



Opensearch XSLT

2007-10-12 Thread Robert Young
Hi,

Does anyone know of an XSLT out there for transforming Solr's default
output to Opensearch format? Our current frontend system uses
opensearch so we would like to integrate it like this.

Cheers
Rob


Re: quickie: do facetfields use same cached items in field cache as FQ-param?

2007-10-12 Thread Britske

as a related question: is here a way to inspect the queries currently in the
filtercache?


Britske wrote:
> 
> Yeah i meant filter-cache, thanks. 
> It seemed that the particular field (cityname) was using a
> keywordtokenizer (which doens't show at the front) which is why i missed
> it i guess :-S. This means the term field is tokenized so
> termEnums-apporach is used. This results in about 10.000 inserts on
> facet.field=cityname on a cold searcher, which matches the nr of different
> terms in that field. At least that explains that. 
> 
> So if I understand correctly if I use that same field in a FQ-param, say
> fq=cityname:amsterdam and amsterdam is a term of field cityname, than the
> FQ-query can utilize the cached 'query': cityname:amsterdam which is
> already put into the filtercache by the query facet.field=cityname right?
> 
> The thing that I still don't get is why my filtercache starts to have
> evictions although it's size is 16.000+.  This shouldn't be happing given
> that:
> I currently only use faceting on cityname and use this field on FQ as
> well, as already said (which adds +/- 1 items to the filtercache,
> given that faceting and fq share cached items). 
> Moreover i use FQ on about 2500 different fields (named _ddp*), but only
> check to see if a value exists by doing for example: fq=_ddp1234:[* TO *].
> I sometimes add them together like so: fq=_ddp1234:[* TO *]
> &fq=_ddp2345:[* TO *]. But never like so: fq=_ddp1234:[* TO *]
> +_ddp2345:[* TO *]. Which means each _ddp*-field is only added once to the
> filtercache. 
> 
> Wouldn't this mean that at a maximum I can only have 12500 items in the
> filtercache?
> Still my filtercache starts to have evictions although it's size is
> 16.000+. 
> 
> What am I missing here?
> Geert-Jan
> 
> 
> hossman wrote:
>> 
>> 
>> : ..fq=country:france
>> : 
>> : do these queries share cached items in the fieldcache? (in this
>> example:
>> : country:france) or do they somehow live as seperate entities in the
>> cache?
>> : The latter would explain my fieldcache having evictions at the moment.
>> 
>> FieldCache can't have evicitions.  it's a really low level "cache" where 
>> the key is field name and the value is an array containing a value for 
>> every document (you cna think of it as an inverted-inverted-index) that 
>> Lucene maintains directly.  items are never removed they just get garbage 
>> collected when the IndexReader is no longer used.  It's primarily for 
>> sorting, but the SimpleFacets code also leveragies it for facets in some 
>> cases -- Solr has no way of showing you what's in the FieldCache, because 
>> Lucene doesn't expose any inspection APIs to query it (it's a heisenberg 
>> cache .. once you ask if something is in it, it's in it)
>> 
>> are you refering to the "filterCache" ?  
>> 
>> filterCache contains records whose key is a "query" and whose value is a 
>> DocSet (an unordered collection of all docs matching a query) ... it's 
>> used whenever you use an "fq" param, for faceting on some fields (when
>> the 
>> TermEnum method is used, a filterCache entry is added for each term 
>> tested), and even for some sorted queries if the 
>>  config option is set to true.
>> 
>> the easiest way to know whether your faceting is using the FieldCache is 
>> to start your server cold (no newSearcher warming) and then send it a 
>> simple query with a single facet.field.  depending on the query, you
>> might 
>> get 0 or 1 entries in the filterCache if SimpleFacets is using the 
>> FieldCache -- but if it's using the TermEnums, and generating a DocSet
>> per 
>> term, you'llsee *lots* of inserts into the filterCache.
>> 
>> 
>> 
>> -Hoss
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/quickie%3A-do-facetfields-use-same-cached-items-in-field-cache-as-FQ-param--tf4609795.html#a13170530
Sent from the Solr - User mailing list archive at Nabble.com.