Advanced search with results matrix

2012-05-04 Thread Gnanakumar
Hi,

First off, we're happy users of Apache Solr v3.1 Enterprise search server,
which is integrated and running successfully on our live production server.

We are now enhancing the existing search feature in our web application, as
explained below, to help application users make an informed decision before
getting their search results.

There will be 3 textboxes, and users can enter keyword phrases combined with
OR and AND within each textbox, for example:
Textbox 1: SQL Server OR SQL
Textbox 2: Visual Basic OR VB.NET
Textbox 3: Java AND JavaScript

When the user clicks the Search button, we want to present an intermediate
results matrix page that lists all possible combinations of the 3
textboxes, with the number of records found for each combination, as shown
below (combinations are joined with AND):
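+---------+-------------------+------------------------+---------------------+
| Matches | Textbox 1         | Textbox 2              | Textbox 3           |
+---------+-------------------+------------------------+---------------------+
|   200   | SQL Server OR SQL |                        |                     |
|   300   |                   | Visual Basic OR VB.NET |                     |
|   400   |                   |                        | Java AND JavaScript |
|   250   | SQL Server OR SQL | Visual Basic OR VB.NET |                     |
|   350   |                   | Visual Basic OR VB.NET | Java AND JavaScript |
|   300   | SQL Server OR SQL |                        | Java AND JavaScript |
|   100   | SQL Server OR SQL | Visual Basic OR VB.NET | Java AND JavaScript |
+---------+-------------------+------------------------+---------------------+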
Only clicking one of these match counts will display the actual results of
that particular search.

My questions are:
1) Do I need to run a search separately for each combination, or is it
possible to obtain the results matrix page with a single call to Apache
Solr?  Or are there any plug-ins available that provide functionality
close to my use case?
2) How do I instruct Solr to return only the count (not the results) for the
search performed?
3) Any ideas, suggestions, approaches, or resources would be really
appreciated.

Regards,
Gnanam




Re: Advanced search with results matrix

2012-05-04 Thread David Radunz

Hey Gnanam,

1. If I understand correctly, you just need to perform one query, like so
(translated to proper syntax, of course):
  (SQL Server OR SQL) OR (Visual Basic OR VB.NET) OR (Java AND JavaScript)
2. Every query you perform with Solr returns the result count. If you
ONLY want the count, simply set rows to 0 (but I'm guessing you will want
both the results and the count, to avoid two trips).
  - The result count is the numFound attribute here:
    <result name="response" numFound="0" start="0"/>
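
A minimal sketch of the count-only request described in point 2, assuming a Solr instance at a placeholder URL and the standard XML response format. The helper names and the sample response body are illustrative, not taken from the thread:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def count_only_url(base_url, query):
    """Build a select URL that asks for the hit count but no documents."""
    params = {"q": query, "rows": 0}
    return base_url + "?" + urlencode(params)


def num_found(xml_text):
    """Read numFound from the <result> element of a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    return int(result.get("numFound"))


# Hypothetical response body, shaped like Solr's XML response.
sample = '<response><result name="response" numFound="200" start="0"/></response>'

# Placeholder host and port; adjust to your own installation.
url = count_only_url("http://localhost:8983/solr/select", '("SQL Server" OR SQL)')
print(url)
print(num_found(sample))
```

With rows=0 the response still carries numFound, so the count comes back without any document payload.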


David








Re: SOLR 3.5 Index Optimization not producing single .cfs file

2012-05-04 Thread pravesh
Thanks Mike,

> If you really must have a CFS (how come?) then you can call
> TieredMergePolicy.setNOCFSRatio(1.0) -- not sure how/where this is
> exposed in Solr though.

BTW, would this impact search performance? I tried a few
random keyword searches (without sorts or filters) on both systems (1.4.1
vs 3.5) and found that searches on 3.5 take longer than on 1.4.1 (around
10-20% slower). I haven't done any load testing yet.

Regards
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-3-5-Index-Optimization-not-producing-single-cfs-file-tp3958619p3961441.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Parent-Child relationship

2012-05-04 Thread tamanjit.bin...@yahoo.co.in
Hi,
As per my understanding the join is confined to a single core only and it is
not possible to have joins between docs of different cores. Am I correct
here? If yes, is there a possibility of having joins across cores anytime
soon?



search case: Elision and truncate in french

2012-05-04 Thread Claire Hernandez

Hi all,

I have a little problem. I can't find an easy configuration solution, but
maybe my Google search is wrong :)


- ElisionFilterFactory is enabled for both the search and index analyzers.
- The index contains: *l'aventure*
=> when I search for *l'avent**, Solr finds nothing

I have one solution, which doesn't look sexy: having another index
with a PatternReplaceCharFilterFactory which removes all ' in strings.


Some tips would be useful.

Thanks,
Claire;


RE: Advanced search with results matrix

2012-05-04 Thread Gnanakumar
> 1. If I understand correctly you just need to perform one query. Like so
> (translated to proper syntax of course):
> (SQL Server OR SQL) OR (Visual Basic OR VB.NET) OR (Java AND
> JavaScript)

No, it's not just one single query. Rather, as I've mentioned before, it's
a combination of searches, with a result count for each combination.
Explained in detail below:
1) (SQL Server OR SQL)
2) (Visual Basic OR VB.NET)
3) (Java AND JavaScript)
4) (SQL Server OR SQL) AND (Visual Basic OR VB.NET)
5) (Visual Basic OR VB.NET) AND (Java AND JavaScript)
6) (SQL Server OR SQL) AND (Java AND JavaScript)
7) (SQL Server OR SQL) AND (Visual Basic OR VB.NET) AND (Java AND
JavaScript)

Hope I made it clear.
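
The seven searches listed above are every non-empty subset of the three textboxes, joined with AND. A small sketch of generating them for any number of textboxes follows; the function name is illustrative, and real Solr syntax would additionally need phrases such as "SQL Server" quoted:

```python
from itertools import combinations


def combination_queries(boxes):
    """Return one AND-joined query per non-empty subset of the textboxes."""
    queries = []
    for size in range(1, len(boxes) + 1):
        for group in combinations(boxes, size):
            queries.append(" AND ".join("(" + text + ")" for text in group))
    return queries


boxes = ["SQL Server OR SQL", "Visual Basic OR VB.NET", "Java AND JavaScript"]
queries = combination_queries(boxes)
for q in queries:
    print(q)
```

The output order is by subset size, so the two-box rows may appear in a different order from the list above. For n textboxes this produces 2^n - 1 queries.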




Re: Advanced search with results matrix

2012-05-04 Thread Mikhail Khludnev
Hi,

Have you considered combining your subqueries into a disjunction
(BooleanClause.Occur.SHOULD) and requesting a facet.query count for each
combination?
http://wiki.apache.org/solr/SimpleFacetParameters#facet.query_:_Arbitrary_Query_Faceting
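
A rough sketch of what such a single request could look like: one main query plus one facet.query parameter per matrix cell, with rows=0 so that only counts come back. The base URL and helper name are assumptions, and the unfielded terms would be searched in whatever default field the schema defines:

```python
from urllib.parse import urlencode


def matrix_url(base_url, main_query, cell_queries):
    """Build one request that returns a count per cell via facet.query."""
    params = [("q", main_query), ("rows", "0"), ("facet", "true")]
    # facet.query may be repeated; each one yields its own count.
    params += [("facet.query", q) for q in cell_queries]
    return base_url + "?" + urlencode(params)


cells = [
    '("SQL Server" OR SQL)',
    '("SQL Server" OR SQL) AND (Java AND JavaScript)',
]
# The main query is the disjunction of the individual textboxes.
main = '("SQL Server" OR SQL) OR (Java AND JavaScript)'
url = matrix_url("http://localhost:8983/solr/select", main, cells)
print(url)
```

Each facet.query count should appear under facet_counts/facet_queries in the response, keyed by the query string, which would fill the matrix in one round trip.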






-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


problem with date searching.

2012-05-04 Thread ayyappan
Hi,

I'm having a slight problem with date searching. If I give the same date
for both ends of the range in the search query, it works fine, but when I
give a different date range I get no results.

Example:
select/?defType=dismax&q=[*2012-02-02T01:30:52Z TO 2012-02-02T01:30:52Z*]&qf=scanneddate

For this I get a result: <result name="response" numFound="20" start="0">

If I try a different date range:

[2012-02-02T01:30:52Z TO 2011-09-22T22:40:30Z]

there are no records at all. Please help me with this.



Re: problem with date searching.

2012-05-04 Thread ayyappan
Thanks for the quick response.

I tried your advice, [2011-09-22T22:40:30Z TO 2012-02-02T01:30:52Z],
but I am still not getting any results.



Re: problem with date searching.

2012-05-04 Thread Dmitry Kan
Unless something else is wrong, my question would be: do you have
documents in Solr stamped with these dates?
You could also try, as a test, specifying the field name directly:

q=scanneddate:[2011-09-22T22:40:30Z TO 2012-02-02T01:30:52Z]

Also, in your first e-mail you said you used

[*2012-02-02T01:30:52Z TO 2012-02-02T01:30:52Z*]

with asterisks (*). What scanneddate values did you get then?





-- 
Regards,

Dmitry Kan


Re: Faceting on a date field multiple times

2012-05-04 Thread Marc Sturlese
http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html



Word recognised in a search

2012-05-04 Thread mattia.martine...@gmail.com
Hi.

I'm making some searches using Apache SOLR 1.4, but I will upgrade to 3.6.

When SOLR uses stemming, it is very difficult to know which
words were really found (for example, if I search for "ups", SOLR finds
"up" too).
I need to know that because I need to highlight the found words in the
text, and I need to extract some strings from the source using those
words.

I hope I managed to explain my problem well :-)

Could you help me, please?

Thank you very much!
Bye.


Re: Faceting on a date field multiple times

2012-05-04 Thread Ian Holsman
Thanks Marc.



Re: Parent-Child relationship

2012-05-04 Thread Erick Erickson
See: https://issues.apache.org/jira/browse/LUCENE-3759

No time-frame mentioned though.

Best
Erick



Re: Word recognised in a search

2012-05-04 Thread Dmitry Kan
Have you tried the HighlightComponent? hl=true&hl.fl=orig_text_field
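
Once highlighting is on, the words that actually matched can be pulled out of the returned snippets. A small sketch, assuming the default <em>...</em> markers and a made-up snippet; the function name is illustrative:

```python
import re


def highlighted_terms(snippet, pre="<em>", post="</em>"):
    """Return the text between each pair of highlight markers."""
    pattern = re.escape(pre) + r"(.*?)" + re.escape(post)
    return re.findall(pattern, snippet)


# Hypothetical snippet, as it might come back for a search on "ups".
snippet = "He picked <em>up</em> the parcel that <em>UPS</em> delivered."
terms = highlighted_terms(snippet)
print(terms)
```

This shows the surface forms that the stemmer matched, which is what the original question asks for. If custom markers are configured for the highlighter, pass them as pre and post.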

- Dmitry





-- 
Regards,

Dmitry Kan


Re: get latest 50 documents the fastest way

2012-05-04 Thread Nagendra Nagarajayya
You can do this with Solr 4.0 with RankingAlgorithm 1.4.2. Please pass
the parameters below to your search:


age=latest&docs=50

For example:

http://localhost:8983/solr/select/?q=*:*&age=latest&docs=50

This inspects the latest 50 documents in real time and returns
results accordingly. Using *:* will not affect performance, and you
will not need any additional ranking or sorting.


Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 5/1/2012 7:38 AM, Yuval Dotan wrote:

Hi Guys
We have a use case where we need to get the 50 *latest *documents that
match my query - without additional ranking,sorting,etc on the results.
My index contains 1,000,000,000 documents, and I noticed that if the number
of found documents is very big (larger than 50% of the index size -
500,000,000 docs) then it takes more than 5 seconds to get the results, even
with the rows=50 parameter.
Is there a way to get the results faster?
Thanks
Yuval






Re: search case: Elision and truncate in french

2012-05-04 Thread Jack Krupansky
Unfortunately, use of a wildcard causes the normal token analysis processing 
to be completely bypassed, including the elision filter.  So, when using a 
wildcard you have to simulate in your head all of the analysis features, 
such as manually performing the elision.


-- Jack Krupansky




Re: search case: Elision and truncate in french

2012-05-04 Thread Erik Hatcher
Jack - that was true, until Solr 3.6+: 
http://wiki.apache.org/solr/MultitermQueryAnalysis

So, Claire, it's possible with the latest Solr release, to do this using bits 
and pieces of your existing analysis chain.

As Jack said, though, this is a manual chore in pre-Solr-3.6 releases.

Erik





Why would solr norms come up different from Lucene norms?

2012-05-04 Thread Benson Margulies
So, I've got some code that stores the same documents in a Lucene
3.5.0 index and a Solr 3.5.0 instance. It's only five documents.

For a particular field, the Solr norm is always 0.625, while the
Lucene norm is .5.

I've watched the code in NormsWriterPerField in both cases.

In Solr we've got .577, in naked Lucene it's .5.

I tried to check for boosts, and I don't see any non-1.0 document or
field boosts.

The Solr field is:

<field name="bt_rni_NameHRK_encodedName" type="text_ws" indexed="true"
       stored="true" multiValued="false" />


Single Index to Shards

2012-05-04 Thread michaelsever
If I have a single Solr index running on a Core, can I split it or migrate it
into 2 shards?



Re: search case: Elision and truncate in french

2012-05-04 Thread Jack Krupansky
Okay, the issue is that only *some* of the filters are multi-term aware 
and the elision filter is one that is NOT multi-term aware.
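
Since the elision filter is not applied to wildcard terms, one client-side workaround is to strip the elided article before sending the query. This is only a sketch: the article list below is an assumption and should mirror whatever the ElisionFilterFactory in your schema is configured with.

```python
import re

# Assumed list of elided articles; align it with your filter configuration.
ELISIONS = ("l", "m", "t", "qu", "n", "s", "j", "d", "c")

# Matches an elided article followed by a straight or typographic apostrophe.
_PATTERN = re.compile(r"^(?:" + "|".join(ELISIONS) + r")['\u2019]", re.IGNORECASE)


def strip_elision(term):
    """Remove a leading elided article so a wildcard term matches the index."""
    return _PATTERN.sub("", term, count=1)


print(strip_elision("l'avent*"))
print(strip_elision("aventure*"))
```

Applied to Claire's example, l'avent* becomes avent*, which can match the indexed token aventure.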


-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Friday, May 04, 2012 9:42 AM
To: solr-user@lucene.apache.org
Subject: Re: search case: Elision and truncate in french

Well, if it was fixed, then it is now broken again - in the 3.6 release!
Here’s a snippet from debugQuery showing that the generated query has the
elision intact in the analyzed term:

<str name="rawquerystring">text_fr:l'avion*</str>
<str name="querystring">text_fr:l'avion*</str>
<str name="parsedquery">+text_fr:l'avion*</str>
<str name="parsedquery_toString">+text_fr:l'avion*</str>

And for the same term without wildcard:

<str name="rawquerystring">text_fr:l'avion</str>
<str name="querystring">text_fr:l'avion</str>
<str name="parsedquery">+text_fr:avion</str>
<str name="parsedquery_toString">+text_fr:avion</str>

-- Jack Krupansky





RE: Single Index to Shards

2012-05-04 Thread Keswani, Nitin - BLS CTR
Yes you can split your index into multiple shards

More info on shards can be found here : 

http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding

Thanks.

Regards,

Nitin Keswani





Documents With large number of fields

2012-05-04 Thread Keswani, Nitin - BLS CTR
Hi,

My data model consists of different types of data. Each data type has its own
characteristics.

If I include the unique characteristics of each type of data, my single Solr
document could end up containing 300-400 fields.

To drill down into this data set I would have to provide faceting on
most of these fields, so that I can narrow the results to a very small set of
documents.

Here are my questions:

1) What's the best approach when dealing with documents with a large number of
fields?
Should I keep a single document with a large number of fields, or split my
document into a number of smaller documents, each consisting of some of the
fields?

2) From an operational point of view, what's the drawback of having a single
document with a very large number of fields?
Can Solr support documents with a large number of fields (say 300 to 400)?


Thanks.

Regards,

Nitin Keswani



Re: problem with date searching.

2012-05-04 Thread Erick Erickson
Right, you need to do the explicit qualification of the date field.
dismax parsing is intended to work with text-type fields, not
numeric or date fields. If you attach debugQuery=on, you'll
see that your scanneddate field is just dropped.

Furthermore, dismax was never intended to work with range
queries. Note this from the DisMaxQParserPlugin page:

 "extremely simplified subset of the Lucene QueryParser syntax"

I'll expand on this a bit on the Wiki page.
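
A small sketch of building the field-qualified range query for the standard (non-dismax) parser, with the earlier date always placed first. The helper name is illustrative, and the datetimes are assumed to already be in UTC:

```python
from datetime import datetime

SOLR_DATE = "%Y-%m-%dT%H:%M:%SZ"


def date_range_query(field, first, second):
    """Return field:[start TO end] with the earlier date as the lower bound."""
    start, end = sorted([first, second])
    return "{}:[{} TO {}]".format(
        field, start.strftime(SOLR_DATE), end.strftime(SOLR_DATE)
    )


# The two dates from this thread, deliberately passed in reverse order.
q = date_range_query(
    "scanneddate",
    datetime(2012, 2, 2, 1, 30, 52),
    datetime(2011, 9, 22, 22, 40, 30),
)
print(q)
```

The resulting string would be sent as q (or as an fq filter) without defType=dismax.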


Best
Erick



Re: Single Index to Shards

2012-05-04 Thread Erick Erickson
There's no way to split an _existing_ index into multiple shards, although
some of the work on SolrCloud is considering being able to do this. You
have a couple of choices here:

1) Just reindex everything from scratch into two shards.
2) Delete all the docs from your index that will go into shard 2, and
   index the docs for shard 2 in your new shard.
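
Either option needs a stable rule for deciding which shard a document belongs to. One common approach, shown here only as a sketch with made-up document ids, is to hash the unique key:

```python
import zlib


def shard_for(doc_id, num_shards=2):
    """Pick a shard from a stable hash of the document's unique key."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


ids = ["doc-1", "doc-2", "doc-3", "doc-4"]  # hypothetical unique keys
shards = {0: [], 1: []}
for doc_id in ids:
    shards[shard_for(doc_id)].append(doc_id)
print(shards)
```

The same function can drive both the delete pass on the old index and the indexing pass into the new shard, so the two stay consistent.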

But I want to be sure you're on the right track here. You only need to shard
if your index contains too many documents for your hardware to produce
decent query rates. If you are getting (and I'm picking this number out
of thin air) 50 QPS on your hardware (i.e. you're not stressing memory
etc) and just want to get to 150 QPS, use replication rather than sharding.

see: http://wiki.apache.org/solr/SolrReplication

Best
Erick



query keyword-tokenized fields with solrj

2012-05-04 Thread G.Long

Hi :)

In schema.xml I added a custom fieldType called keyword:

<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

and a field called article:

<field name="article" type="keyword" indexed="true" stored="true"/>

Now I would like to query this field using solrj. I'm using the
following code:

SolrQuery query = new SolrQuery("article:L. 111-5-2");
QueryResponse rsp = server.query(query);
list = rsp.getResults();

Even though there is only one entry in my index with the value "L.
111-5-2" in the field article, I get a lot of results because the
article value is not kept as a single token. I could change my string to
"article:\"L. 111-5-2\"" but I was wondering if there is a
prettier way to do that (programmatically with the solrj API)?


Gary


Re: Documents With large number of fields

2012-05-04 Thread Darren Govoni
I'm also interested in this. Same situation.





Re: how to present html content in browse

2012-05-04 Thread okayndc
Hello,

I'm having a hard time understanding this, and I had this same question.

When using DIH should the HTML field be stored in the raw HTML string field
or the stripped field?
Also what source field(s) need to be copied and to what destination?

Thanks


On Thu, May 3, 2012 at 10:15 PM, Lance Norskog goks...@gmail.com wrote:

 Make two fields, one with stores the stripped HTML and another that
 stores the parsed HTML. You can use copyField so that you do not
 have to submit the html page twice.

 You would mark the stripped field 'indexed=true stored=false' and the
 full text field the other way around. The full text field should be a
 String type.

 On Thu, May 3, 2012 at 1:04 PM, srini softtec...@gmail.com wrote:
  I am indexing records from database using DIH. The content of my record
 is in
  html format. When I use browse
  I would like to show the content in html format, not in text format. Any
  ideas?
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
  Sent from the Solr - User mailing list archive at Nabble.com.



 --
 Lance Norskog
 goks...@gmail.com



Re: 1MB file to Zookeeper

2012-05-04 Thread Yonik Seeley
On Fri, May 4, 2012 at 12:50 PM, Mark Miller markrmil...@gmail.com wrote:
 And how should we detect if data is compressed when
 reading from ZooKeeper?

 I was thinking we could somehow use file extensions?

 eg synonyms.txt.gzip - then you can use different compression algs depending 
 on the ext, etc.

 We would want to try and make it as transparent as possible though...

At first I thought about adding a marker to the beginning of a file, but
file extensions could work too, as long as the resource loader made it
transparent
(i.e. code would just need to ask for synonyms.txt, but the resource
loader would search
for synonyms.txt.gzip, etc, if the original name was not found)

Hmmm, but this breaks down for things like watches - I guess that's
where putting the encoding inside the file would be a better option.
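
The two ideas can be combined: fall back to a compressed file name when the plain one is missing, and let a marker inside the data decide whether to decompress. In this sketch a plain dict stands in for ZooKeeper and the function name is hypothetical; it is not how Solr's resource loader is implemented.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def load_resource(store, name):
    """Fetch name, falling back to name.gzip, and decompress if needed."""
    data = store.get(name)
    if data is None:
        data = store.get(name + ".gzip")
    if data is None:
        raise KeyError(name)
    # The marker in the data, not the file name, decides how to decode it.
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# A dict standing in for the ZooKeeper config node.
store = {
    "synonyms.txt.gzip": gzip.compress(b"tv, television\n"),
    "stopwords.txt": b"a\nthe\n",
}
print(load_resource(store, "synonyms.txt"))
print(load_resource(store, "stopwords.txt"))
```

Callers keep asking for synonyms.txt either way. A watch would still have to be set on the node that actually exists, which is the complication raised above.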

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


Re: Faceting on a date field multiple times

2012-05-04 Thread SUJIT PAL
Hi Ian,

I believe you may be able to use a bunch of facet.query parameters, something 
like this:

facet.query=yourfield:[NOW-1DAY TO NOW]
facet.query=yourfield:[NOW-2DAY TO NOW-1DAY]
...
and so on.
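
The repeated facet.query values can be generated rather than written by hand. A sketch for the per-day buckets follows, where the function name is illustrative and the field name is the placeholder used above. The same pattern would extend to week and year buckets:

```python
def daily_facet_queries(field, days):
    """Return one facet.query range per day, newest first."""
    queries = []
    for n in range(days):
        upper = "NOW" if n == 0 else "NOW-{}DAY".format(n)
        lower = "NOW-{}DAY".format(n + 1)
        queries.append("{}:[{} TO {}]".format(field, lower, upper))
    return queries


day_queries = daily_facet_queries("yourfield", 7)
for q in day_queries:
    print("facet.query=" + q)
```

All of these go into a single request, so the daily, weekly and yearly counts would come back together.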

-sujit

On May 3, 2012, at 10:41 PM, Ian Holsman wrote:

 Hi.
 
 I would like to be able to do a facet on a date field, but with different 
 ranges (in a single query).
 
 for example. I would like to show
 
 #documents by day for the last week - 
 #documents by week for the last couple of months
 #documents by year for the last several years.
 
 is there a way to do this without hitting solr 3 times?
 
 
 thanks
 Ian



Re: query keyword-tokenized fields with solrj

2012-05-04 Thread Jack Krupansky
You have an embedded space in your keyword value, which must be escaped
somehow. So, the actual query can be written as


article:"L. 111-5-2"

or

article:L.\ 111-5-2

The latter is slightly prettier, I suppose.

I suppose you could use a wildcard:

article:L.*111-5-2
article:L.?111-5-2

If you want to make it uglier, that would be easy:

article:L.\u0020111-5-2
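
To do the escaping programmatically, a helper can put a backslash in front of whitespace and query-syntax characters. The sketch below is written for illustration and its character set is an assumption. SolrJ has a comparable utility, ClientUtils.escapeQueryChars, which is worth checking in the version you use.

```python
# Characters treated as query syntax here; an assumed set for this sketch.
SPECIAL = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(value):
    """Backslash-escape whitespace and query-syntax characters in a value."""
    out = []
    for ch in value:
        if ch in SPECIAL or ch.isspace():
            out.append("\\")
        out.append(ch)
    return "".join(out)


escaped = escape_query_chars("L. 111-5-2")
print("article:" + escaped)
```

Only the field value is escaped, never the field name or the colon, so the parser still sees a normal fielded term.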

-- Jack Krupansky




Template in a database field does not work. Please Help

2012-05-04 Thread RTI QA
I specified a template in a field:

<field column="incident_id" name="object_id"
       template="inc-${incident.incident_id}" />

When doing a full import, for each row retrieved from Oracle, there is this
output in the console:

May 03, 2012 3:47:08 PM
org.apache.solr.handler.dataimport.TemplateTransformer transformRow
WARNING: Unable to resolve variable: incident.incident_id while parsing
expression: inc-${incident.incident_id}

Below is the data-config.xml file where the template is defined:

<dataConfig>
  <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//dbtest:1521/ORCL" user="user" password="xxx"/>
  <document>
    <entity name="incident"
            transformer="TemplateTransformer"
            query="select incident_id, ('inc-' || incident_id ) unique_id, long_desc from incident"
            deltaQuery="select incident_id from incident where last_update &gt; TO_DATE('${dataimporter.last_index_time}','-MM-DD HH24:MI:SS')">
      <field column="incident_id" name="incident_id"/>
      <field column="incident_id" name="object_id"
             template="inc-${incident.incident_id}" />
      <field column="unique_id" name="unique_id" />
      <field column="long_desc" name="long_desc" />
    </entity>
  </document>
</dataConfig>

I have tried changing the template to

template="inc-${incident_id}"

Still no luck; I get a similar error.

I don't know what the TemplateTransformer is looking for to match the
variable.


Thanks,

RTI QA


elevate vs. select numFound results

2012-05-04 Thread roxy.noord...@wwecorp.com
I need help understanding the difference in the numFound number in the result
when I execute two queries against my solr instance, one with the elevation
and one without. I have a simple elevate.xml file created and working and am
searching for terms that are not meant to be elevated.

Elevate query
example.com:8080/solr/elevate?q=dwayne+rock+johnson&wt=xml&sort=score+desc&rows=1
  for this the numFound is 125 in the result element of the XML

Select query
example.com:8080/solr/select?q=dwayne+rock+johnson&wt=xml&sort=score+desc&rows=1
  for this the numFound is 154 in the result element of the XML

For many (most all) of my queries the numFound results are the same (both
with elevated query strings and with strings not in elevate.xml), but this
one is very different.

Should they be the same? Any idea what could make them different?
Thank you.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/elevate-vs-select-numFound-results-tp3963200.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Facet and totaltermfreq

2012-05-04 Thread Dmitry Kan
I have tried (as a test) combining facets and term vectors (
http://wiki.apache.org/solr/TermVectorComponent ) in one query and was able
to get a list of facets, and for each facet there was a term frequency under
the termVectors section. I am not sure if that's what you are trying to achieve.

-Dmitry

On Fri, May 4, 2012 at 8:37 PM, Jamie Johnson jej2...@gmail.com wrote:

 Is it possible when faceting to return not only the strings but also
 the total term frequency for those facets?  I am trying to avoid
 building a customized faceting component and making multiple queries.
 In our scenario we have multivalued fields which may have duplicates
 and I would like to be able to get a count of how many documents that
 term appears (currently what faceting does) but also how many times
 that term appears in general.




-- 
Regards,

Dmitry Kan


Re: Invalid version expected 2, but 60 on CentOS

2012-05-04 Thread Mark Miller

On May 4, 2012, at 4:09 PM, Ravi Solr wrote:

 Thanking you in anticipation,

Generally this happens because the webapp server is returning an html error 
response of some kind. Often it's a 404.

I think in trunk this might have been addressed - that is, it's easier to see 
the true error. Not positive though.

Some non success html response is likely coming back though.

- Mark Miller
lucidimagination.com













Re: how to present html content in browse

2012-05-04 Thread Jack Krupansky
Evidently there was a problem with highlighting of HTML that is supposedly 
fixed in Solr 3.6 and trunk:


https://issues.apache.org/jira/browse/SOLR-42

-- Jack Krupansky

-Original Message- 
From: okayndc

Sent: Friday, May 04, 2012 4:35 PM
To: solr-user@lucene.apache.org
Subject: Re: how to present html content in browse

Is it possible to return the HTML field highlighted?

On Fri, May 4, 2012 at 1:27 PM, Jack Krupansky 
j...@basetechnology.comwrote:



1. The raw html field (call it, text_html) would be a string type
field that is stored but not indexed. This is the field you direct DIH
to output to. This is the field you would return in your search results
with the HTML to be displayed.

2. The stripped field (call it, text_stripped) would be a text type
field (where text is a field type you add that uses the HTML strip char
filter as shown below) that is not stored but is indexed. Add a
CopyField to your schema that copies from the raw html field to the
stripped field (say, text_html to text_stripped.)

For reference on HTML strip (HTMLStripCharFilterFactory), see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Which has:

<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

Although, you might want to call that field type text_stripped to avoid
confusion with a simple text field

You can add HTMLStripCharFilterFactory to some other field type that you
might want to use, but this charFilter needs to be before the
tokenizer. The text field type above is just an example.

-- Jack Krupansky

-Original Message- From: okayndc
Sent: Friday, May 04, 2012 1:01 PM
To: solr-user@lucene.apache.org
Subject: Re: how to present html content in browse


Hello,

I'm having a hard time understanding this, and I had this same question.

When using DIH, should the HTML field be stored in the raw HTML string
field or the stripped field?
Also what source field(s) need to be copied and to what destination?

Thanks


On Thu, May 3, 2012 at 10:15 PM, Lance Norskog goks...@gmail.com wrote:

 Make two fields, one which stores the stripped HTML and another that
 stores the parsed HTML. You can use copyField so that you do not
 have to submit the html page twice.

You would mark the stripped field 'indexed=true stored=false' and the
full text field the other way around. The full text field should be a
String type.

On Thu, May 3, 2012 at 1:04 PM, srini softtec...@gmail.com wrote:
 I am indexing records from database using DIH. The content of my record
 is in html format. When I use browse
 I would like to show the content in html format, not in text format.
 Any ideas?

 --
 View this message in context:
http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
 Sent from the Solr - User mailing list archive at Nabble.com.



--
Lance Norskog
goks...@gmail.com








Re: elevate vs. select numFound results

2012-05-04 Thread Jack Krupansky

Some ways that fewer docs might be returned by query elevation:

1. The exclude option: exclude="true" in the xml file.
2. The exclusive request parameter: exclusive=true in the URL. (Certainly 
not your case.)
3. The exclusive request parameter default set to true in defaults for 
the /elevate request handler in solrconfig.
4. Some other query-related parameters (e.g., qf) are different between 
your /select and /elevate request handlers


Try adding enableElevation=false to your URL for /elevate, which should 
show you whether query elevation itself is affecting the number of docs, or 
if it must be some other parameters that are different between the two 
request handlers.
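
A minimal sketch of the comparison suggested above: build the same /elevate request once with enableElevation=false and once with enableElevation=true, then compare numFound in the two responses. The host, port and query parameters are taken from the question; the class and helper names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: class and method names are illustrative, not part of Solr.
final class ElevationCheckSketch {

    /** URL-encodes one parameter name or value as UTF-8. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Appends the parameters to the handler URL in insertion order. */
    static String buildUrl(String handlerUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(handlerUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            sb.append(separator).append(encode(p.getKey()))
              .append('=').append(encode(p.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    /** The /elevate request from the question, with elevation switched on or off. */
    static String elevateUrl(boolean enableElevation) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", "dwayne rock johnson");
        params.put("wt", "xml");
        params.put("sort", "score desc");
        params.put("rows", "1");
        params.put("enableElevation", String.valueOf(enableElevation));
        return buildUrl("http://example.com:8080/solr/elevate", params);
    }

    public static void main(String[] args) {
        System.out.println(elevateUrl(false));
        System.out.println(elevateUrl(true));
    }
}
```

If numFound is the same for both URLs, the difference from /select most likely comes from other handler defaults rather than from elevation itself.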


-- Jack Krupansky

-Original Message- 
From: roxy.noord...@wwecorp.com

Sent: Friday, May 04, 2012 3:21 PM
To: solr-user@lucene.apache.org
Subject: elevate vs. select numFound results

I need help understanding the difference in the numFound number in the result
when I execute two queries against my solr instance, one with the elevation
and one without. I have a simple elevate.xml file created and working and am
searching for terms that are not meant to be elevated.

Elevate query
example.com:8080/solr/elevate?q=dwayne+rock+johnson&wt=xml&sort=score+desc&rows=1
 for this the numFound is 125 in the result element of the XML

Select query
example.com:8080/solr/select?q=dwayne+rock+johnson&wt=xml&sort=score+desc&rows=1
 for this the numFound is 154 in the result element of the XML

For many (most all) of my queries the numFound results are the same (both
with elevated query strings and with strings not in elevate.xml), but this
one is very different.

Should they be the same? Any idea what could make them different?
Thank you.





Re: Single Index to Shards

2012-05-04 Thread Lance Norskog
If you are not using SolrCloud, splitting an index is simple:
1) copy the index
2) remove what you do not want via delete-by-query
3) Optimize!

#2 brings up a basic design question: you have to decide which
documents go to which shards. Mostly people use a value generated by a
hash on the actual id- this allows you to assign docs evenly.

http://wiki.apache.org/solr/UniqueKey
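
A rough sketch of the hash-on-id idea, assuming a string unique key and a fixed number of shards. CRC32 is used here only because it is stable across JVMs; the thread does not say which hash people actually use, and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch only: one possible way to spread documents evenly over shards.
final class ShardAssignmentSketch {

    /** Maps a unique key to a shard number in the range [0, numShards). */
    static int shardFor(String uniqueKey, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(uniqueKey.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is non-negative too.
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, 2));
        }
    }
}
```

For step 2 this only helps if the shard number is also stored in an indexed field at index time (an assumption here), since a delete-by-query cannot compute the hash itself.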

On Fri, May 4, 2012 at 4:28 PM, Young, Cody cody.yo...@move.com wrote:
 You can also make a copy of your existing index, bring it up as a second 
 instance/core and then send delete queries to both indexes.

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, May 04, 2012 8:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Single Index to Shards

 There's no way to split an _existing_ index into multiple shards, although 
 some of the work on SolrCloud is considering being able to do this. You have 
 a couple of choices here:

 1) Just reindex everything from scratch into two shards
 2) delete all the docs from your index that will go into shard 2 and
    just index the docs for shard 2 in your new shard

 But I want to be sure you're on the right track here. You only need to shard 
 if your index contains too many documents for your hardware to produce 
 decent query rates. If you are getting (and I'm picking this number out of 
 thin air) 50 QPS on your hardware (i.e. you're not stressing memory
 etc) and just want to get to 150 QPS, use replication rather than sharding.

 see: http://wiki.apache.org/solr/SolrReplication

 Best
 Erick

 On Fri, May 4, 2012 at 9:44 AM, michaelsever sever_mich...@bah.com wrote:
 If I have a single Solr index running on a Core, can I split it or
 migrate it into 2 shards?

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Single-Index-to-Shards-tp3962380.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Minor type in example solrconfig: process of provided docuemnts

2012-05-04 Thread Jack Krupansky
I noticed this minor typo in the example solrconfig.xml for both 3.6 and trunk 
(as of 5/1):

An analysis handler that provides a breakdown of the analysis
process of provided docuemnts. This handler expects a (single)

“docuemnts” should be “documents”.

-- Jack Krupansky

Re: SOLRJ: Is there a way to obtain a quick count of total results for a query

2012-05-04 Thread Li Li
Sorting by document id instead of scoring by relevance may speed it up a
little. I haven't tested this, but maybe you can give it a try. Scoring
consumes some CPU time, and you only want to match and get the total count.

On Wed, May 2, 2012 at 11:58 PM, vybe3142 vybe3...@gmail.com wrote:
 I can achieve this by building a query with start and rows = 0, and using
 queryResponse.getResults().getNumFound().

 Are there any more efficient approaches to this?

 Thanks

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SOLRJ-Is-there-a-way-to-obtain-a-quick-count-of-total-results-for-a-query-tp3955322.html
 Sent from the Solr - User mailing list archive at Nabble.com.
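
For callers that talk to Solr over plain HTTP instead of SolrJ, the same count can be read from the numFound attribute of a rows=0 response. The following is a sketch under the assumption that the response uses the usual wt=xml layout with a result element named "response"; the sample XML in main is a hand-written stand-in, not real server output, and the class name is invented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only: reads the total hit count from an XML search response.
final class NumFoundSketch {

    /** Returns the numFound attribute of the <result name="response"> element. */
    static long numFound(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parseable XML response", e);
        }
        NodeList results = doc.getElementsByTagName("result");
        for (int i = 0; i < results.getLength(); i++) {
            Element result = (Element) results.item(i);
            if ("response".equals(result.getAttribute("name"))) {
                return Long.parseLong(result.getAttribute("numFound"));
            }
        }
        throw new IllegalArgumentException("no result element named response");
    }

    public static void main(String[] args) {
        // Hand-written sample of a rows=0 response body (illustrative values).
        String sample = "<response><result name=\"response\" numFound=\"42\" start=\"0\"/></response>";
        System.out.println(numFound(sample));
    }
}
```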


RE: elevate vs. select numFound results

2012-05-04 Thread Noordeen, Roxy
I modified my solrconfig.xml to:
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <bool name="omitHeader">true</bool>
    <float name="tie">0.01</float>
    <str name="pf">content^2.0</str>
    <int name="ps">15</int>
    <!-- Abort any searches longer than 4 seconds -->
    <!-- <int name="timeAllowed">4000</int> -->
    <str name="mm">1</str>
    <str name="q.alt">*:*</str>
  </lst>
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>

Then I added the enableElevation=true parameter to my elevate URL:
http://mydomain:8181/solr/elevate?q=dwayne+rock+johnson&wt=xml&sort=score+desc&fl=id,bundle_name&exclusive=true&debugQuery=on&enableElevation=true

This made my /elevate parsed query match my /select query, and I got back
the same numFound.

My parsedquery:
<str name="parsedquery">
+((DisjunctionMaxQuery((content:dwayn)~0.01)
DisjunctionMaxQuery((content:rock)~0.01)
DisjunctionMaxQuery((content:johnson)~0.01))~1)
DisjunctionMaxQuery((content:"dwayn rock johnson"~15^2.0)~0.01)
</str>


But it would be nice to make exclusive=true work and get an empty result set
back when there is no matching elevation query.
Are there any solrconfig settings to do so?




-Original Message-
From: Noordeen, Roxy [mailto:roxy.noord...@wwecorp.com] 
Sent: Friday, May 04, 2012 8:11 PM
To: solr-user@lucene.apache.org
Subject: RE: elevate vs. select numFound results

My actual problem is with elevate not working with exclusive=true. I have a
special pinned widget that has to display only the nodes defined in my
elevate.xml, a kind of sponsored results.

If I define game in my elevate.xml and send exclusive=true, I get only the
elevated entries.
http://mydomain:8181/solr/elevate?q=game&wt=xml&sort=score+desc&fl=id,bundle_name&exclusive=true

But when I pass a word not defined in my elevate.xml and send
exclusive=true, I get almost the same results as with the /select query.
http://mydomain:8181/solr/elevate?q=gamenotdefined&wt=xml&sort=score+desc&fl=id,bundle_name&exclusive=true

So I ended up using both elevate and select: if the numbers [numFound] match
in both requests, I assume the word does not exist in my elevate.xml, and I
hide my pinned widget.
But in a few cases, my /elevate and /select do not return the same numFound.
There are some differences in the numbers.

Is there a way to force exclusive=true to look only at elevate.xml entries
and ignore the results from the default search?

 answer to your questions:

1. There is no exclude=true parameter set in my elevate.xml

2. There is no exclusive=true set in the URL

3. My elevate entry in solrconfig.xml:
<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <!-- pick a fieldType to analyze queries -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
  <!-- <str name="refreshOnCommmit">true</str> -->
</searchComponent>
<!-- a request handler utilizing the elevator component -->
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>


4. I am not sure how to verify the qf difference. I am using the raw
schema.xml and solrconfig.xml shipped with the Drupal Solr module. I manage
most of the Solr configs via the Drupal module, except that at query time I
send queries to Solr directly.




-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, May 04, 2012 5:44 PM
To: solr-user@lucene.apache.org
Subject: Re: elevate vs. select numFound results

Some ways that fewer docs might be returned by query elevation:

1. The exclude option: exclude="true" in the xml file.
2. The exclusive request parameter: exclusive=true in the URL. (Certainly 
not your case.)
3. The exclusive request parameter default set to true in defaults for 
the /elevate request handler in solrconfig.
4. Some other query-related parameters (e.g., qf) are different between 
your /select and /elevate request handlers

Try adding enableElevation=false to your URL for /elevate, which should 
show you whether query elevation itself is affecting the number of docs, or 
if it must be some other parameters that are different between the two 
request handlers.

-- Jack Krupansky


Re: how to present html content in browse

2012-05-04 Thread Lance Norskog
You need positions and offsets to do highlighting. A CharFilter does
not preserve positions.

I think you have to analyze the raw HTML with a different Analyzer, as
well as the stripper. I think this is how it works: use a new Analyzer
stack that uses the StandardAnalyzer, and the lower case filter and
stemmer/synonym etc. Now, store the HTML field with that text type.
You then search on the stripped field, but highlight from the raw
field with 'hl.fl'.

Here's the cool part: you do not actually need to index the raw HTML,
only store it. If you do not index a field, the Highlighter analyzes
the HTML when it needs the positions and offsets.

On Fri, May 4, 2012 at 2:25 PM, okayndc bodymo...@gmail.com wrote:
 Okay, thanks for the info.

 On Fri, May 4, 2012 at 4:42 PM, Jack Krupansky j...@basetechnology.comwrote:

 Evidently there was a problem with highlighting of HTML that is supposedly
 fixed in Solr 3.6 and trunk:

 https://issues.apache.org/jira/browse/SOLR-42


 -- Jack Krupansky

 -Original Message- From: okayndc
 Sent: Friday, May 04, 2012 4:35 PM

 To: solr-user@lucene.apache.org
 Subject: Re: how to present html content in browse

 Is it possible to return the HTML field highlighted?

 On Fri, May 4, 2012 at 1:27 PM, Jack Krupansky j...@basetechnology.com**
 wrote:

  1. The raw html field (call it, text_html) would be a string type
 field that is stored but not indexed. This is the field you direct DIH
 to output to. This is the field you would return in your search results
 with the HTML to be displayed.

 2. The stripped field (call it, text_stripped) would be a text type
 field (where text is a field type you add that uses the HTML strip char
 filter as shown below) that is not stored but is indexed. Add a
 CopyField to your schema that copies from the raw html field to the
 stripped field (say, text_html to text_stripped.)

 For reference on HTML strip (HTMLStripCharFilterFactory), see:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
 


 Which has:

 <fieldtype name="text" class="solr.TextField">
   <analyzer>
     <charFilter class="solr.HTMLStripCharFilterFactory"/>
     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldtype>

 Although, you might want to call that field type text_stripped to avoid
 confusion with a simple text field

 You can add HTMLStripCharFilterFactory to some other field type that you
 might want to use, but this charFilter needs to be before the
 tokenizer. The text field type above is just an example.

 -- Jack Krupansky

 -Original Message- From: okayndc
 Sent: Friday, May 04, 2012 1:01 PM
 To: solr-user@lucene.apache.org
 Subject: Re: how to present html content in browse


 Hello,

 I'm having a hard time understanding this, and I had this same question.

 When using DIH should the HTML field be stored in the raw HTML string
 field
 or the stripped field?
 Also what source field(s) need to be copied and to what destination?

 Thanks


 On Thu, May 3, 2012 at 10:15 PM, Lance Norskog goks...@gmail.com wrote:

  Make two fields, one with stores the stripped HTML and another that

 stores the parsed HTML. You can use copyField so that you do not
 have to submit the html page twice.

 You would mark the stripped field 'indexed=true stored=false' and the
 full text field the other way around. The full text field should be a
 String type.

 On Thu, May 3, 2012 at 1:04 PM, srini softtec...@gmail.com wrote:
  I am indexing records from database using DIH. The content of my record
  is in html format. When I use browse
  I would like to show the content in html format, not in text format.
  Any ideas?
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
 

  Sent from the Solr - User mailing list archive at Nabble.com.



 --
 Lance Norskog
 goks...@gmail.com








-- 
Lance Norskog
goks...@gmail.com


Re: Solr Merge during off peak times

2012-05-04 Thread Lance Norskog
Optimize takes a 'maxSegments' option. This tells it to stop when
there are N segments instead of just one.

If you use a very high mergeFactor and then call optimize with a sane
number like 50, it only merges the little teeny segments.

On Thu, May 3, 2012 at 8:28 PM, Shawn Heisey s...@elyograg.org wrote:
 On 5/2/2012 5:54 AM, Prakashganesh, Prabhu wrote:

 We have a fairly large scale system - about 200 million docs and fairly
 high indexing activity - about 300k docs per day with peak ingestion rates
 of about 20 docs per sec. I want to work out what a good mergeFactor setting
 would be by testing with different mergeFactor settings. I think the default
 of 10 might be high, I want to try with 5 and compare. Unless I know when a
 merge starts and finishes, it would be quite difficult to work out the
 impact of changing mergeFactor. I want to be able to measure how long merges
 take, run queries during the merge activity and see what the response times
 are etc..


 With a lot of indexing activity, if you are attempting to avoid large
 merges, I would think you would want a higher mergeFactor, not a lower one,
 and do occasional optimizes during non-peak hours.  With a small
 mergeFactor, you will be merging a lot more often, and you are more likely
 to encounter merges of already-merged segments, which can be very slow.

 My index is nearing 70 million documents.  I've got seven shards - six large
 indexes with about 11.5 million docs each, and a small index that I try to
 keep below half a million documents.  The small index contains the newest
 documents, between 3.5 and 7 days worth.  With this setup and the way I
 manage it, large merges pretty much never happen.

 Once a minute, I do an update cycle.  This looks for and applies deletions,
 reinserts, and new document inserts.  New document inserts happen only on
 the small index, and there are usually a few dozen documents to insert on
 each update cycle.  Deletions and reinserts can happen on any of the seven
 shards, but there are not usually deletions and reinserts on every update
 cycle, and the number of reinserts is usually very very small.  Once an
 hour, I optimize the small index, which takes about 30 seconds.  Once a day,
 I optimize one of the large indexes during non-peak hours, so every large
 index gets optimized once every six days.  This takes about 15 minutes,
 during which deletes and reinserts are not applied, but new document inserts
 continue to happen.

 My mergeFactor is set to 35.  I wanted a large value here, and this
 particular number has a side effect -- uniformity in segment filenames on
 the disk during full rebuilds.  Lucene uses a base-36 segment numbering
 scheme.  I usually end up with less than 10 segments in the larger indexes,
 which means they don't do merges.  The small index does do merges, but I
 have never had a problem with those merges going slowly.
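
To illustrate the base-36 numbering mentioned above: assuming segment names are an underscore followed by a running counter written in base 36 (which is how I understand the scheme; the class name below is invented), counters 0 through 34 come out as _0 through _y and counter 35 as _z. If each batch of 35 flushed segments is then merged into the segment that takes the next counter value, which is an assumption about merge timing that this thread does not confirm, the merged segments would be _z, _1z and so on.

```java
// Sketch only: shows how a running counter maps to base-36 segment names.
final class SegmentNameSketch {

    /** Formats a segment counter as "_" plus the counter in base 36. */
    static String segmentName(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        for (int counter : new int[] {0, 34, 35, 36, 70, 71}) {
            System.out.println(counter + " -> " + segmentName(counter));
        }
    }
}
```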

 Because I do occasionally optimize, I am fairly sure that even when I do
 have merges, they happen with 35 very small segment files, and leave the
 large initial segment alone.  I have not tested this theory, but it seems
 the most sensible way to do things, and I've found that Lucene/Solr usually
 does things in a sensible manner.  If I am wrong here (using 3.5 and its
 improved merging), I would appreciate knowing.

 Thanks,
 Shawn




-- 
Lance Norskog
goks...@gmail.com


Re: Searching by location – What do I send to Solr?

2012-05-04 Thread Lance Norskog
You could just download postalcodes every day. To be nice, you could
pull the HEAD of each file and check if it is new.

This is just a set of tables, which you denormalize and add to your
other fields.

There are other sources of polygonal shape data, but there is no
official Solr toolkit for querying inside the irregular polygon.

On Thu, May 3, 2012 at 6:19 PM, Erick Erickson erickerick...@gmail.com wrote:
 The fact that they're python and java is largely beside the point I think.
 Solr just sees a URL, the fact that your Python app gets in there
 first and does stuff with the query wouldn't affect Solr at all.

 Also, I tend to like keeping Solr fairly lean so any work I can offload to
 the application I usually do.

 YMMV

 Best
 Erick

 On Thu, May 3, 2012 at 6:43 PM, Spadez james_will...@hotmail.com wrote:
 I discounted geonames to start with but it actually looks pretty good. I may
 be stretching the limit of my question here, but say I did go with geonames,
 if I go back to my model and add a bit:

 Search for London -> Convert London to Long/Lat -> Send Query to Solr ->
 Return Query

 Since my main website is coded in Python, but Solr works in Java, if I was
 to create or use an existing script to allow me to convert London to
 Long/Lat, would it make more sense for this operation to be done in Python
 or Java?

 In Python it would integrate better with my website, but in Java it would
 integrate better with Solr. Also would one language be more suitable or
 faster for this kind of operation?

 Again, I might be pushing the boundaries of what I can ask on here, but if
 anyone can chime in with their opinion I would really appreciate it.

 ~ James

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Searching-by-location-What-do-I-send-to-Solr-tp3959296p3960666.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: Phrase Slop probelm

2012-05-04 Thread Lance Norskog
Maybe it could throw an exception because the user is clearly trying
to do something impossible.

On Wed, May 2, 2012 at 3:19 PM, Jack Krupansky j...@basetechnology.com wrote:
 You are missing the pf, pf2, and pf3 request parameters, which say
 which fields to do phrase proximity boosting on.

 pf boosts using the whole query as a phrase, pf2 boosts bigrams, and
 pf3 boosts trigrams.

 You can use any combination of them, but if you use none of them, ps
 appears to be ignored.
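
As a sketch of what that means for the request in the question: the parameters below are the ones from the original URLs plus a pf entry, so that ps has a phrase field to act on. Using textoboost as the pf field is an assumption (it simply mirrors qf), and the class and method names are illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: class and method names are illustrative, not part of Solr.
final class PhraseBoostParamsSketch {

    /** URL-encodes one parameter name or value as UTF-8. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** edismax parameters from the question, plus pf so that ps can take effect. */
    static Map<String, String> params(String phraseField, int slop) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("q", "term1 term2 term3");
        p.put("defType", "edismax");
        p.put("qf", "textoboost");
        p.put("pf", phraseField);          // field(s) used for phrase boosting
        p.put("ps", String.valueOf(slop)); // slop applied to the pf phrase
        p.put("mm", "100%");
        return p;
    }

    /** Joins the parameters into a query string in insertion order. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(params("textoboost", 10)));
    }
}
```

With pf present, changing ps between 0 and 10 should change the scores (and so the ordering) of matching documents, not the set of matches.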

 Maybe it should default to doing some boost if none of the field lists is
 given, like boost using bigrams in the qf fields, but it doesn't.

 -- Jack Krupansky

 -Original Message- From: André Maldonado
 Sent: Wednesday, May 02, 2012 3:29 PM
 To: solr-user@lucene.apache.org
 Subject: Phrase Slop probelm


 Hi all.

 In my index I have a multivalued field that contains a lot of information;
 all text searches are based on it. So, when I do:

 http://xxx.xx.xxx.xxx:/Index/select/?start=0&rows=12&q=term1+term2+term3&qf=textoboost&fq=field1%3aanother_term&defType=edismax&mm=100%25

 I get the same result as with:

 http://xxx.xx.xxx.xxx:/Index/select/?start=0&rows=12&q=term1+term2+term3&ps=0&qf=textoboost&fq=field1%3aanother_term&defType=edismax&mm=100%25

 and the same result with:

 http://xxx.xx.xxx.xxx:/Index/select/?start=0&rows=12&q=term1+term2+term3&ps=10&qf=textoboost&fq=field1%3aanother_term&defType=edismax&mm=100%25

 What am I doing wrong?

 Thanks

 *
 --
 *
 *E conhecereis a verdade, e a verdade vos libertará. (João 8:32)*

 *andre.maldonado*@gmail.com andre.maldon...@gmail.com
 (11) 9112-4227




-- 
Lance Norskog
goks...@gmail.com


Re: correct XPATH syntax

2012-05-04 Thread Lance Norskog
The XPath implementation in DIH is very minimal- it is tuned for
speed, not features. The XSL option lets you do everything you could
want, with a slower engine.

On Thu, May 3, 2012 at 7:30 AM, lboutros boutr...@gmail.com wrote:
 ok, not that easy :)

 I did not test it myself but it seems that you could use an XSL
 preprocessing with the 'xsl' option in your XPathEntityProcessor :

 http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

 You could transform the author part as you wish and then import the author
 field with your actual configuration.

 Ludovic.

 -
 Jouve
 France.
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/correct-XPATH-syntax-tp3951804p3959397.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com


Re: solr snapshots - old school and replication - new school ?

2012-05-04 Thread Lance Norskog
Yes. Replication is a lot easier to use and does a lot more.

On Thu, May 3, 2012 at 6:00 AM, geeky2 gee...@hotmail.com wrote:
 hello all,

 environment: CentOS and Solr 3.5

 i want to make sure i understand the difference between snapshots and solr
 replication.

 snapshots are "old school" and have been deprecated in favor of solr
 replication, which is "new school".

 do i have this correct?

 btw: i have replication working (now), between my master and two slaves - i
 just want to make sure i am not missing a larger picture ;)

 i have been reading the Smiley Pugh book (pg 349) as well as material on the
 wiki at:

 http://wiki.apache.org/solr/SolrCollectionDistributionScripts

 http://wiki.apache.org/solr/SolrReplication


 thank you,



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/solr-snapshots-old-school-and-replication-new-school-tp3959152.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com