Re: Problem while indexing XML file with special characters represented uuml

2012-07-11 Thread Mike Sokolov
I think the issue here is that DIH uses Woodstox BasicStreamReader 
(see 
http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html) 
which has only minimal DTD support.  It might be best to use 
ValidatingStreamReader 
(http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/ValidatingStreamReader.html) 
instead.


I think you could get this by requesting a validating XmlReader; that's
a setting exposed at the factory level (the factory is what returns the
parser, ie the XmlReader).  But then you would probably also get
validation turned on, which might not be so great in all cases.  This
should probably be a user setting for XPathEntityProcessor somewhere?
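
For what it's worth, here is a rough, untested sketch of the StAX factory
properties involved; whichever component creates the reader would need to set
these before parsing (the file path is just the one from this thread):

import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class ValidatingReaderSketch {
    public static void main(String[] args) throws Exception {
        // Woodstox is a standard StAX implementation, so it is configured
        // through the javax.xml.stream.XMLInputFactory properties
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // read the DTD at all, so the ENTITY declarations are seen
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
        // replace references such as &uuml; with their declared values
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        // full validation is what gets you the ValidatingStreamReader --
        // along with validation errors for invalid documents
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.TRUE);
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileReader("documents/dblp.xml"));
        while (reader.hasNext()) {
            reader.next(); // entities arrive expanded in CHARACTERS events
        }
        reader.close();
    }
}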


-Mike

On 07/10/2012 07:10 PM, Chris Hostetter wrote:

: Somebody any idea? Solr seems to ignore the DTD definition and therefore
: does not understand the entities like &uuml; or &auml; that are defined in
: the dtd. Is that the problem? If yes, how can I tell SOLR to consider the DTD
: definition?

Solr is just utilizing the built-in java XML parser for this, so there's
nothing you can tell solr to consider the DTD, but it is odd that this
isn't working by default with java's parser -- I suspect there is some
hint XPathEntityProcessor should be giving the parser to ask it to look
at these ENTITY declarations.

I've filed a Jira issue to try and track this (and included a test case)
but unfortunately I don't really know what the fix is...

https://issues.apache.org/jira/browse/SOLR-3614



-Hoss


Re: Problem while indexing XML file with special characters represented uuml

2012-07-10 Thread Mike Sokolov
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't 
use a true XML parser?


You might want to try passing your documents through xmllint -noent
(basically parse and reserialize) - that should inline the entities as
plain UTF-8 characters.
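
A minimal Java equivalent of that, in case xmllint isn't handy (file names
are placeholders); DOM parsers expand entities by default, so parsing and
re-serializing leaves plain characters in the output:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class InlineEntities {
    public static void main(String[] args) throws Exception {
        // parse with entity expansion (the default) ...
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("in.xml"));
        // ... then write the tree back out with an identity transform
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(new File("out.xml")));
    }
}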


On 07/09/2012 03:18 PM, Michael Belenki wrote:

Somebody any idea? Solr seems to ignore the DTD definition and therefore
does not understand the entities like &uuml; or &auml; that are defined in
the dtd. Is that the problem? If yes, how can I tell SOLR to consider the DTD
definition?

On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki v...@belenki.name
wrote:

Dear community,

I am experiencing a strange problem while trying to index / import an XML
document into SOLR via DataImportHandler. The XML document contains some
special characters (e.g. the german ü) that are represented as XML entities
&uuml; or &auml;. There is also a DTD that defines these entities
(<!ENTITY uuml "ü">); I tried using a separate dtd file as well as
including the DTD definition in the xml itself. After I start the import
command full-import, the import process throws an exception as soon as it
tries to parse &uuml;: "Undeclared general entity uuml". Did anyone already
face such a problem?

best regards,

Michael


My data-config for importing is:


<dataConfig>
  <dataSource type="FileDataSource" encoding="ISO-8859-1" />
  <document>
    <!-- stream should be true since a huge xml document is being parsed -->
    <entity name="article"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/dblp/article"
            url="documents/dblp.xml">
      <field column="key"   xpath="/dblp/article/@key" />
      <field column="title" xpath="/dblp/article/title" />
    </entity>
  </document>
</dataConfig>

The XML file looks e.g. like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
  <!ENTITY uuml "ü"> <!-- small u, dieresis or umlaut mark -->
]>
<dblp>
<article key="journals/fm/Riccardi09" mdate="2011-10-27">
<author>Marco Riccardi</author>
<title>Solution of Cubic and Quartic Equations.&uuml;</title>
<pages>117-122</pages>
<year>2009</year>
<volume>17</volume>
<journal>Formalized Mathematics</journal>
<number>1-4</number>
<ee>http://dx.doi.org/10.2478/v10037-009-0012-z</ee><url>db/journals/fm/fm17.html#Riccardi09</url>
</article>
</dblp>

The stack-trace is:

05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
05.07.2012 17:37:19 org.apache.solr.common.SolrException log
SCHWERWIEGEND: Full Import failed: java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException:
Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2
last row in this xml:{title=Common Subexpression Identification in General
Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed
for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this
xml:{title=Common Subexpression Identification in General Algebraic Systems.,
$forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
        ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2
last row in this xml:{title=Common Subexpression Identification in General
Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
        at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
        at

highlighting field boundary detection

2012-06-19 Thread Mike Sokolov
Does anybody know of a way to detect when the highlight snippet begins 
at the beginning of the field or ends at the end of the field using one 
of the standard highlighters shipped w/Solr?  We'd like to display 
ellipses only when there is additional text surrounding the snippet in 
the original text.


-Mike


Re: Efficiently mining or parsing data out of XML source files

2012-06-06 Thread Mike Sokolov
I agree, that seems odd.  We routinely index XML using either 
HTMLStripCharFilter, or XmlCharFilter (see patch: 
https://issues.apache.org/jira/browse/SOLR-2597), both of which parse 
the XML, and we don't see such a huge speed difference from indexing 
other field types.  XmlCharFilter also allows you to specify which 
elements to index if you don't want the whole file.


-Mike

On 6/3/2012 8:42 AM, Erick Erickson wrote:

This seems really odd. How big are these XML files? Where are you parsing them?
You could consider using a SolrJ program with a SAX-style parser.

But the first question I'd answer is "what is slow?". The implication
of your post is that parsing the XML is the slow part; it really
shouldn't be taking anywhere near this long IMO...

Best
Erick

On Thu, May 31, 2012 at 9:14 AM, Van Tassell, Kristian
kristian.vantass...@siemens.com  wrote:

I'm just wondering what the general consensus is on indexing XML data to Solr 
in terms of parsing and mining the relevant data out of the file and putting 
them into Solr fields. Assume that this is the XML file and resulting Solr 
fields:

XML data:
<mydoc id="1234">
  <title>foo</title>
  <bar attr1="val1"/>
  <baz>garbage data</baz>
</mydoc>

Solr Fields:
Id=1234
Title=foo
Bar=val1

I'd previously set this process up using XSLT and have since tested using
XMLBeans, JAXB, etc. to get the relevant data. The speed at which this occurs,
however, is not acceptable: 2800 objects take 11 minutes to parse and index
into Solr.

The big slowdown appears to be that I'm parsing the data with an XML parser.

So, now I'm testing mining the data by opening the file as just a text file 
(using Groovy) and picking out relevant data using regular expression matching. 
I'm now able to parse (mine) the data and index the 2800 files in 72 seconds.

So I'm wondering if the typical solution is to go with a non-XML approach.
It seems to make sense, considering the search index only wants to store
as much of the relevant data as possible and not rely on the incoming
documents being xml compliant.

Thanks in advance for any thoughts on this!
-Kristian



creating SchemaField and FieldType programmatically

2012-06-02 Thread Mike Sokolov
I'm creating some Solr plugins that index and search documents in a 
special way, and I'd like to make them as easy as possible to 
configure.  Ideally I'd like users to be able to just drop a jar in 
place without having to copy any configuration into schema.xml, although 
I suppose they will have to register the plugins in solrconfig.xml.


I tried making my UpdateProcessor core aware and creating FieldTypes 
and SchemaFields in the inform(SolrCore) method.  This was a good start, 
but I'm running into some issues getting the types properly 
initialized.  One of my types, for example, derives from TextField, but 
this seems to require an initialization pass in order to get its 
properties set up properly.  What I'm seeing is that my field values 
aren't being tokenized, even though I specify TOKENIZED when I create 
the SchemaField.  I'm beginning to get the feeling I'm doing something 
not-quite anticipated by the API designers.


My question is: is there a way to go about doing something like this 
that isn't swimming upstream?  Should I just give up and require users 
to incorporate my schema in the xml config?


Here is a code snippet for anyone willing to dig in a little:

/** Called when each core is initialized; we ensure that lux fields are configured. */

public void inform(SolrCore core) {
    IndexSchema schema = core.getSchema();
    Map<String,SchemaField> fields = schema.getFields();
    if (fields.containsKey("lux_path")) {
        return;
    }
    Map<String,FieldType> fieldTypes = schema.getFieldTypes();
    FieldType luxTextWs = fieldTypes.get("lux_text_ws");
    if (luxTextWs == null) {
        luxTextWs = new TextField();
        luxTextWs.setAnalyzer(new WhitespaceGapAnalyzer());
        luxTextWs.setQueryAnalyzer(new WhitespaceGapAnalyzer());
        fieldTypes.put("lux_text_ws", luxTextWs);
    }
    fields.put("lux_path", new SchemaField("lux_path", luxTextWs, 0x233, ""));
        // 0x233 = INDEXED | TOKENIZED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_elt_name", new SchemaField("lux_elt_name", new StrField(), 0x231, ""));
        // 0x231 = INDEXED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_att_name", new SchemaField("lux_att_name", new StrField(), 0x231, ""));

    // must call this after making changes to the field map:
    schema.refreshAnalyzers();
}


Re: creating SchemaField and FieldType programmatically

2012-06-02 Thread Mike Sokolov
OK, never mind - all is well: I had a mismatch between the 
schema-declared field and my programmatic field, where I was overzealous 
in using OMIT_TF_POSITIONS.


-Mike

On 6/2/2012 5:02 PM, Mike Sokolov wrote:
I'm creating some Solr plugins that index and search documents in a 
special way, and I'd like to make them as easy as possible to 
configure.  Ideally I'd like users to be able to just drop a jar in 
place without having to copy any configuration into schema.xml, 
although I suppose they will have to register the plugins in 
solrconfig.xml.


I tried making my UpdateProcessor core aware and creating FieldTypes 
and SchemaFields in the inform(SolrCore) method.  This was a good 
start, but I'm running into some issues getting the types properly 
initialized.  One of my types, for example, derives from TextField, 
but this seems to require an initialization pass in order to get its 
properties set up properly.  What I'm seeing is that my field values 
aren't being tokenized, even though I specify TOKENIZED when I create 
the SchemaField.  I'm beginning to get the feeling I'm doing something 
not-quite anticipated by the API designers.


My question is: is there a way to go about doing something like this 
that isn't swimming upstream?  Should I just give up and require users 
to incorporate my schema in the xml config?


Here is a code snippet for anyone willing to dig in a little:

/** Called when each core is initialized; we ensure that lux fields are configured. */

public void inform(SolrCore core) {
    IndexSchema schema = core.getSchema();
    Map<String,SchemaField> fields = schema.getFields();
    if (fields.containsKey("lux_path")) {
        return;
    }
    Map<String,FieldType> fieldTypes = schema.getFieldTypes();
    FieldType luxTextWs = fieldTypes.get("lux_text_ws");
    if (luxTextWs == null) {
        luxTextWs = new TextField();
        luxTextWs.setAnalyzer(new WhitespaceGapAnalyzer());
        luxTextWs.setQueryAnalyzer(new WhitespaceGapAnalyzer());
        fieldTypes.put("lux_text_ws", luxTextWs);
    }
    fields.put("lux_path", new SchemaField("lux_path", luxTextWs, 0x233, ""));
        // 0x233 = INDEXED | TOKENIZED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_elt_name", new SchemaField("lux_elt_name", new StrField(), 0x231, ""));
        // 0x231 = INDEXED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_att_name", new SchemaField("lux_att_name", new StrField(), 0x231, ""));

    // must call this after making changes to the field map:
    schema.refreshAnalyzers();
}




Re: creating SchemaField and FieldType programmatically

2012-06-02 Thread Mike Sokolov
Oh yes, final followup for the terminally curious; I also had to add 
this little class in order to get analysis turned on for my programmatic 
field:


class PathField extends TextField {

    PathField (IndexSchema schema) {
        setAnalyzer(new WhitespaceGapAnalyzer());
        setQueryAnalyzer(new WhitespaceGapAnalyzer());
    }

    protected Field.Index getFieldIndex(SchemaField field, String internalVal) {
        return Field.Index.ANALYZED;
    }

}

On 6/2/2012 5:48 PM, Mike Sokolov wrote:
OK, never mind - all is well: I had a mismatch between the
schema-declared field and my programmatic field, where I was
overzealous in using OMIT_TF_POSITIONS.


-Mike

On 6/2/2012 5:02 PM, Mike Sokolov wrote:
I'm creating some Solr plugins that index and search documents in a 
special way, and I'd like to make them as easy as possible to 
configure.  Ideally I'd like users to be able to just drop a jar in 
place without having to copy any configuration into schema.xml, 
although I suppose they will have to register the plugins in 
solrconfig.xml.


I tried making my UpdateProcessor core aware and creating 
FieldTypes and SchemaFields in the inform(SolrCore) method.  This was 
a good start, but I'm running into some issues getting the types 
properly initialized.  One of my types, for example, derives from 
TextField, but this seems to require an initialization pass in order 
to get its properties set up properly.  What I'm seeing is that my 
field values aren't being tokenized, even though I specify TOKENIZED 
when I create the SchemaField.  I'm beginning to get the feeling I'm 
doing something not-quite anticipated by the API designers.


My question is: is there a way to go about doing something like this 
that isn't swimming upstream?  Should I just give up and require 
users to incorporate my schema in the xml config?


Here is a code snippet for anyone willing to dig in a little:

/** Called when each core is initialized; we ensure that lux fields are configured. */

public void inform(SolrCore core) {
    IndexSchema schema = core.getSchema();
    Map<String,SchemaField> fields = schema.getFields();
    if (fields.containsKey("lux_path")) {
        return;
    }
    Map<String,FieldType> fieldTypes = schema.getFieldTypes();
    FieldType luxTextWs = fieldTypes.get("lux_text_ws");
    if (luxTextWs == null) {
        luxTextWs = new TextField();
        luxTextWs.setAnalyzer(new WhitespaceGapAnalyzer());
        luxTextWs.setQueryAnalyzer(new WhitespaceGapAnalyzer());
        fieldTypes.put("lux_text_ws", luxTextWs);
    }
    fields.put("lux_path", new SchemaField("lux_path", luxTextWs, 0x233, ""));
        // 0x233 = INDEXED | TOKENIZED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_elt_name", new SchemaField("lux_elt_name", new StrField(), 0x231, ""));
        // 0x231 = INDEXED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_att_name", new SchemaField("lux_att_name", new StrField(), 0x231, ""));

    // must call this after making changes to the field map:
    schema.refreshAnalyzers();
}






Re: Populating 'multivalue' fields (m:1 relationships)

2012-05-11 Thread Mike Sokolov
You can specify a solr field as multi-valued, and then supply multiple 
values for it.  What that really does is concatenate all the values with 
a positional gap between them to prevent phrases and other positional 
queries from traversing the boundary between the distinct values.
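
In SolrJ that just means calling addField repeatedly with the same name; a
minimal sketch using the (hypothetical) field names from the example below:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MultiValuedExample {
    public static void main(String[] args) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "Pk_1");
        doc.addField("name", "Dwight");
        // repeated addField calls become multiple values of a
        // multiValued="true" field - no delimiter needed
        doc.addField("title", "Sales");
        doc.addField("title", "Assistant To The Regional Manager");
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.add(doc);
        server.commit();
        // fq=title:Sales now matches Pk_1, and so does the longer title
    }
}

So no delimiter is needed on the Solr side at all - send each value
separately rather than concatenating them in the stored procedure.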


-Mike

On 05/10/2012 12:22 PM, Klostermeyer, Michael wrote:

I am attempting to index a DB schema that has a many:one relationship.  I
assume I would index this within Solr as a multiValued="true" field, is that
correct?

I am currently populating the Solr index w/ a stored procedure in which each DB record is 
flattened into a single document in Solr.  I would like one of those Solr document 
fields to contain multiple values from the m:1 table (i.e. [fieldName]=1,3,6,8,7).  I then need to 
be able to do a fq=fieldname:3 and return the previous record.

My question is: how do I populate Solr with a multi-valued field for many:1 
relationships?  My first guess would be to concatenate all the values from the 
'many' side into a single DB column in the SP, then pipe that column into a 
multivalue=true Solr field.  The DB side of that will be ugly, but would the 
Solr side index this properly?  If so, what would be the delimiter that would 
allow Solr to index each element of the multivalued field?

[Warning: possible tangent below...but I think this question is relevant.  If 
not, tell me and I'll break it out]

I have gone out of my way to flatten the data within my SP prior to giving it to Solr.  
For my solution stated above, I would have the following data (Title being the many 
side of the m:1, and PK being the Solr unique ID):

PK | Name | Title
Pk_1 | Dwight | Sales, Assistant To The Regional Manager
Pk_2 | Jim | Sales
Pk_3 | Michael | Regional Manager

Below is an example of a non-flattened record set.  How would Solr handle a 
data set in which the following data was indexed:

PK | Name | Title
Pk_1 | Dwight | Sales
Pk_1 | Dwight | Assistant To The Regional Manager
Pk_2 | Jim | Sales
Pk_3 | Michael | Regional Manager

My assumption is that the second Pk_1 record would overwrite the first, thereby losing 
the Sales title from Pk_1.  Am I correct on that assumption?

I'm new to this ballgame, so don't be shy about pointing me down a different 
path if I am doing anything incorrectly.

Thanks!

Mike Klostermeyer



Re: StreamingUpdateSolrServer - exceptions not propagated

2012-03-27 Thread Mike Sokolov

On 3/27/2012 11:14 AM, Mark Miller wrote:

On Mar 27, 2012, at 10:51 AM, Shawn Heisey wrote:


On 3/26/2012 6:43 PM, Mark Miller wrote:

It doesn't get thrown because that logic needs to continue - you don't 
necessarily want one bad document to stop all the following documents from 
being added. So the exception is sent to that method with the idea that you can 
override and do what you would like. I've written sample code around stopping 
and throwing an exception, but I guess its not totally trivial. Other ideas for 
reporting errors have been thrown around in the past, but no work on it has 
gotten any traction.

It looks like StreamingUpdateSolrServer is not meant for situations where strict error 
checking is required.  I think the documentation should reflect that.  Would you be 
opposed to a javadoc update at the class level (plus a wiki addition) like the following? 
Because document inserts are handled as background tasks, exceptions and errors 
that occur during those operations will not be available to the calling program, but they 
will be logged.  For example, if the Solr server is down, your program must determine 
this on its own.  If you need strict error handling, use CommonsHttpSolrServer.  If 
my wording is bad, feel free to make suggestions.

It might make sense to accumulate the errors in a fixed-size queue and
report them either when the queue fills up or when the client commits
(assuming the commit will wait for all outstanding inserts to complete
or fail).  This is what we do client-side when performing multi-threaded
inserts.  Sounds great in theory, I think, but then I haven't delved
into SUSS at all ... just a suggestion, take it or leave it.  Actually I
wonder whether SUSS is necessary if you do the threading client-side?
You might get a similar perf gain - I know we see a substantial speedup
that way - because then your updates spawn multiple threads in the
server anyway, don't they?


- Mike


Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
If your ranges are always contiguous, you could index two fields: 
range-start and range-end and then perform queries like:


range-start:[* TO 30] AND range-end:[5 TO *]

If you have multiple ranges which could have gaps in between then you 
need something more complicated :)
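
A filter-query version of that overlap test, as a sketch (field names as
proposed above):

import org.apache.solr.client.solrj.SolrQuery;

public class RangeOverlapQuery {
    public static void main(String[] args) {
        // an interval [start, end] overlaps the query interval [5, 30]
        // iff it starts no later than 30 and ends no earlier than 5
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("range-start:[* TO 30] AND range-end:[5 TO *]");
        System.out.println(q); // prints the encoded query parameters
    }
}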


On 02/27/2012 04:09 PM, federico.wachs wrote:

Hi all !

Here's my dreadful case, thank you for helping out! I want to have a
document like this:

<doc>
  ...
  <arr name="occupiedDays">   <!-- multivalued range field -->
    <range>1 TO 10</range>
    <range>5 TO 15</range>
  </arr>
  ...
</doc>
And the reason why I want to do this is because it's so much lighter than
having all the numbers in there, of course. Just to be clear, I want to
avoid having this in solr:

<doc>
  ...
  <arr name="occupiedDays">   <!-- multivalued range field -->
    <str>1</str>
    <str>2</str>
    <str>3</str>
    <str>4</str>
    <str>5</str>
    <str>6</str>
    <str>7</str>
    <str>8</str>
    <str>9</str>
    <str>10</str>
  </arr>
  ...
</doc>
And then perform range queries on this range field like: fq=-occupiedDays:[5
TO 30]

Anybody has any idea? I have asked and searched all over the internet and
it seems solr does not support this.

Any help would be really appreciated! Thanks in advance.

Federico



Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov

I think your example case would end up like this:

<doc>
  ...
  <str name="start-range">1</str>   <!-- single-valued range field -->
  <str name="end-range">15</str>
  ...
</doc>



On 02/27/2012 04:26 PM, federico.wachs wrote:

Michael thanks a lot for your quick answer, but I'm not exactly sure I
understand your solution. How would the document you are proposing look?
Do you mind showing me a simple xml as an example?

Again, thank you for your cooperation. And yes, the ranges are contiguous!



Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov

No; contiguous means there are no gaps between them.

You need something like what you described initially.

Another approach is to de-normalize your data so that you have a single 
document for every range.  But this might or might not suit your 
application.  You haven't said anything about the context in which this 
is to be used.


-Mike

On 02/27/2012 04:43 PM, federico.wachs wrote:

Oh no, I think I understood wrong when you said that my ranges were
contiguous.

I could have ranges like this:

1 TO 15
5 TO 30
50 TO 60

And so on... I'm not sure that what you proposed would work, right?



Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
Yes, I see - I think your best bet is to index every day as a distinct 
value.  Don't worry about having 100's of values.
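
A sketch of what that looks like at indexing time (the id is hypothetical,
the field name is from this thread):

import org.apache.solr.common.SolrInputDocument;

public class OccupiedDaysExample {
    public static void main(String[] args) {
        SolrInputDocument apartment = new SolrInputDocument();
        apartment.addField("id", "apartment-42"); // hypothetical id
        // a booking from day 1 through day 10: one value per occupied day,
        // in a multiValued (ideally tint) field
        for (int day = 1; day <= 10; day++) {
            apartment.addField("occupiedDays", day);
        }
        System.out.println(apartment);
        // at query time, exclude booked apartments with:
        //   fq=-occupiedDays:[5 TO 30]
    }
}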


-Mike

On 02/27/2012 05:11 PM, federico.wachs wrote:

This is used on an apartment booking system, and what I store as solr
documents can be seen as apartments. These apartments can be booked for a
certain amount of days with a check in and a check out date hence the ranges
I was speaking of before.

What I want to do is to filter off the apartments that are booked so my
users won't have a bad user experience while trying to book an apartment
that suits their needs.

Did I make any sense? Please let me know, otherwise I can explain
furthermore.





Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
I don't know if this would help with the OOM condition, but are you using a
tint-type field for this?  That should be more efficient to search than
a regular int or string.


-Mike

On 02/27/2012 05:27 PM, federico.wachs wrote:

Yeah that's what I'm doing right now.
But whenever I try to index an apartment that has many wide ranges, my
master solr server throws OutOfMemoryError (I have set max heap to 1024m).
So I thought this could be a good workaround, but puf - it is a lot harder
than it seems!




Re: org.apache.pdfbox.pdmodel.PDPage Error

2011-10-25 Thread Mike Sokolov

On 10/24/2011 02:35 PM, MBD wrote:

Is this really a stumper? This is my first experience with Solr and having spent
only an hour or so with it I hit this barrier (below). I'm sure *I* am doing
something completely wrong - just hoping someone more familiar with the platform
can help me identify & fix it.

For starters... what does "Could not initialize class ..." mean in Java, exactly?
Maybe that the class (ie code) itself doesn't exist? - so perhaps I haven't
downloaded all the pieces of the project? Or could it be a hint that my kit is
just not configured correctly? Sorry, I'm not a Java expert... but would like to
get this stabilized... if possible.

Yeah - that's the problem. It looks like the pdfbox jar is not installed in
a place where Solr can find it (on its classpath).

If this is the wrong mailing list then just tell me and I'll go away...

Thanks!

On Oct 20, 2011, at 2:54 PM, MBD wrote:

   


Re: Index not getting refreshed

2011-09-15 Thread Mike Sokolov
Is it possible you have two solr instances running off the same index 
folder?  This was a mistake I stumbled into early on - I was writing 
with one, and reading with the other, so I didn't see updates.


-Mike

On 09/15/2011 12:37 AM, Pawan Darira wrote:

I am committing but not doing replication now. My sort order also includes
last login timestamp. The new profiles are being reflected in my SOLR admin
& db, but they're not listed on my website.

On Thu, Sep 15, 2011 at 4:25 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


: I am using Solr 3.2 on a live website. i get live user's data of about 2000
: per day. I do an incremental index every 8 hours. but my search results
: always show the same result with same sorting order. when i check the same

Are you committing?

Are you using replication?

Are you using a sort order that might not make it obvious that the new
docs are actually there? (ie: sort=timestamp asc)


-Hoss



Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Mike Sokolov
Although you weren't very clear about it, it sounds as if you want the 
results to be sorted by a name that actually matched the query?  In 
general that is not going to be easy, since it is not something that can 
be computed in advance and thus indexed.



-Mike

On 08/03/2011 10:39 AM, Olson, Ron wrote:

Hi all-

Well, this is a problem. I have a list of names as a multi-valued field and I 
am searching on this field and need to return the results sorted. I know from 
searching and reading the documentation (and getting the error) that sorting on 
a multi-valued field isn't possible. Okay, so, what I haven't found is any real 
good solution/workaround to the problem. I was wondering what strategies others 
have done to overcome this particular situation; collapsing the individual 
names into a single field with copyField doesn't work because the name searched 
may not be the first name in the field.

Thanks for any hints/tips/tricks.

Ron



Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread Mike Sokolov

You have a few choices:

1) flatten your field structure - like your undesirable example, but 
wouldn't you want to have the document identifier as a field value also?


2) use phrase queries to make sure the key/value pairs are adjacent (see the sketch below)

3) use a join query

That's all I can think of

-Mike
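
To expand on option 2: Solr separates the values of a multivalued field
with the positionIncrementGap (often 100), so a sloppy phrase query whose
slop is smaller than that gap cannot match across two different friend
entries.  A sketch, assuming a gap of 100 and junk between attributes
shorter than the slop:

import org.apache.solr.client.solrj.SolrQuery;

public class PerElementMatchSketch {
    public static void main(String[] args) {
        // the ~50 slop lets the two attribute tokens sit far apart within
        // one value, but is too small to jump the gap=100 between values
        SolrQuery q = new SolrQuery("myFriends:\"isCool=true gender=male\"~50");
        System.out.println(q);
    }
}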

On 08/01/2011 08:08 PM, Suk-Hyun Cho wrote:

I'm sure someone asked this before, but I couldn't find a previous post
regarding this.


The problem:


Let's say that I have a multivalued field called myFriends that tokenizes on
whitespaces. Basically, I'm treating it like a List of Lists (attributes of
friends):


Document A:

myFriends = [
 isCool=true SOME_JUNK_HERE gender=male bloodType=A
]

Document B:

myFriends = [
 isCool=true SOME_JUNK_HERE gender=female bloodType=O,
 isCool=false SOME_JUNK_HERE gender=male bloodType=AB
]

Now, let's say that I want to search for all the cool male friends I have.
Naively, I can query q=myFriends:isCool=true+AND+myFriends:gender=male.
However, this returns documents A and B, because the two criteria are tested
against the entire collection, rather than against individual elements.


I could work around this by not tokenizing on whitespaces and using
wildcards:


q=myFriends:isCool=true\ *\ gender=male


but this becomes painful when the query becomes more complex. What if I
wanted to find cool friends who are either type A or type O? I could do
q=myFriends:(isCool=true\ *\ bloodType=A+OR+isCool=true\ *\ bloodType=O).
And you can see that the number of criteria will just explode as queries get
more complex.


There are other methods that I've considered, such as duplicating documents
for every friend, like so:


Document A1:

myFriend = [
 isCool=true,
 gender=male,
 bloodType=A
]

Document B1:

myFriend = [
 isCool=true,
 gender=female,
 bloodType=O
]

Document B2:

myFriend = [
 isCool=false,
 gender=male,
 bloodType=AB
]

but this would be less than desirable.

I would like to hear any other ideas around solving this problem, but going
back to the original question, is there a way to match multiple criteria on
a per-item basis rather than against the entire multifield?



ideas for versioning query?

2011-08-01 Thread Mike Sokolov
A customer has an interesting problem: some documents will have multiple 
versions. In search results, only the most recent version of a given 
document should be shown. The trick is that each user has access to a 
different set of document versions, and each user should see only the 
most recent version of a document that they have access to.


Is this something that can reasonably be solved with grouping?  In 3.x? 
I haven't followed the grouping discussions closely: would someone point 
me in the right direction please?


--
Michael Sokolov
Engineering Director
www.ifactory.com



Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
Thanks, Tomas.  Yes, we are planning to keep a "current" flag in the most
current document.  But there are cases where, for a given user, the most
current document is not that one, because they only have access to some
older documents.


I took a look at http://wiki.apache.org/solr/FieldCollapsing and it 
seems as if it will do what we need here.  My one concern is that it 
might not be efficient at computing group.ngroups for a very large 
number of groups, which we would ideally want.  Is that something I 
should be worried about?
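
For the record, the grouped request we have in mind would look something
like this (the doc_id/version names and the ACL filter are ours, i.e.
hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

public class LatestVersionQuery {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("the user's query");
        q.addFilterQuery("acl:user42");      // versions this user may see
        q.set("group", true);
        q.set("group.field", "doc_id");      // one group per logical document
        q.set("group.sort", "version desc"); // newest visible version first
        q.set("group.limit", 1);             // keep only that newest version
        q.set("group.ngroups", true);        // total group count - the costly part
        System.out.println(q);
    }
}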


-Mike

On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote:

Hi Michael, I guess this could be solved using grouping as you said.
Documents inside a group can be sorted on a field (in your case, the version
field, see parameter group.sort), and you can show only the first one. It
will be more complex to show facets (post grouping faceting is work in
progress but still not committed to the trunk).

It would be easier from the Solr side if you could do something at index
time, like indicating which document is the "current" one and which one is
an old one (you would need to update the old document whenever a new version
is indexed).

Regards,

Tomás

On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolovsoko...@ifactory.com  wrote:


A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
different set of document versions, and each user should see only the most
recent version of a document that they have access to.

Is this something that can reasonably be solved with grouping?  In 3.x? I
haven't followed the grouping discussions closely: would someone point me in
the right direction please?

--
Michael Sokolov
Engineering Director
www.ifactory.com




Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
I think a 30% increase is acceptable. Yes, I think we'll try it.  
Although our case is more like # groups ~  # documents / N, where N is a 
smallish number (~1-5?).  We are planning for a variety of different 
index sizes, but aiming for a sweet spot around a few M docs.


-Mike

On 08/01/2011 11:00 AM, Martijn v Groningen wrote:

Hi Mike, how many docs and groups do you have in your index?
I think the group.sort option fits your requirements.

If I remember correctly group.ngroup=true adds something like 30% extra time
on top of the search request with grouping,
but that was on my local test dataset (~30M docs, ~8000 groups)  and my
machine. You might encounter different search times when setting
group.ngroup=true.

Martijn

2011/8/1 Mike Sokolovsoko...@ifactory.com


Thanks, Tomas.  Yes we are planning to keep a current flag in the most
current document.  But there are cases where, for a given user, the most
current document is not that one, because they only have access to some
older documents.

I took a look at http://wiki.apache.org/solr/FieldCollapsing and
it seems as if it will do what we need here.  My one concern is that it
might not be efficient at computing group.ngroups for a very large number of
groups, which we would ideally want.  Is that something I should be worried
about?

-Mike


On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote:

 
Hi Michael, I guess this could be solved using grouping as you said.
Documents inside a group can be sorted on a field (in your case, the
version
field, see parameter group.sort), and you can show only the first one. It
will be more complex to show facets (post grouping faceting is work in
progress but still not committed to the trunk).

It would be easier from the Solr side if you could do something at index
time, like indicating which document is the current one and which one is
an old one (you would need to update the old document whenever a new
version
is indexed).

Regards,

Tomás

On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolovsoko...@ifactory.com
  wrote:




A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
different set of document versions, and each user should see only the
most
recent version of a document that they have access to.

Is this something that can reasonably be solved with grouping?  In 3.x? I
haven't followed the grouping discussions closely: would someone point me
in
the right direction please?

--
Michael Sokolov
Engineering Director
www.ifactory.com




Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Mike Sokolov
If you want to avoid re-indexing, you could consider building a synonym
file generated from your rule set, and then using that to expand your
queries.  You'd need to get a list of all the terms in your index and then
process them to generate synonyms.  Actually, I don't know how to get a
list of all the terms without Java programming, though: is there a way?
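
In case it helps anyone, the Java programming involved is small; a sketch
against the Lucene 3.x API (index path and field name are placeholders):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class DumpTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("data/index")));
        // position the enumeration at the first term of the field
        TermEnum terms = reader.terms(new Term("myfield", ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals("myfield")) break;
                System.out.println(t.text()); // feed these to the synonym generator
            } while (terms.next());
        } finally {
            terms.close();
            reader.close();
        }
    }
}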


-Mike

On 08/01/2011 12:35 PM, thomas wrote:

Thanks Alexei,
Thanks Paul,

I played with the solr.PhoneticFilterFactory. Analysing my query in the solr
admin backend showed me how, and that, it is working. My major problem is
that this filter needs to be applied to the index chain as well as to the
query chain to generate matches for our search. We have a huge index at this
point and I'm not really happy to reindex all content.

Is there maybe a more subtle solution which works by just manipulating
the query chain only?

Otherwise I need to backup the whole index and try to reindex overnight when
cms users are sleeping.

I will have a look into the ColognePhonetic encoder. I'm just afraid I'll
have to reindex the whole content there as well.

Thomas



Re: slow highlighting because of stemming

2011-07-29 Thread Mike Sokolov

I'm not sure I would identify stemming as the culprit here.

Do you have very large documents?  If so, there is a patch for FVH 
committed to limit the number of phrases it looks at; see 
hl.phraseLimit, but this won't be available until 3.4 is released.


You can also limit the amount of each document that is analyzed by the 
regular Highlighter using maxDocCharsToAnalyze (and maybe this applies 
to FVH? not sure)
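
On the request side that would look something like this (parameter values
are arbitrary; hl.maxAnalyzedChars is the Solr parameter behind
maxDocCharsToAnalyze):

import org.apache.solr.client.solrj.SolrQuery;

public class HighlightLimits {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("dokumentum_syn_query:valami"); // hypothetical query
        q.setHighlight(true);
        // regular Highlighter: cap how much of each document is re-analyzed
        q.set("hl.maxAnalyzedChars", 51200);
        // FastVectorHighlighter, once 3.4 is out: cap the phrases examined
        q.set("hl.phraseLimit", 5000);
        System.out.println(q);
    }
}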


Using RegexFragmenter is also probably slower than something like 
SimpleFragmenter.


There is work to implement faster highlighting for Solr/Lucene, but it 
depends on some basic changes to the search architecture so it might be 
a while before that becomes available.  See 
https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested 
in following that development.


-Mike

On 07/29/2011 04:55 AM, Orosz György wrote:

Dear all,

I am quite new to using Solr, but would like to ask for your help.
I am developing an application which should be able to highlight the results
of a query. For this I am using the regex fragmenter:
<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.pre"><![CDATA[<b>]]></str>
      <str name="hl.post"><![CDATA[</b>]]></str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,300}[.?!]</str>
      <str name="hl.fl">dokumentum_syn_query</str>
    </lst>
  </fragmenter>
</highlighting>
The field is indexed with term vectors and offsets:
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true" termVectors="on" termPositions="on"
       termOffsets="on"/>

<fieldType name="huntext_syn" class="solr.TextField" stored="true"
           indexed="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true" />
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true" />
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The highlighting works well, except that it is really slow. I realized that
this is because the highlighter/fragmenter does stemming for all the result
documents again.

Could you please help me understand why this happens and how I can avoid it?
(I thought that using the FastVectorHighlighter would solve my problem, but
it didn't)

Thanks in advance!
Gyuri Orosz

   


Re: strip html from data

2011-07-25 Thread Mike Sokolov
I think you need to list the charFilter earlier in the analysis chain,
before the tokenizer.  Probably Solr should tell you this...


-Mike

On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:

sounds logical. I just changed it to the following, restarted and reindexed
with commit:

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
</fieldType>

Unfortunately that did not fix the error. There are still <h3> tags inside
the data. Although I believe there are fewer than before, I can not prove
that. Fact is, there are still html tags inside the data.

Any other ideas what the problem could be?





2011/7/25 Markus Jelsmamarkus.jel...@openindex.io

   

You've three analyzer elements, I wonder what that would do. You need to
add the char filter to the index-time analyzer.

On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:

Hi there,

I am trying to strip html tags from the data before adding the documents
to the index. To do that I altered schema.xml like this:

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fields>
  <field name="text" type="text" indexed="true" stored="true"
         required="false"/>
</fields>

Unfortunately this does not work, the html tags like <h3> are still
present after restarting and reindexing. I also tried
htmlstriptransformer, but this did not work either.

Has anybody an idea how to get this done? Thank you in advance for any
hint.

Merlin

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

 
   


Re: strip html from data

2011-07-25 Thread Mike Sokolov
Hmm - I'm not sure about that; see 
https://issues.apache.org/jira/browse/SOLR-2119


On 07/25/2011 12:01 PM, Markus Jelsma wrote:

charFilters are executed first regardless of their position in the analyzer.

On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:

I think you need to list the charFilter earlier in the analysis chain,
before the tokenizer.  Probably Solr should tell you this...

-Mike

On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:

sounds logical. I just changed it to the following, restarted and reindexed
with commit:

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
</fieldType>

Unfortunately that did not fix the error. There are still <h3> tags inside
the data. Although I believe there are fewer than before, I can not prove
that. Fact is, there are still html tags inside the data.

Any other ideas what the problem could be?





2011/7/25 Markus Jelsmamarkus.jel...@openindex.io


You've three analyzer elements, i wonder what that would do. You need to
add
the char filter to the index-time analyzer.

On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:

Hi there,

I am trying to strip html tags from the data before adding the documents
to the index. To do that I altered schema.xml like this:

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fields>
  <field name="text" type="text" indexed="true" stored="true"
         required="false"/>
</fields>

Unfortunately this does not work, the html tags like <h3> are still
present after restarting and reindexing. I also tried
htmlstriptransformer, but this did not work either.

Has anybody an idea how to get this done? Thank you in advance for any
hint.

Merlin

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: strip html from data

2011-07-25 Thread Mike Sokolov

Hmm that looks like it's working fine.  I stand corrected.


On 07/25/2011 12:24 PM, Markus Jelsma wrote:

I've seen that issue too and read comments on the list, yet I've never had
trouble with the order; don't know what's going on. Check this analyzer, I've
moved the charFilter to the bottom:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="false" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="false"
          words="stopwords.txt"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"
          language="Dutch"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>

The analysis chain still does its job as I expect for the input:
<span>bla bla</span>

Index Analyzer
org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
  text         bla bla
org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
  position     1     2
  term text    bla   bla
  startOffset  6     10
  endOffset    9     13
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
generateWordParts=1, catenateAll=0, catenateNumbers=1}
  position     1     2
  term text    bla   bla
  startOffset  6     10
  endOffset    9     13
  type         word  word
org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
  position     1     2
  term text    bla   bla
  startOffset  6     10
  endOffset    9     13
  type         word  word
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
  position     1     2
  term text    bla   bla
  type         word  word
  startOffset  6     10
  endOffset    9     13
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=false, luceneMatchVersion=LUCENE_34}
  position     1     2
  term text    bla   bla
  type         word  word
  startOffset  6     10
  endOffset    9     13
org.apache.solr.analysis.ASCIIFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
  position     1     2
  term text    bla   bla
  type         word  word
  startOffset  6     10
  endOffset    9     13
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt,
language=Dutch, luceneMatchVersion=LUCENE_34}
  position     1     2
  term text    bla   bla
  keyword      false false
  type         word  word
  startOffset  6     10
  endOffset    9     13
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
{luceneMatchVersion=LUCENE_34}
  position     1     2
  term text    bla   bla
  keyword      false false
  type         word  word
  startOffset  6     10
  endOffset    9     13


On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:

Hmm - I'm not sure about that; see
https://issues.apache.org/jira/browse/SOLR-2119

On 07/25/2011 12:01 PM, Markus Jelsma wrote:

charFilters are executed first regardless of their position in the
analyzer.

On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:

I think you need to list the charFilter earlier in the analysis chain,
before the tokenizer.  Probably Solr should tell you this...

-Mike

On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:

sounds logical. I just changed it to the following, restarted and reindexed
with commit:

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"/>

Re: How do I specify a different analyzer at search-time?

2011-07-11 Thread Mike Sokolov
There is a syntax that allows you to specify different analyzers to use
for indexing and querying, in schema.xml.  But if you don't do that, it
should use the same analyzer in both cases.


-Mike

On 07/11/2011 10:58 AM, Gabriele Kahlout wrote:

With a lucene QueryParser instance it's possible to set the analyzer in use.
I suspect Solr doesn't use the same analyzer it used at indexing, defined in
schema.xml, but I cannot verify that without the queryparser instance.
From Jan's diagram it seems this is set in the SearchHandler's init. Is it?
How?

On Sun, Apr 10, 2011 at 11:05 AM, Jan Høydahljan@cominvent.com  wrote:

   

Looks really good, but two bits that I think might confuse people are
the implications that a Query Parser then invokes a series of search
components; and that analysis (and the pieces of an analyzer chain)
are what do lookups in the underlying lucene index.

the first might just be the ambiguity of "Query" .. using the term
"request parser" might make more sense, in comparison to the update
parsing from the other side of the diagram.

Thanks for commenting.

Yea, the purpose is more to show a conceptual rather than actual relation
between the different components, focusing on the flow. A 100% technical
correct diagram would be too complex for beginners to comprehend,
although it could certainly be useful for developers.

I've removed the arrow between QueryParser and search components to
clarify.
The boxes first and foremost show that query parsing and response writers
are within the realm of search request handler.

 

the analysis piece is a little harder to fix cleanly.  you really want the
end of the analysis chain to feed back up to the search components, and
then show it (most of the search components really) talking to the Lucene
index.

Yea, I know. Showing how Faceting communicate with the main index and
spellchecker with its spellchecker index could also be useful, but I think
that would be for another more detailed diagram.

I felt it was more important for beginners to realize visually that
analysis happens both at index and search time, and that the analyzers
align 1:1. At this stage in the digram I often explain the importance
of matching up the analysis on both sides to get a match in the index.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


 


   


Re: How do I add a custom field?

2011-07-07 Thread Mike Sokolov

Did you ever commit?
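
(For reference, a sketch of the step that is usually missing - writer and
doc set up as in the code quoted below:)

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateNeedsCommit {
    static void update(IndexWriter writer, Document doc, int i) throws Exception {
        writer.updateDocument(new Term("id", Integer.toString(i)), doc);
        // buffered changes stay invisible until a commit (or close),
        // and the IndexReader must then be reopened to see them
        writer.commit();
    }
}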

On 07/07/2011 01:58 PM, Gabriele Kahlout wrote:

so, how about this:

    Document doc = searcher.doc(i);  // get the doc
    doc.removeField("wc");           // remove the field in case there's one
    addWc(doc, docLength);           // add the new field
    writer.updateDocument(new Term("id", Integer.toString(i++)), doc);  // update the doc

For some reason it doesn't get added to the index. Should it?

On 7/3/11, Michael Sokolovsoko...@ifactory.com  wrote:
   

You'll need to index the field.  I would think you would want to
index/store the field along with the associated document, in which case
you'll have to reindex the documents as well - there's no single-field
update capability in Lucene (yet?).

-Mike

On 7/3/2011 1:09 PM, Gabriele Kahlout wrote:
 

Is there how I can compute and add the field to all indexed documents
without re-indexing? MyField counts the number of terms per document
(unique
word count).

On Sun, Jul 3, 2011 at 12:24 PM, lee carroll
lee.a.carr...@googlemail.comwrote:

   

Hi Gabriele,
Did you index any docs with your new field ?

The results will just bring back docs and what fields they have. They
won't
bring back null fields just because they are in your schema. Lucene
is schema-less.
Solr adds the schema to make it nice to administer and very powerful to
use.





On 3 July 2011 11:01, Gabriele Kahloutgabri...@mysimpatico.com   wrote:
 

Hello,

I want to have an additional  field that appears for every document in
search results. I understand that I should do this by adding the field
to
the schema.xml, so I add:
 field name=myField default=0 type=integer stored=true
indexed=false/
Then I restart Solr (so that I loads the new schema.xml) and make a
query
specifying that it should return myField too, but it doesn't. Will it do
only for newly indexed documents? Am I missing something?

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
   

time(x)
 

   Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the
   

email
 

does not contain a valid code then the email is not received. A valid
   

code
 

starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

   


   


 


   


Re: TermVectors and custom queries

2011-07-01 Thread Mike Sokolov
Yes, that's right.  But at the moment the HL code basically has to 
reconstruct and re-run your query - it doesn't have any special 
knowledge.  There's some work going on to try and fix that, but it seems 
like it's going to require some fairly major deep re-plumbing.


-Mike

On 07/01/2011 07:54 AM, Jamie Johnson wrote:

How would I know which ones were the ones I wanted?  I don't see how,
from a query, I could match up the term vectors that met the query.
Seems like what needs to be done is to have the highlighting on the solr
end, where you have more access to the information I'm looking for.
Sound about right?

On Fri, Jul 1, 2011 at 7:26 AM, Michael Sokolovsoko...@ifactory.com  wrote:
   

I think that's all you can do, although there is a callback-style interface
that might save some time (or space).  You still need to iterate over all of
the vectors, at least until you get the one you want.

-Mike
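
(A sketch of that callback approach, using Lucene 3.x's TermVectorMapper; 
the "wanted" term set is something you would have to build from your own 
query:)

import java.util.Set;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

// Collect positions/offsets only for the terms we care about
class MatchingTermsMapper extends TermVectorMapper {
    private final Set<String> wanted;
    MatchingTermsMapper(Set<String> wanted) { this.wanted = wanted; }

    @Override
    public void setExpectations(String field, int numTerms,
            boolean storeOffsets, boolean storePositions) {
        // no-op: we only care about individual terms
    }

    @Override
    public void map(String term, int frequency,
            TermVectorOffsetInfo[] offsets, int[] positions) {
        if (wanted.contains(term)) {
            // record the term, its offsets and positions here
        }
    }
}

// usage: reader.getTermFreqVector(docId, "content", new MatchingTermsMapper(wanted));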

On 6/30/2011 4:53 PM, Jamie Johnson wrote:
 

Perhaps a better question, is this possible?

On Mon, Jun 27, 2011 at 5:15 PM, Jamie Johnsonjej2...@gmail.comwrote:
   

I have a field named content with the following definition

<field name="content" type="text" indexed="true" stored="true"
    multiValued="true" termVectors="true" termPositions="true"
    termOffsets="true"/>

I'm now trying to execute a query against content and get back the term
vectors for the pieces that matched my query, but I must be messing
something up.  My query is as follows:


http://localhost:8983/solr/select/?qt=tvrh&q=content:test&fl=content&tv.all=true

where the word test is in my content field.  When I get information back
though I am getting the term vectors for all of the tokens in that field.
How do I get back just the ones that match my search?

 


 


Re: Looking for Custom Highlighting guidance

2011-06-30 Thread Mike Sokolov
It's going to be a bit complicated, but I would start by looking at 
providing a facility for merging an array of FieldTermStacks. The 
constructor for FieldTermStack() takes a fieldName and builds up a list 
of TermInfos (terms with positions and offsets): I *think* that if you 
make two of these, merge them, and hand that to the FieldPhraseList 
constructor (this is done in the main FVH class), you should get what 
you want.  This is a bit speculative; I haven't tried it.


-Mike

On 06/30/2011 08:26 AM, Jamie Johnson wrote:

Thanks for the suggestion Mike, I will give that a shot.  Having no
familiarity with FastVectorHighlighter is there somewhere specific I
should be looking?

On Wed, Jun 29, 2011 at 3:20 PM, Mike Sokolovsoko...@ifactory.com  wrote:
   

Does the phonetic analysis preserve the offsets of the original text field?

If so, you should probably be able to hack up FastVectorHighlighter to do what 
you want.

-Mike

On 06/29/2011 02:22 PM, Jamie Johnson wrote:
 

I have a schema with a text field and a text_phonetic field and would like
to perform highlighting on them in such a way that the tokens that match are
combined.  What would be a reasonable way to accomplish this?


   


Re: Text field case sensitivity problem

2011-06-30 Thread Mike Sokolov
Yes, after posting that response, I read some more and came to the same 
conclusion... there seems to be some interest on the dev list in 
building a capability to specify an analysis chain for use with wildcard 
and related queries, but it doesn't exist now.


-Mike

On 06/30/2011 10:34 AM, Jamie Johnson wrote:

I think my answer is here...

"On wildcard and fuzzy searches, no text analysis is performed on the
search word."

taken from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers


On Thu, Jun 30, 2011 at 10:23 AM, Jamie Johnsonjej2...@gmail.com  wrote:
   

I'm not familiar with the CharFilters, I'll look into those now.

Is the solr.LowerCaseFilterFactory not handling wildcards the expected
behavior, or is this a bug?

On Wed, Jun 15, 2011 at 4:34 PM, Mike Sokolovsoko...@ifactory.com  wrote:
 

I wonder whether CharFilters are applied to wildcard terms?  I suspect they
might be.  If that's the case, you could use the MappingCharFilter to
perform lowercasing (and strip diacritics too if you want that)

-Mike
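
If they are, something along these lines would lowercase the input before 
tokenizing (a sketch - the type name and mapping file are made up):

<fieldType name="text_wildcard" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-lowercase.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

where mapping-lowercase.txt holds one rule per character, e.g.:

"A" => "a"
"B" => "b"
(and so on through the alphabet, plus any diacritic foldings you want)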

On 06/15/2011 10:12 AM, Jamie Johnson wrote:

So simply lower casing the query works but can get complex.  The query that I'm
executing may have things like ranges which require some words to be upper
case (i.e. TO).  I think this would be much better solved on Solr's end, is
there a JIRA about this?

On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolovsoko...@ifactory.com  wrote:
   

oops, please s/Highlight/Wildcard/

On 06/14/2011 05:31 PM, Mike Sokolov wrote:
 

Wildcard queries aren't analyzed, I think?  I'm not completely sure what
the best workaround is here: perhaps simply lowercasing the query terms
yourself in the application.  Also - I hope someone more knowledgeable will
say that the new HighlightQuery in trunk doesn't have this restriction, but
I'm not sure about that.

-Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:
   

Also of interest to me is this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonjej2...@gmail.com
  wrote:

 

I am using the following for my text field:

<fieldType name="text" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
        synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I have a field defined as
<field name="Person_Name" type="text" stored="true" indexed="true" />

When I go to the following URL I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?

   


   
 


Re: Text field case sensitivity problem

2011-06-30 Thread Mike Sokolov

Yes, and this too: https://issues.apache.org/jira/browse/SOLR-219

On 06/30/2011 12:46 PM, Erik Hatcher wrote:

Jamie - there is a JIRA about this, at least 
one:https://issues.apache.org/jira/browse/SOLR-218

Erik

On Jun 15, 2011, at 10:12 , Jamie Johnson wrote:

   

So simply lower casing the query works but can get complex.  The query that I'm
executing may have things like ranges which require some words to be upper
case (i.e. TO).  I think this would be much better solved on Solr's end, is
there a JIRA about this?

On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolovsoko...@ifactory.com  wrote:

 

oops, please s/Highlight/Wildcard/


On 06/14/2011 05:31 PM, Mike Sokolov wrote:

   

Wildcard queries aren't analyzed, I think?  I'm not completely sure what
the best workaround is here: perhaps simply lowercasing the query terms
yourself in the application.  Also - I hope someone more knowledgeable will
say that the new HighlightQuery in trunk doesn't have this restriction, but
I'm not sure about that.

-Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:

 

Also of interest to me is this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonjej2...@gmail.com
wrote:

I am using the following for my text field:
   

<fieldType name="text" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
        synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I have a field defined as
<field name="Person_Name" type="text" stored="true" indexed="true" />

When I go to the following URL I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?


 
   


Re: Looking for Custom Highlighting guidance

2011-06-29 Thread Mike Sokolov

Does the phonetic analysis preserve the offsets of the original text field?

If so, you should probably be able to hack up FastVectorHighlighter to 
do what you want.


-Mike

On 06/29/2011 02:22 PM, Jamie Johnson wrote:

I have a schema with a text field and a text_phonetic field and would like
to perform highlighting on them in such a way that the tokens that match are
combined.  What would be a reasonable way to accomplish this?

   


Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov

Actually - you are both wrong!

It is true that 0xffff is a valid UTF8 character, and not a valid UTF8 
byte sequence.

But the parser is reporting (or trying to) that 0xffff is an invalid XML 
character.


And Robert - if the wording offends you, you might want to send a note 
to Tatu (http://jira.codehaus.org/) suggesting that he alter the wording 
of the error message :)


-Mike

On 06/27/2011 09:01 AM, Bernd Fehling wrote:



Am 27.06.2011 14:48, schrieb Robert Muir:

On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de  wrote:



correct!!!



but what i said, is totally different than what you said.

you are still wrong.


http://www.unicode.org/faq//utf_bom.html

see Q: What is a UTF?



Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
OK - re-reading your message it seems maybe that is what you were trying 
to say too, Robert.  FWIW I agree with you that XML is rigid, sometimes 
for purely arbitrary reasons.  But nobody has really helped Markus here 
- unfortunately, there is no easy way out of this mess.  What I do to 
handle issues like this is to wrap the stream I'm handing to the parser 
in some kind of cleanup stream that handles a few yucky issues.  You 
could, eg, just strip out invalid XML characters.  Maybe Nutch should be 
doing this, or at least handling the error better?


-Mike
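
For what it's worth, here is a sketch of such a cleanup stream (my own 
code, nothing shipped with Solr; it replaces characters that are invalid 
in XML 1.0 with spaces, and naively lets surrogates through without 
checking that they are paired):

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

public class XmlCleanupReader extends FilterReader {
    public XmlCleanupReader(Reader in) { super(in); }

    // XML 1.0 valid ranges, plus surrogate halves so supplementary chars survive
    private static boolean valid(int ch) {
        return ch == 0x9 || ch == 0xA || ch == 0xD
                || (ch >= 0x20 && ch <= 0xD7FF)
                || (ch >= 0xD800 && ch <= 0xDFFF)
                || (ch >= 0xE000 && ch <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return (c == -1 || valid(c)) ? c : ' ';
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = in.read(cbuf, off, len);  // loop is skipped when n == -1 (EOF)
        for (int i = off; i < off + n; i++) {
            if (!valid(cbuf[i])) cbuf[i] = ' ';
        }
        return n;
    }
}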

On 06/27/2011 09:19 AM, Mike Sokolov wrote:

Actually - you are both wrong!

It is true that 0xffff is a valid UTF8 character, and not a valid UTF8 
byte sequence.

But the parser is reporting (or trying to) that 0xffff is an invalid 
XML character.


And Robert - if the wording offends you, you might want to send a note 
to Tatu (http://jira.codehaus.org/) suggesting that he alter the 
wording of the error message :)


-Mike

On 06/27/2011 09:01 AM, Bernd Fehling wrote:



Am 27.06.2011 14:48, schrieb Robert Muir:

On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de  wrote:



correct!!!



but what i said, is totally different than what you said.

you are still wrong.


http://www.unicode.org/faq//utf_bom.html

see Q: What is a UTF?



Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
I don't think this is a BOM - that would be 0xfeff.  Anyway the problem 
we usually see w/processing XML with BOMs is in UTF8 (which really 
doesn't need a BOM since it's a byte stream anyway), in which if you 
transform the stream (bytes) into a reader (chars) before the xml parser 
can see it, the parser treats the BOM as white space.  But in that case 
you typically get a more specific error about invalid characters in the 
XML prolog, not just a random invalid character error.


-Mike

On 06/27/2011 10:33 AM, lee carroll wrote:

Hi Markus

I've seen a similar issue before (but not with solr) when processing files as xml.
In our case the problem was due to processing a utf16 file with a byte
order mark. This presents itself as
0xffff to the xml parser, which is not used by utf8 (the bom unicode
would be represented as efbfbf in utf8). This caused the utf8
aware parser to choke.

I don't want to get involved in any unicode / utf war as I'm confused
enough as it stands but
could you check for utf16 files before processing ?

lee c

On 27 June 2011 14:26, Thomas Fischerfischer...@aon.at  wrote:
   

Hello,

Am 27.06.2011 um 12:40 schrieb Markus Jelsma:

 

Hi,

I came across the indexing error below. It happened in a huge batch update
from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to trace
the error back to a specific document. So i try my luck here: anyone seen this
before with SolrJ 3.1? Anything else on the Nutch part i should have taken
care off?

Thanks!


Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 
QTime=423
Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] 
Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
   at 
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
   

and loads of other rubbish and

 

   ... 26 more
   


I see this as a problem of solr error-reporting. This is not only obnoxiously 
loud (white on grey with oversized fonts), but less useful than it should be.
Instead of telling the user where the error occurred (i.e. while reading which 
file, which column at which line) it unravels the stack. This is useless if the 
program just choked on some unexpected input, like a typo in a schema or config 
file or an invalid character in a file to be indexed.
I don't know if this is due to Tomcat or the logging system of solr itself, 
but it is annoying.

And yes, I've seen something like this before and found the error not by 
inspecting solr but by opening the suspected files with an appropriate browser 
(e.g. Firefox) which tells me exactly where something goes wrong.

All the best
Thomas


 


Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
Markus - if you want to make sure not to offend XML parsers, you should 
strip all characters not in this list:


http://en.wikipedia.org/wiki/XML#Valid_characters

You'll see that article talks about XML 1.1, which accepts a wider range 
of characters than XML 1.0, and I believe the Woodstox parser used in 
Solr adheres to that convention.  But note the restriction about control 
characters needing to be encoded - I'm not sure, but it might also be 
best to strip out chars < 32 except for \r, \n and \t.  You definitely 
need to remove \0 also...
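
Putting those two rules together, here is a sketch of the kind of filter 
I mean (my own code; it only handles BMP characters - surrogate pairs 
would need extra care):

public static String stripInvalidXml(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char ch = s.charAt(i);
        // valid in XML 1.0: tab, LF, CR, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD
        boolean validXml = ch == '\t' || ch == '\n' || ch == '\r'
                || (ch >= 0x20 && ch <= 0xD7FF)
                || (ch >= 0xE000 && ch <= 0xFFFD);
        // Unicode noncharacters (0xFFFE/0xFFFF are already outside the ranges above)
        boolean nonChar = ch >= 0xFDD0 && ch <= 0xFDEF;
        if (validXml && !nonChar) {
            sb.append(ch);
        }
    }
    return sb.toString();
}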


On 06/27/2011 11:59 AM, Markus Jelsma wrote:

Of course it doesn't work like this: use AND instead of OR!

On Monday 27 June 2011 17:50:01 Markus Jelsma wrote:
   

Hi all, thanks for your comments. I seem to have fixed it by now by simply
stripping away all non-character codepoints [1] by iterating over the
individual chars and checking them against:

if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch >= 0xfdd0 && ch
<= 0xfdef)) { pass; }
   

Comments?

[1]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

On Monday 27 June 2011 12:40:16 Markus Jelsma wrote:
 

Hi,

I came across the indexing error below. It happened in a huge batch
update from Nutch with SolrJ 3.1. Since the crawl was huge it is very
hard to trace the error back to a specific document. So i try my luck
here: anyone seen this before with SolrJ 3.1? Anything else on the Nutch
part i should have taken care off?

Thanks!


Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2}
status=500 QTime=423 Jun 27, 2011 10:24:28 AM
org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class
java.io.CharConversionException] Invalid UTF-8 character 0xffff at char
#1142033, byte #1155068) at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:
1 8) at
com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:
3 657) at
com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at
org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287) at
org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146) at
org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77) at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Conten
t StreamHandlerBase.java:67) at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa
s e.java:129) at
org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va

:356) at ...
Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute

INFO: [] webapp=/solr path=/update params={wt=javabin&version=2}
status=500 QTime=423 Jun 27, 2011 10:24:28 AM
org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class
java.io.CharConversionException] Invalid UTF-8 character 0xffff at char
#1142033, byte #1155068) at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:
1 8) at
com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:
3 657) at
com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at
org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287) at
org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146) at
org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77) at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Conten
t StreamHandlerBase.java:67) at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa
s e.java:129) at
org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va

:356) at

org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.ja
v a:252) at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHand
l er.java:1212) at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:21
6 ) at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerC
o llection.java:230) at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java
: 114) at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)

 at

org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.j
av a:945) at 

Re: MultiValued facet behavior question

2011-06-22 Thread Mike Sokolov


On 06/22/2011 04:01 AM, Dennis de Boer wrote:

Hi Bill,

as far as I understood now, with the help of my friend, you can't.
Multivalued fields don't work that way.
You can however always filter the facet results manually in the JSP. You
knwo what the user chose as a facet.
   
Yes - that is the most sensible suggestion: if you want to display the 
facets the user chose, and only those, regardless of what was found in 
the index, then I think you know what to do!

The issue I ran into is when you have additional facet fields. For example
when you also have country as a facetfield. Now when you search for
Cardiologist, it also returns Internist and family doctor as you described.
What Sorl now also returns for the country list are the countries for
Cardiologist, but also for Internist  and family doctor. This is not what
you want.
   
I don't think this is accurate.  Your query matches some set of 
documents - the facet values shown will only be those that occur in that 
set.  If some internist's countries are shown when the user selects 
Cardiologist, that is because those internists are also cardiologists, 
right?


-Mike


Re: MultiValued facet behavior question

2011-06-22 Thread Mike Sokolov
We always remove the facet filter when faceting: in other words, for a 
good user experience, you generally want to show facets based on the 
query excluding any restriction based on the facets.
So in your example (facet B selected), we would continue to show *all* 
facets.  Only if you performed a search using some other filter 
(proximity, gender, etc), would we restrict the facet list.


-Mike
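
With the tagged filters available in Solr 1.4 that looks something like 
this (the field name is made up):

q=*:*&fq={!tag=spec}speciality:Cardiologist&facet=true&facet.field={!ex=spec}speciality

The fq still narrows the documents returned, but the {!ex=spec} exclusion 
makes the speciality facet counts ignore that particular filter.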

On 06/22/2011 09:42 AM, Dennis de Boer wrote:

Well, the use case is rather simple. It is not a use case but more a user
experience.

If I have a list of values I can facet on, for example :
A
B
C
D
E

And I click on B, does it make sense for the user to display
B
C
E

after the selection? Just because items in B are C and E items as well?
As a user I chose B because I'm interested in B items. I do not care if they
are also C and E items.
Technically this is correct, but functionally the user doesn't care,
because it is not what they searched for.

In this case they were searching for a Cardiologist. Do I care that a
cardiologist is also a family doctor? No. So I also do not want to see this
as a facet value presented to me in frontend logic.
In the item details you can show that the cardiologist is also a family
doctor. That is fine, but not as an available facet option, if you just
chose a speciality you want to filter on.

Does it make sense?


On Wed, Jun 22, 2011 at 3:31 PM, lee carroll
lee.a.carr...@googlemail.comwrote:

   

Hi Dennis,

I think maybe I just disagree. You're not showing facet counts for
cardiologists and Family Doctors independently. The Family Doctor
count will be all Family Doctors who are also Cardiologists.

This allows users to further filter Cardiologists who are also family
Doctors. (this could be of use to them ??)

If your front end app implements the filtering as a list of fq=xxx
then that would make for consistent results ?

I don't see how not showing that some cardiologists are also Family
Doctors is a better user experience... But again you might have a very
specific use case?

On 22 June 2011 13:44, Dennis de Boerdatdeb...@gmail.com  wrote:
 

Hi Lee,

since I have the same problem, I might as well try to answer this question.

You want this behaviour to make things clear for your users. If they select
cardiologists, does it make sense to also show family doctors as a
facet value to the user?
The same thing goes for the facets that are related to family doctors. They
are returned as well, thus making it even more unclear for the end-user.



On Wed, Jun 22, 2011 at 2:27 PM, lee carroll
lee.a.carr...@googlemail.comwrote:

   

Hi Bill,

 

So that part works. Then when I output the facet, I need a different
behavior than the default. I need
the facet to only output the value that matches (scored) - NOT ALL VALUES
in the multiValued field.

I think it makes sense?
   

Why do you need this? If your use case is faceted navigation then not showing
all the facet terms which match your query would be misleading to your users.
The fact is your data indicates Ben the cardiologist is also a GP etc.
Is it not valid for your users to be able to further filter on cardiologists
who are also specialists in x other disciplines? If the specialisms are
mutually exclusive then your data will reflect this.

The fact is x number of cardiologists match and x number of GP's match etc.

I may be missing the point here as you have not said why you need to do this?

cheers lee c


On 22 June 2011 09:34, Michael Kuhlmanns...@kuli.org  wrote:
 

Am 22.06.2011 09:49, schrieb Bill Bell:
   

You can type q=cardiology and match on cardiologist. If stemming did not
work you can just add a synonym:

cardiology,cardiologist
 

Okay, synonyms are the only way I can think of a realistic match.

Stemming won't work on a facet field; you wouldn't get "Cardiologist: 3"
as the result but "cardiolog: 3" or something like that instead.

Normally, you declare a facet field explicitly for faceting, and not
for searching, exactly because stemming and tokenizing on facet fields
don't make sense.

And the short answer is: No, that's not possible.

-Kuli

   
 
   
 
   


Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread Mike Sokolov
I'd be very interested in this, as well, if you do it before me and are 
willing to share...


A related question I have tried to ask on this list, and have never 
really gotten a good answer to, is whether it makes sense to just chuck 
the external storage and treat the lucene index as the primary storage 
for documents.  I have a feeling the answer is no; perhaps because of 
increased I/O costs for lucene and solr, but I don't really know.  I've 
been considering doing some experimentation, but would really love an 
expert opinion...


-Mike

On 06/20/2011 08:41 AM, Jamie Johnson wrote:

I am trying to index data where I'm concerned that storing the contents of a
specific field will be a bit of a hog so we are planning to retrieve this
information as needed for highlighting from an external source.  I am
looking to extend the default solr highlighting capability to work with
information pulled from this external source and it looks like this is
possible by extending DefaultSolrHighlighter (line 418 to pull a particular
field from external source) for standard highlighting and
BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just hard
code this to say if the field name is a specific value look into the
external source, is this the best way to accomplish this?  Are there any
other extension points to do what I'm suggesting?

   


Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread Mike Sokolov
Another option for determining whether to go to external storage would 
be to examine the SchemaField, see if it is stored, and if not, try to 
fetch from a file or whatever.  That way you won't have to configure 
anything.


-Mike

On 06/20/2011 09:46 AM, Jamie Johnson wrote:
In my case chucking the external storage is simply not an option.  
I'll definitely share anything I find.  The following is a very simple 
example of adding text to the default solr highlighter (I had to copy a 
large portion of the class, since the method that actually does the 
highlighting is private, along with some classes, to get this to run).  
If you look at the source it should hopefully make sense.



String[] docTexts = null;

if (fieldName.equals("title")) {
    SchemaField keyField = schema.getUniqueKeyField();
    String key = doc.getValues(keyField.getName())[0];  // I know this field exists and is not multivalued
    docTexts = doc.getValues(fieldName);  // this would be loaded from external store, but below just appends some information

    if (key != null && key.length() > 0) {
        for (int x = 0; x < docTexts.length; x++) {
            docTexts[x] = docTexts[x] + " some added text";
        }
    }
}

I have cheated since I know the name of the field ("title") I am doing 
this for, but it would probably be useful to allow this to be set on the 
highlighter class through configuration in solrconfig (I'm not familiar 
at all with doing this and have spent 0 time looking into it).  Once 
configured, the if (fieldName.equals("title")) line would be replaced 
with something like if (externalFields.contains(fieldName)) {...} or 
something like that.


Thoughts/comments?

On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov soko...@ifactory.com 
mailto:soko...@ifactory.com wrote:


I'd be very interested in this, as well, if you do it before me
and are willing to share...

A related question I have tried to ask on this list, and have
never really gotten a good answer to, is whether it makes sense to
just chuck the external storage and treat the lucene index as the
primary storage for documents.  I have a feeling the answer is no;
perhaps because of increased I/O costs for lucene and solr, but I
don't really know.  I've been considering doing some
experimentation, but would really love an expert opinion...

-Mike


On 06/20/2011 08:41 AM, Jamie Johnson wrote:

I am trying to index data where I'm concerned that storing the
contents of a
specific field will be a bit of a hog so we are planning to
retrieve this
information as needed for highlighting from an external
source.  I am
looking to extend the default solr highlighting capability to
work with
information pulled from this external source and it looks like
this is
possible by extending DefaultSolrHighlighter (line 418 to pull
a particular
field from external source) for standard highlighting and
BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I
could just hard
code this to say if the field name is a specific value look
into the
external source, is this the best way to accomplish this?  Are
there any
other extension points to do what I'm suggesting?





Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread Mike Sokolov
Yes that sounds about right.  I also have in mind an optimization for 
highlighting so it doesn't need to pull the whole field value.  The fast 
vector highlighter is working with offsets into the field, and should 
work better w/random access into the field value(s).  But that should 
come as a later optimization.


Another thing that bugs me about fvh is that it seems to need to 
recompute all the terms that matched the query for each retrieved field 
value when it seems like it ought to be able to make use of information 
gleaned during the actual query process, but that probably involves some 
deep change to cache that info during query scoring, and that is beyond 
my ken at the moment.


-Mike

On 06/20/2011 10:00 AM, Jamie Johnson wrote:
perhaps it should be an array that gets returned to be consistent with 
getValues(fieldName);


On Mon, Jun 20, 2011 at 9:59 AM, Jamie Johnson jej2...@gmail.com 
mailto:jej2...@gmail.com wrote:


Yes, in that case the code becomes

if (!schemaField.stored()) {
    SchemaField keyField = schema.getUniqueKeyField();
    String key = doc.getValues(keyField.getName())[0];
    docTexts = doc.getValues(fieldName);

    if (key != null && key.length() > 0) {
        for (int x = 0; x < docTexts.length; x++) {
            docTexts[x] = docTexts[x] + " some added text";
        }
    }
}


I'd imagine that we'd want some type of interface to actually pull
the text so you can plugin different providers, something like

interface ISolrExternalFieldProvider {
    public String getFieldContent(String key, SchemaField field);
}

not sure if there is anything else that interface should include
but that's all I would need at present.



On Mon, Jun 20, 2011 at 9:54 AM, Mike Sokolov
soko...@ifactory.com mailto:soko...@ifactory.com wrote:

Another option for determining whether to go to external
storage would be to examine the SchemaField, see if it is
stored, and if not, try to fetch from a file or whatever. 
That way you won't have to configure anything.


-Mike


On 06/20/2011 09:46 AM, Jamie Johnson wrote:

In my case chucking the external storage is simply not an
option.  I'll definitely share anything I find,  the
following is a very simple example of adding text to the
default solr highlighter (had to copy a large portion of the
class since the method that actually does the highlighting is
private along with some classes to get this to run).  If you
look at the source it should hopefully make sense.


String[] docTexts = null;

if (fieldName.equals("title")) {
    SchemaField keyField = schema.getUniqueKeyField();
    String key = doc.getValues(keyField.getName())[0];  // I know this field exists and is not multivalued
    docTexts = doc.getValues(fieldName);  // this would be loaded from external store, but below just appends some information
    if (key != null && key.length() > 0) {
        for (int x = 0; x < docTexts.length; x++) {
            docTexts[x] = docTexts[x] + " some added text";
        }
    }
}

I have cheated since I know the name of the field ("title") I am doing
this for, but it would probably be useful to allow this to be set on the
highlighter class through configuration in solrconfig (I'm not familiar at all
with doing this and have spent 0 time looking into it).  Once configured,
the if (fieldName.equals("title")) line would be replaced with something
like if (externalFields.contains(fieldName)) {...} or something like that.

Thoughts/comments?

On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov
soko...@ifactory.com mailto:soko...@ifactory.com wrote:

I'd be very interested in this, as well, if you do it
before me and are willing to share...

A related question I have tried to ask on this list, and
have never really gotten a good answer to, is whether it
makes sense to just chuck the external storage and treat
the lucene index as the primary storage for documents.  I
have a feeling the answer is no; perhaps because of
increased I/O costs for lucene and solr, but I don't
really know.  I've been considering doing some
experimentation, but would really love an expert opinion...

-Mike


On 06/20/2011 08:41 AM, Jamie Johnson wrote:

I am trying to index data where I'm concerned that
storing the contents of a
specific field will be a bit

Re: Text field case sensitivity problem

2011-06-15 Thread Mike Sokolov
I wonder whether CharFilters are applied to wildcard terms?  I suspect 
they might be.  If that's the case, you could use the MappingCharFilter 
to perform lowercasing (and strip diacritics too if you want that)


-Mike

On 06/15/2011 10:12 AM, Jamie Johnson wrote:
So simply lower casing the query works but can get complex.  The query that 
I'm executing may have things like ranges which require some words to 
be upper case (i.e. TO).  I think this would be much better solved on 
Solr's end, is there a JIRA about this?


On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov soko...@ifactory.com 
mailto:soko...@ifactory.com wrote:


oops, please s/Highlight/Wildcard/


On 06/14/2011 05:31 PM, Mike Sokolov wrote:

Wildcard queries aren't analyzed, I think?  I'm not completely
sure what the best workaround is here: perhaps simply
lowercasing the query terms yourself in the application.  Also
- I hope someone more knowledgeable will say that the new
HighlightQuery in trunk doesn't have this restriction, but I'm
not sure about that.

-Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:

Also of interest to me is this returns results

http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie
Johnsonjej2...@gmail.com mailto:jej2...@gmail.com  wrote:

I am using the following for my text field:

<fieldType name="text" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
        synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I have a field defined as
<field name="Person_Name" type="text" stored="true" indexed="true" />

When I go to the following URL I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?




Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov
Wildcard queries aren't analyzed, I think?  I'm not completely sure what 
the best workaround is here: perhaps simply lowercasing the query terms 
yourself in the application.  Also - I hope someone more knowledgeable 
will say that the new HighlightQuery in trunk doesn't have this 
restriction, but I'm not sure about that.


-Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:

Also of interest to me is this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonjej2...@gmail.com  wrote:

   

I am using the following for my text field:

<fieldType name="text" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
        synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I have a field defined as
<field name="Person_Name" type="text" stored="true" indexed="true" />

When I go to the following URL I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?

 
   


Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov

oops, please s/Highlight/Wildcard/

On 06/14/2011 05:31 PM, Mike Sokolov wrote:
Wildcard queries aren't analyzed, I think?  I'm not completely sure 
what the best workaround is here: perhaps simply lowercasing the query 
terms yourself in the application.  Also - I hope someone more 
knowledgeable will say that the new HighlightQuery in trunk doesn't 
have this restriction, but I'm not sure about that.


-Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:

Also of interest to me is this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonjej2...@gmail.com  
wrote:



I am using the following for my text field:

<fieldType name="text" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
        synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt"
        enablePositionIncrements="true"
    />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I have a field defined as
<field name="Person_Name" type="text" stored="true" indexed="true" />

When I go to the following URL I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?



Re: Obtaining query AST?

2011-05-31 Thread Mike Sokolov
I believe there is a query parser that accepts queries formatted in XML, 
allowing you to provide a parse tree to Solr; perhaps that would get you 
the control you're after.


-Mike
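
If I remember the contrib parser right, the input looks something like 
this (the field and terms are made up):

<BooleanQuery fieldName="contents">
  <Clause occurs="must"><TermQuery>apache</TermQuery></Clause>
  <Clause occurs="should"><TermQuery>solr</TermQuery></Clause>
</BooleanQuery>

Since you build the tree yourself, your expander can insert or rearrange 
clauses before handing it over.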

On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote:

Hi,
  I want to write my own query expander. It needs to obtain the AST
(abstract syntax tree) of an already parsed query string, navigate to
certain parts of it (words) and make logical phrases of those words by
adding to the AST - where necessary.

This cannot be done to the string because the query logic cannot be
semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed
first.

How can this be done with SolrJ?

thanks for any tips.
Darren


   


Re: solr Invalid Date in Date Math String/Invalid Date String

2011-05-27 Thread Mike Sokolov
The * endpoint for range terms wasn't implemented yet in 1.4.1.  As a 
workaround, we use very large and very small values.


-Mike
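
For example, instead of the open-ended [* TO NOW] you could write (the 
field name is made up):

timestamp:[1000-01-01T00:00:00Z TO NOW]

or use a far-future date like 2100-01-01T00:00:00Z where the upper 
endpoint would have been *.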

On 05/27/2011 12:55 AM, alucard001 wrote:

Hi all

I am using SOLR 1.4.1 (according to solr info), but no matter what date
field I use (date or tdate) defined in default schema.xml, I cannot do a
search in solr-admin analysis.jsp:

fieldtype: date(or tdate)
fieldvalue(index): 2006-12-22T13:52:13Z (I type it in manually, no trailing
space)
fieldvalue(query):

The only success case:
2006-12-22T13:52:13Z

All search below are failed:
* TO NOW
[* TO NOW]

2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z
2006\-12\-22T00\:00\:00Z TO 2006\-12\-22T23\:59\:59Z
[2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z]
[2006\-12\-22T00\:00\:00Z TO 2006\-12\-22T23\:59\:59Z]

2006-12-22T00:00:00.000Z TO 2006-12-22T23:59:59.999Z
2006\-12\-22T00\:00\:00\.000Z TO 2006\-12\-22T23\:59\:59\.999Z
[2006-12-22T00:00:00.000Z TO 2006-12-22T23:59:59.999Z]
[2006\-12\-22T00\:00\:00\.000Z TO 2006\-12\-22T23\:59\:59\.999Z]

2006-12-22T00:00:00Z TO *
2006\-12\-22T00\:00\:00Z TO *
[2006-12-22T00:00:00Z TO *]
[2006\-12\-22T00\:00\:00Z TO *]

2006-12-22T00:00:00.000Z TO *
2006\-12\-22T00\:00\:00\.000Z TO *
[2006-12-22T00:00:00.000Z TO *]
[2006\-12\-22T00\:00\:00\.000Z TO *]
(vice versa)

I get either:
Invalid Date in Date Math String or
Invalid Date String
error

What's wrong with it?  Can anyone please help me on that?

Thank you.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-Invalid-Date-in-Date-Math-String-Invalid-Date-String-tp2991763p2991763.html
Sent from the Solr - User mailing list archive at Nabble.com.
   


Re: Solr Highlight Component

2011-05-24 Thread Mike Sokolov
A possible workaround is to re-fetch the documents in your result set 
with a query that is:


+id:(id1 OR id2 OR ... id20) (highlight query)

where id1..20 are the doc ids in your result set

would require two round-trips though

-Mike
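
Concretely, the second request might look something like this (ids and 
terms made up; URL-encode as needed):

q=+id:(doc1 OR doc2 OR doc3) (content:test)&hl=true&hl.fl=content

Only the docs you already retrieved can match the required id clause, and 
the optional clause is what gets highlighted.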

On 05/24/2011 08:19 AM, Koji Sekiguchi wrote:

(11/05/24 20:56), Lord Khan Han wrote:

Hi ,

Can I limit the terms that the HighlightComponent uses. My query is
generally long and I want specific ones to be highlighted and the 
rest is
not highlighted. Is there an option like the SpellCheckComponent. it 
uses q

unless spellcheck.q if specified. Is  a hl.q parameter possible?


No, but hl.q was proposed by me a year ago:

https://issues.apache.org/jira/browse/SOLR-1926

I'm sorry but no progress is there at this moment.

koji


Re: [Contribution] Multiword Inline-Prefix Autocomplete Idea

2011-05-20 Thread Mike Sokolov

Cool!  suggestion: you might want to replace

externalVal.toLowerCase().split(" ");

with

externalVal.toLowerCase().split("\\s+");

also I bet folks might have different ideas about what to do with 
hyphens, so maybe:


externalVal.toLowerCase().split("[-\\s]+");

In fact why not make it a configurable parameter?  Or - even better - 
use some other existing token analysis chain?  I'm not sure how to fit 
that into Solr's architecture: can you analyze a field value and still 
access the unanalyzed text?


-Mike


Re: document storage

2011-05-16 Thread Mike Sokolov

On 05/15/2011 11:48 AM, Erick Erickson wrote:

Where are the documents coming from? Because storing them ONLY in
Solr risks losing them if your index is somehow hosed.
   
In our case, we generally have source documents and can reproduce the 
index if need be, but that's a good point.

Storing them externally only has the advantage that your index will be
much smaller, which helps when replicating as you scale. The downside
here is that highlighting will be more resource-intensive since you're
re-analyzing text in order to highlight.
   
I had been imagining that the Highlighter could use stored term 
positions so as to avoid re-analysis.  Is this incompatible with 
external storage?


We might conceivably need to replicate the documents anyway, even if 
they are stored externally, in order to make them available to a farm of 
servers, although a SAN is another possibility here.


My main concern about storing internally was the cost of merging 
(optimizing) the index.  Presumably that would be increased if the docs 
are stored in it.

So, as usual, it depends (tm). What is the scale you need? What
is the QPS you're thinking of supporting?
   
Things are working well at a small scale, and in that environment I 
think all of these solutions work more or less equally well.  We're 
worrying about 10's of millions of documents and QPS around 50, so I 
expect we will have some significant challenges in coordinating a 
cluster of servers, and we're trying to plan as well as we can for 
that.  We expect updates to be performed in a batch mode - they don't 
have to be real-time, but they might need to be daily.


-Mike


Re: boolean versus non-boolean search

2011-05-16 Thread Mike Sokolov


On 05/16/2011 09:24 AM, Dmitry Kan wrote:

Dear list,

Might have missed it from the literature and the list, sorry if so, but:

SOLR 1.4.1
solrQueryParser defaultOperator=AND/


Consider the query:

term1 term2 OR term1 term2 OR term1 term3

   
I think what's happening is that your query gets rewritten into 
something like:


+term1 +(term2? term1 term2? term3?)

where in my notation term? means term is optional, and + means 
required.  So any document would match the second clause


-Mike
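
An easy way to verify what the parser actually built is to add 
debugQuery=on to the request and read the parsedquery entry in the 
response, e.g. (URL-encode the spaces in practice):

http://localhost:8983/solr/select?q=term1 term2 OR term1 term2 OR term1 term3&debugQuery=on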


Re: [POLL] How do you (like to) do logging with Solr

2011-05-16 Thread Mike Sokolov
We use log4j explicitly and find it irritating to deal with the built-in 
JDK logging default.  We also have conflicts with other packages that 
have their own ideas about how to bind slf4j, so the less of this the 
better, IMO.  The 1.6.1 no-op default behavior seems a bit unfortunate 
as out-of-the-box behavior to me though. Not sure if there's anything to 
be done about that.  Can you log to stderr when there's no logger available?


-Mike

On 05/16/2011 04:43 AM, Jan Høydahl wrote:

Hi,

This poll is to investigate how you currently do or would like to do logging with Solr 
when deploying solr.war to a SEPARATE java application server (such as Tomcat, Resin etc) 
outside of the bundled solr/example. For background on how things work in 
Solr now, see http://wiki.apache.org/solr/SolrLogging and for more info on the SLF4J 
framework, see http://www.slf4j.org/manual.html

Please tick one of the options below with an [X]:

[ ]  I always use the JDK logging as bundled in solr.war, that's perfect
[ ]  I sometimes use log4j or another framework and am happy with re-packaging 
solr.war
[X]  Give me solr.war WITHOUT an slf4j logger binding, so I can choose at 
deploy time
[ ]  Let me choose whether to bundle a binding or not at build time, using an 
ANT option
[ ]  What's wrong with the solr/example Jetty? I never run Solr elsewhere!
[ ]  What? Solr can do logging? How cool!

Note that NOT bundling a logger binding with solr.war means defaulting to the 
NOP logger after outputting these lines to stderr:
SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

   


document storage

2011-05-13 Thread Mike Sokolov
Would anyone care to comment on the merits of storing indexed full-text 
documents in Solr versus storing them externally?


It seems there are three options for us:

1) store documents both in Solr and externally - this is what we are 
doing now, and gives us all sorts of flexibility, but doesn't seem like 
the most scalable option, at least in terms of storage space and I/O 
required when updating/inserting documents.


2) store documents externally: For the moment, the only thing that 
requires us to store documents in Solr is the need to highlight them, 
both in search result snippets and in full document views. We are 
considering hunting for or writing a Highlighter extension that could 
pull in the document text from an external source (eg filesystem).


3) store documents only in Solr.  We'd just retrieve document text as a 
Solr field value rather than reading from the filesystem.  Somehow this 
strikes me as the wrong thing to do, but it could work:  I'm not sure 
why.  A lot of unnecessary merging I/O activity perhaps.  Makes it hard 
to grep the documents or use other filesystem tools, I suppose.


Which one of these sounds best to you?  Under which circumstances? Are 
there other possibilities?


Thanks!

--

Michael Sokolov
Engineering Director
www.ifactory.com



Re: What is correct use of HTMLStripCharFilter in Solr 3.1

2011-05-12 Thread Mike Sokolov
It preserves the location of the terms in the original HTML document so 
that you can highlight terms in HTML.  This makes it possible (for 
instance) to display the entire document, with all the search terms 
highlighted, or (with some careful surgery) to display formatted HTML 
(bold, italic, etc) in your search results.


-Mike

On 05/12/2011 03:42 PM, Jonathan Rochkind wrote:

On 5/12/2011 2:55 PM, Ahmet Arslan wrote:

I recently upgraded from Solr 1.3 to Solr 3.1 in order to
take advantage of
the HTMLStripCharFilter. But it isn't working as I
expected.

You need to strip the HTML tags before the analysis phase. If you are using
DIH, you can use the stripHTML=true transformer.





Wait, then what's the HTMLStripCharFilter for?


Re: how to do offline adding/updating index

2011-05-10 Thread Mike Sokolov
I think the key question here is what's the best way to perform indexing 
without affecting search performance, or without affecting it much.  If 
you have a batch of documents to index (say a daily batch that takes an 
hour to index and merge), you'd like to do that on an offline system, 
and then when ready, bring that index up for searching.  but using 
Lucene's multiple commit points assumes you use the same box for search 
and indexing, doesn't it?


Something like this is what I have in mind (simple 2-server config here):

Box 1 is live and searching
Box 2 is offline and ready to index

loading begins on Box 2...
loading complete on Box 2 ...
commit, optimize

Swap Box 1 and Box 2 (with a load balancer or application config?)
Box 2 is live and searching
Box 1 is offline and ready to index

To make the best use of your resources, you'd then like to start using 
Box 1 for searching (until indexing starts up again).  Perhaps if your 
load balancing is clever enough, it could be sensitive to the decreased 
performance of the indexing box and just send more requests to the other 
one(s).  That's probably ideal.
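(As an aside: if you ran the same dance on a single box with two cores
instead of two boxes, the CoreAdmin SWAP command would do the switch -- the
core names here are made up:

http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=staging

though that of course doesn't isolate the indexing I/O the way separate
boxes do.)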


-Mike S


Under the hood, Lucene can support this by keeping multiple commit
points in the index.

So you'd make a new commit whenever you finish indexing the updates
from each hour, and record that this is the last searchable commit.

Then you are free to commit while indexing the next hour's worth of
changes, but these commits are not marked as searchable.

But... this is a low level Lucene capability and I don't know of any
plans for Solr to support multiple commit points in the index.

Mike

http://blog.mikemccandless.com

On Tue, May 10, 2011 at 9:22 AM, vrpar...@gmail.com wrote:
   

Hello all,

indexing with DataImportHandler runs every hour (new records will be added,
some records will be updated); note: large data volume.

the requirement is that while indexing is in progress, searching (on already
indexed data) should not be affected.

so should I use multicore with merge and swap, or delta query, or some other
way?

Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2923035.html
Sent from the Solr - User mailing list archive at Nabble.com.

 


Re: how to do offline adding/updating index

2011-05-10 Thread Mike Sokolov
Thanks - that sounds like what I was hoping for.  So the I/O during 
replication will have *some* impact on search performance, but 
presumably much less than reindexing and merging/optimizing?
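On the master side, I take it that's just something like this (a sketch):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- slaves pull a new index only after an explicit optimize -->
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>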


-Mike


Master/slave replication does this out of the box, easily. Just set the slave
to update on Optimize only. Then you can update the master as much as you
want. When you are ready to update the slave (the search instance), just
optimize the master. On the slave's next cycle check it will refresh itself,
quickly, efficiently, minimal impact to search performance. No need to build
extra moving parts for swapping search servers or anything like that.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2924426.html
Sent from the Solr - User mailing list archive at Nabble.com.
   


updates not reflected in solr admin

2011-05-02 Thread Mike Sokolov
This is in 1.4 - we push updates via SolrJ; our application sees the 
updates, but when we use the solr admin screens to run test queries, or 
use Luke to view the schema and field values, it sees the database in 
its state prior to the commit.  I think eventually this seems to 
propagate, but I'm not clear how often since we generally restart the 
(tomcat) server in order to get the new commit to be visible.


I saw a comment recently (from Lance) that there is (annoying) HTTP 
caching enabled by default in solrconfig.xml.  Does this sound like 
something that would be caused by that cache?  If so, I'd probably want 
to disable it.   Does that affect performance of queries run via SolrJ?  
Also: why isn't that cache flushed by a commit?  Seems weird...
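If it is that cache, I suppose the fix would be something like this in
solrconfig.xml (a sketch):

<requestDispatcher handleSelect="true">
  <!-- never send cache validators or reply 304 Not Modified -->
  <httpCaching never304="true"/>
</requestDispatcher>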


--
Michael Sokolov
Engineering Director
www.ifactory.com



Re: updates not reflected in solr admin

2011-05-02 Thread Mike Sokolov
Thanks - we are issuing a commit via SolrJ; I think that's the same 
thing, right?  Or are you saying really we need to do a separate commit 
(via HTTP) to update the admin console's view?
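For reference, what we run is essentially this (a sketch -- the URL is made
up):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PushAndCommit {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
        // ... solr.add(...) calls happen elsewhere ...
        solr.commit();  // the commit I assumed would be visible everywhere
    }
}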


-Mike

On 05/02/2011 11:49 AM, Ahmet Arslan wrote:


This is in 1.4 - we push updates via SolrJ; our application sees the updates, 
but when we use the solr admin screens to run test queries, or use Luke to view 
the schema and field values, it sees the database in its state prior to the 
commit.  I think eventually this seems to propagate, but I'm not clear how 
often since we generally restart the (tomcat) server in order to get the new 
commit to be visible.


You need to issue a commit from the HTTP interface to see the changes made
by the embedded Solr server:
solr/update?commit=true

   


Re: updates not reflected in solr admin

2011-05-02 Thread Mike Sokolov

Ah - I didn't expect that.  Thank you!

On 05/02/2011 12:07 PM, Ahmet Arslan wrote:




Thanks - we are issuing a commit via SolrJ; I think that's the same
thing, right?  Or are you saying really we need to do a separate commit
(via HTTP) to update the admin console's view?

Yes separate commit is needed.
   


Re: Searching for escaped characters

2011-04-28 Thread Mike Sokolov
StandardTokenizer will have stripped punctuation I think.  You might try 
searching for all the entity names though:


(agrave | egrave | omacron | etc... )

The names are pretty distinctive.  Although you might have problems with 
greek letters.
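i.e. a query along these lines (field name made up; extend the list with
whatever entities you expect):

q=myfield:(agrave OR egrave OR omacron)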


-Mike

On 04/28/2011 12:10 PM, Paul wrote:

I'm trying to create a test to make sure that character sequences like
&egrave; are successfully converted to their equivalent UTF-8
character (that is, in this case, è).

So, I'd like to search my solr index using the equivalent of the
following regular expression:

&\w{1,6};

To find any escaped sequences that might have slipped through.

Is this possible? I have indexed these fields with text_lu, which
looks like this:

<fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

Thanks,
Paul
   


Re: Replication Fails with Unreachable error when master host is responding.

2011-04-28 Thread Mike Sokolov

No clue. Try wireshark to gather more data?

On 04/28/2011 02:53 PM, Jed Glazner wrote:

Anybody?

On 04/27/2011 01:51 PM, Jed Glazner wrote:

Hello All,

I'm having a very strange problem that I just can't figure out. The
slave is not able to replicate from the master, even though the master
is reachable from the slave machine.  I can telnet to the port it's
running on, I can use text based browsers to navigate the master from
the slave. I just don't understand why it won't replicate.  The admin
screen gives me an Unreachable in the status, and in the log there is an
exception thrown.  Details below:

BACKGROUND:

OS: Arch Linux
Solr Version: svn revision 1096983 from
https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/
No custom plugins, just whatever came with the version above.
Java Setup:

java version 1.6.0_22
OpenJDK Runtime Environment (IcedTea6 1.10) (ArchLinux-6.b22_1.10-1-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

We have 3 cores running, all 3 cores are not able to replicate.

The admin on the slave shows  the Master as
http://solr-master-01_dev.la.bo:8983/solr/music/replication  - *Unreachable*
Replication def on the slave

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="${slave:slave}">
    <str name="masterUrl">http://solr-master-01_dev.la.bo:8983/solr/music/replication</str>
    <str name="pollInterval">00:15:00</str>
  </lst>
</requestHandler>

Replication def on the master:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="${master:master}">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

Below is the full log for the replication attempts. Note that it says
connection refused; however, I can telnet to 8983 from the slave to the
master, so I know it's up and reachable from the slave:

telnet solr-master-01_dev.la.bo 8983
Trying 172.12.65.58...
Connected to solr-master-01_dev.la.bo.
Escape character is '^]'.

I double-checked the master to make sure that it didn't have replication
turned off, and it doesn't.  So I should be able to replicate, but it
can't.  I just don't know what else to check.  The log from the slave is
below.

Apr 27, 2011 7:39:45 PM org.apache.solr.request.SolrQueryResponse <init>
WARNING: org.apache.solr.request.SolrQueryResponse is deprecated. Please
use the corresponding class in org.apache.solr.response
Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing
request: Connection refused
Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: Retrying request
Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing
request: Connection refused
Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: Retrying request
Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing
request: Connection refused
Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: Retrying request
Apr 27, 2011 7:39:45 PM org.apache.solr.handler.ReplicationHandler
getReplicationDetails
WARNING: Exception while invoking 'details' method for replication on
master
java.net.ConnectException: Connection refused
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
  at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
  at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
  at java.net.Socket.connect(Socket.java:546)
  at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
  at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:616)
  at
org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:140)
  at
org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
  at
org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
  at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
  at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
  at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
  at

Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Suppose your analysis stack includes lower-casing, but your synonyms are 
only supposed to apply to upper-case tokens.  For example, PET might 
be a synonym of positron emission tomography, but pet wouldn't be.


-Mike

On 04/26/2011 09:51 AM, Robert Muir wrote:

On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com  wrote:

   

But somehow this feels bad (well, so does sticking word variations in what's
supposed to be a synonyms file), partly because it means that the person adding
new synonyms would need to know what they stem to (or always check it against
Solr before editing the file).
 

when creating the synonym map from your input file, currently the
factory actually uses your Tokenizer only to pre-process the synonyms
file.

One idea would be to use the tokenstream up to the synonymfilter
itself (including filters). This way if you put a stemmer before the
synonymfilter, it would stem your synonyms file, too.

I haven't totally thought the whole thing through to see if theres a
big reason why this wouldn't work (the synonymsfilter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isnt so obvious since in the
default configuration the synonymfilter is directly after the
tokenizer.
   


Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Yes, I see.  Makes sense.  It is a bit hard to see a bad case for your 
proposal in that light. Here is one other example; I'm not sure whether 
it presents difficulties or not, and may be a bit contrived, but hey, 
food for thought at least:


Say you have set up synonyms between names and commonly-used pseudonyms 
or alternate names that should not be stemmed:


Malcolm X = Malcolm Little
Prince = Rogers Nelson Prince
Little Kim = Kimberly Denise Jones
Biggy Smalls etc.

You don't want Malcolm Littler or Littlest Kim or Big Small to 
match anything. And Princely shouldn't bring up the artist.


But you also have regular linguistic synonyms (not names) that *should* 
be stemmed (as in the original example).  So little = small should 
imply littler = smaller and so on via stemming.


Ideally  you could put one SynonymFilter before the stemming and the 
other one after.  In that case do the SynonymFilters get composed?  I 
can't think of a believable example where that would cause a problem, 
but maybe you can?
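Concretely, the chain I'm imagining would be something like this (a sketch;
the synonym file names are made up):

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- name synonyms, matched against unstemmed tokens -->
  <filter class="solr.SynonymFilterFactory" synonyms="names.txt"
          ignoreCase="false" expand="true"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <!-- ordinary linguistic synonyms, applied after stemming -->
  <filter class="solr.SynonymFilterFactory" synonyms="general.txt"
          ignoreCase="true" expand="true"/>
</analyzer>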


-Mike


On 04/26/2011 04:25 PM, Robert Muir wrote:

Mike, thanks a lot for your example: the idea here would be you would
put the lowercasefilter after the synonymfilter, and then you get this
exact flexibility?

e.g.
WhitespaceTokenizer
SynonymFilter -  no lowercasing of tokens are done as it analyzes
your synonyms with just the tokenizer
LowerCaseFilter

but
WhitespaceTokenizer
LowerCaseFilter
SynonymFilter -  the synonyms are lowercased, as it analyzes
synonyms with the tokenizer+filter

its already inconsistent today, because if you do:

LowerCaseTokenizer
SynonymFilter

then your synonyms are in fact all being lowercased... its just
arbitrary that they are only being analyzed with the tokenizer.

On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolovsoko...@ifactory.com  wrote:
   

Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens.  For example, PET might be a
synonym of positron emission tomography, but pet wouldn't be.

-Mike

On 04/26/2011 09:51 AM, Robert Muir wrote:
 

On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:


   

But somehow this feels bad (well, so does sticking word variations in
what's
supposed to be a synonyms file), partly because it means that the person
adding
new synonyms would need to know what they stem to (or always check it
against
Solr before editing the file).

 

when creating the synonym map from your input file, currently the
factory actually uses your Tokenizer only to pre-process the synonyms
file.

One idea would be to use the tokenstream up to the synonymfilter
itself (including filters). This way if you put a stemmer before the
synonymfilter, it would stem your synonyms file, too.

I haven't totally thought the whole thing through to see if theres a
big reason why this wouldn't work (the synonymsfilter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isnt so obvious since in the
default configuration the synonymfilter is directly after the
tokenizer.

   
 


Re: multi-core solr, specifying the data directory

2011-03-02 Thread Mike Sokolov
Yes - I commented out the dataDir element in solrconfig.xml and then 
got the expected behavior: the core used a data subdirectory in the core 
subdirectory.


It seems like the problem arises from using the solrconfig.xml that's 
distributed as example/solr/conf/solrconfig.xml


The solrconfig.xml's in  example/multicore/ don't have the dataDir 
element.


-Mike

On 03/01/2011 08:24 PM, Chris Hostetter wrote:

: <!-- Used to specify an alternate directory to hold all index data
:      other than the default ./data under the Solr home.
:      If replication is in use, this should match the replication
:      configuration. -->
: <dataDir>${solr.data.dir:./solr/data}</dataDir>

that directive says: use the solr.data.dir system property to pick a path;
if it is not set, use ./solr/data (relative to the CWD)

if you want it to use the default, then you need to eliminate it
completely, or you need to change it to the empty string...

<dataDir>${solr.data.dir:}</dataDir>

or...

<dataDir></dataDir>


-Hoss
   


Re: Query question

2010-11-03 Thread Mike Sokolov

Another alternative (prettier to my eye) would be:

(city:Chicago AND Romantic AND View)^10 OR (Romantic AND View)


-Mike



On 11/03/2010 09:28 AM, kenf_nc wrote:

Unfortunately the default operator is set to AND and I can't change that at
this time.

If I do  (city:Chicago^10 OR Romantic OR View) it returns way too many
unwanted results.
If I do (city:Chicago^10 OR (Romantic AND View)) it returns less unwanted
results, but still a lot.
iorixxx's solution of (Romantic AND View AND (city:Chicago^10 OR (*:*
-city:Chicago))) does seem to work. Chicago results are at the top, and the
remaining results seem to fit the other search parameters. It's an ugly
query, but does seem to do the trick for now until I master Dismax.

Thanks all!

   


Re: How do I this in Solr?

2010-10-27 Thread Mike Sokolov
Right - my point was to combine this with the previous approaches to 
form a query like:


samsung AND android AND GPS AND word_count:3

in order to exclude documents containing additional words. This would 
avoid the combinatoric explosion problem others had alluded to earlier. 
Of course this would fail because android is misspelled :)
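The word_count field would have to be computed client-side at index time;
a sketch:

import org.apache.solr.common.SolrInputDocument;

public class WordCountField {
    // count whitespace-separated tokens and index them alongside the text
    static SolrInputDocument withWordCount(String id, String text) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text", text);
        doc.addField("word_count", text.trim().split("\\s+").length);
        return doc;
    }
}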


-Mike

On 10/27/2010 08:45 AM, Steven A Rowe wrote:

I'm pretty sure the word-count strategy won't work.

   

If I search with the text samsung andriod GPS, search results
should only contain samsung, GPS, andriod and samsung andriod.
 

Using the word-count strategy, a document containing samsung andriod PDQ 
would be a hit, but Varun doesn't want it, because it contains a word that is not in the 
query.

Steve

   

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, October 27, 2010 7:44 AM
To: solr-user@lucene.apache.org
Subject: RE: How do I this in Solr?

You might try adding a field containing the word count and making sure
that
matches the query's word count?

This would require you to tokenize the query and document yourself,
perhaps.

-Mike

 

-Original Message-
From: Varun Gupta [mailto:varun.vgu...@gmail.com]
Sent: Tuesday, October 26, 2010 11:26 PM
To: solr-user@lucene.apache.org
Subject: Re: How do I this in Solr?

Thanks everybody for the inputs.

Looks like Steven's solution is the closest one but will lead
to performance issues when the query string has many terms.

I will try to implement the two filters suggested by Steven
and see how the performance matches up.

--
Thanks
Varun Gupta


On Wed, Oct 27, 2010 at 8:04 AM, scott chu
scott@udngroup.com wrote:

   

I think you have to write a yet exact match handler yourself (I mean
yet because it's not quite the exact match we normally know). Steve's
answer is quite near your request, and you can do further work based on
his solution.

At the last step, I'd suggest you strip all whitespace from the query
string and each query result, respectively, and only return those
results whose string length equals the query string's.

For example, given:
*query string = Samsung with GPS
*query results:
result 1 = Samsung has lots of mobile with GPS
result 2 = with GPS Samsung
result 3 = GPS mobile with vendors, such as Sony, Samsung

they become:
*query string = SamsungwithGPS (length = 14)
*query results:
result 1 = SamsunghaslotsofmobilewithGPS (length = 29)
result 2 = withGPSSamsung (length = 14)
result 3 = GPSmobilewithvendors,suchasSony,Samsung (length = 39)

so result 2 matches your request.

In this way, you can avoid the work of handling case sensitivity and
word-order rearrangement. Furthermore, you can do refined work, such as
removing whitespace characters, etc.

Scott @ Taiwan


- Original Message - From: Varun Gupta
varun.vgu...@gmail.com

To:solr-user@lucene.apache.org
Sent: Tuesday, October 26, 2010 9:07 PM

Subject: How do I this in Solr?


  Hi,
 

I have a lot of small documents (each containing 1 to 15 words) indexed
in Solr. For the search query, I want the search results to contain
only those documents that satisfy this criteria: all of the words of
the search result document are present in the search query.

For example:
If I have the following documents indexed: nokia n95, GPS,
android, samsung, samsung andriod, nokia andriod, mobile with GPS

If I search with the text samsung andriod GPS, search results
should only contain samsung, GPS, andriod and samsung andriod.

Is there a way to do this in Solr?

--
Thanks
Varun Gupta


   



 



 
   
   


Re: How do I this in Solr?

2010-10-27 Thread Mike Sokolov
Yes I missed that requirement (as Steven also pointed out in a private 
e-mail).  I now agree that the combinatorics are required.


Another possibility to consider (if the queries are large, which 
actually seems unlikely) is to use the default behavior where all terms 
are optional, sort by relevance, and truncate the result list on the 
client side after some unwanted term is found.  I *think* the scoring 
should find only docs with the searched-for terms first, although if 
there are a lot of repeated terms maybe not? Also result counts will be 
screwy.


-Mike

On 10/27/2010 09:34 AM, Toke Eskildsen wrote:

That does not work either as it requires that all the terms in the query
are present in the document. The original poster did not state this
requirement. On the contrary, his examples were mostly single-word
matches, implying an OR-search at the core.

The query-explosion still seems like the only working idea. Maybe Varun
could comment on the maximum numbers of terms that his queries will
contain?

Regards,
Toke Eskildsen

On Wed, 2010-10-27 at 15:02 +0200, Mike Sokolov wrote:
   

Right - my point was to combine this with the previous approaches to
form a query like:

samsung AND android AND GPS AND word_count:3

in order to exclude documents containing additional words. This would
avoid the combinatoric explosion problem others had alluded to earlier.
Of course this would fail because android is misspelled :)

-Mike

On 10/27/2010 08:45 AM, Steven A Rowe wrote:
 

I'm pretty sure the word-count strategy won't work.


   

If I search with the text samsung andriod GPS, search results
should only contain samsung, GPS, andriod and samsung andriod.

 

Using the word-count strategy, a document containing samsung andriod PDQ 
would be a hit, but Varun doesn't want it, because it contains a word that is not in the 
query.

Steve


   

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, October 27, 2010 7:44 AM
To: solr-user@lucene.apache.org
Subject: RE: How do I this in Solr?

You might try adding a field containing the word count and making sure
that
matches the query's word count?

This would require you to tokenize the query and document yourself,
perhaps.

-Mike


 

-Original Message-
From: Varun Gupta [mailto:varun.vgu...@gmail.com]
Sent: Tuesday, October 26, 2010 11:26 PM
To: solr-user@lucene.apache.org
Subject: Re: How do I this in Solr?

Thanks everybody for the inputs.

Looks like Steven's solution is the closest one but will lead
to performance issues when the query string has many terms.

I will try to implement the two filters suggested by Steven
and see how the performance matches up.

--
Thanks
Varun Gupta


On Wed, Oct 27, 2010 at 8:04 AM, scott chu
scott@udngroup.com wrote:


   

I think you have to write a yet exact match handler yourself (I mean
yet because it's not quite the exact match we normally know). Steve's
answer is quite near your request, and you can do further work based on
his solution.

At the last step, I'd suggest you strip all whitespace from the query
string and each query result, respectively, and only return those
results whose string length equals the query string's.

For example, given:
*query string = Samsung with GPS
*query results:
result 1 = Samsung has lots of mobile with GPS
result 2 = with GPS Samsung
result 3 = GPS mobile with vendors, such as Sony, Samsung

they become:
*query string = SamsungwithGPS (length = 14)
*query results:
result 1 = SamsunghaslotsofmobilewithGPS (length = 29)
result 2 = withGPSSamsung (length = 14)
result 3 = GPSmobilewithvendors,suchasSony,Samsung (length = 39)

so result 2 matches your request.

In this way, you can avoid the work of handling case sensitivity and
word-order rearrangement. Furthermore, you can do refined work, such as
removing whitespace characters, etc.

Scott @ Taiwan


- Original Message - From: Varun Gupta
varun.vgu...@gmail.com

To:solr-user@lucene.apache.org
Sent: Tuesday, October 26, 2010 9:07 PM

Subject: How do I this in Solr?


   Hi,

 

I have a lot of small documents (each containing 1 to 15 words) indexed
in Solr. For the search query, I want the search results to contain
only those documents that satisfy this criteria: all of the words of
the search result document are present in the search query.

For example:
If I have the following documents indexed: nokia n95, GPS,
android, samsung, samsung andriod, nokia andriod, mobile with GPS

If I search with the text samsung andriod GPS, search results
should only contain samsung, GPS, andriod and samsung andriod.

Is there a way to do this in Solr?

--
Thanks
Varun Gupta

Re: different results depending on result format

2010-10-22 Thread Mike Sokolov
Yes - I really only have the one solr instance.  And I have plenty of 
other cases where I am getting good results back via solrj.  It's really 
a mystery.  Unfortunately I have to catch up on other stuff I have been 
neglecting, but I'll follow up when I'm able to get a solution...


-Mike


On 10/22/2010 06:58 AM, Savvas-Andreas Moysidis wrote:

strange..are you absolutely sure the two queries are directed to the same
Solr instance? I'm running the same query from the admin page (which
specifies the xml format) and I get the exact same results as solrj.

On 21 October 2010 22:25, Mike Sokolov soko...@ifactory.com wrote:

   

quick follow-up: I also notice that the query from solrj gets version=1,
whereas the admin webapp puts version=2.2 on the query string, although this
param doesn't seem to change the xml results at all.  Does this indicate an
older version of solrj perhaps?

-Mike


On 10/21/2010 04:47 PM, Mike Sokolov wrote:

 

I'm experiencing something really weird: I get different results depending
on whether I specify wt=javabin, and retrieve using SolrJ, or wt=xml.  I
spent quite a while staring at query params to make sure everything else is
the same, and they do seem to be.  At first I thought the problem related to
the javabin format change that has been talked about recently, but I am
using solr 1.4.0 and solrj 1.4.0.

Notice in the two entries that the wt param is different and the hits
result count is different.

Oct 21, 2010 4:22:19 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select/
params={wt=xml&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1}
hits=261 status=0 QTime=1
Oct 21, 2010 4:22:28 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select
params={wt=javabin&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1}
hits=57 status=0 QTime=0


The xml format results seem to be the correct ones. So one thought I had
is that I could somehow fall back to using xml format in solrj, but I tried
SolrQuery.set('wt','xml') and that didn't have the desired effect (I get
'wt=javabinwt=javabin' in the log - ie the param is repeated, but still
javabin).


Am I crazy? Is this a known issue?

Thanks for any suggestions


   
   


Re: different results depending on result format

2010-10-22 Thread Mike Sokolov
OK I solved the problem.  It turns out that I was connecting to the 
server using its FQDN (rosen.ifactory.com).  When, instead, I connect to 
it using the name rosen (which maps to the same IP using the default 
domain name configured in my resolver, ifactory.com), I get results back.


I am looking into the virtual hosts config in tomcat; it seems as if 
there must indeed be another solr instance running; in fact I'm now 
concerned there might be two solr instances running against the same 
data folder. yargh.


-Mike


On 10/22/2010 09:05 AM, Mike Sokolov wrote:
Yes - I really only have the one solr instance.  And I have plenty of 
other cases where I am getting good results back via solrj.  It's 
really a mystery.  Unfortunately I have to catch up on other stuff I 
have been neglecting, but I'll follow up when I'm able to get a 
solution...


-Mike


On 10/22/2010 06:58 AM, Savvas-Andreas Moysidis wrote:
strange..are you absolutely sure the two queries are directed to the 
same

Solr instance? I'm running the same query from the admin page (which
specifies the xml format) and I get the exact same results as solrj.

On 21 October 2010 22:25, Mike Sokolov soko...@ifactory.com wrote:

quick follow-up: I also notice that the query from solrj gets 
version=1,
whereas the admin webapp puts version=2.2 on the query string, 
although this
param doesn't seem to change the xml results at all.  Does this 
indicate an

older version of solrj perhaps?

-Mike


On 10/21/2010 04:47 PM, Mike Sokolov wrote:

I'm experiencing something really weird: I get different results 
depending
on whether I specify wt=javabin, and retrieve using SolrJ, or 
wt=xml.  I
spent quite a while staring at query params to make sure everything 
else is
the same, and they do seem to be.  At first I thought the problem 
related to
the javabin format change that has been talked about recently, but 
I am

using solr 1.4.0 and solrj 1.4.0.

Notice in the two entries that the wt param is different and the hits
result count is different.

Oct 21, 2010 4:22:19 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select/
params={wt=xml&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1}


hits=261 status=0 QTime=1
Oct 21, 2010 4:22:28 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select
params={wt=javabin&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1}


hits=57 status=0 QTime=0


The xml format results seem to be the correct ones. So one thought 
I had
is that I could somehow fall back to using xml format in solrj, but 
I tried
SolrQuery.set('wt','xml') and that didn't have the desired effect 
(I get
'wt=javabin&wt=javabin' in the log - i.e. the param is repeated, but 
still

javabin).


Am I crazy? Is this a known issue?

Thanks for any suggestions




Re: different results depending on result format

2010-10-21 Thread Mike Sokolov
quick follow-up: I also notice that the query from solrj gets version=1, 
whereas the admin webapp puts version=2.2 on the query string, although 
this param doesn't seem to change the xml results at all.  Does this 
indicate an older version of solrj perhaps?


-Mike

On 10/21/2010 04:47 PM, Mike Sokolov wrote:
I'm experiencing something really weird: I get different results 
depending on whether I specify wt=javabin, and retrieve using SolrJ, 
or wt=xml.  I spent quite a while staring at query params to make sure 
everything else is the same, and they do seem to be.  At first I 
thought the problem related to the javabin format change that has been 
talked about recently, but I am using solr 1.4.0 and solrj 1.4.0.


Notice in the two entries that the wt param is different and the hits 
result count is different.


Oct 21, 2010 4:22:19 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select/ 
params={wt=xml&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1} 
hits=261 status=0 QTime=1

Oct 21, 2010 4:22:28 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select 
params={wt=javabin&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1} 
hits=57 status=0 QTime=0



The xml format results seem to be the correct ones. So one thought I 
had is that I could somehow fall back to using xml format in solrj, 
but I tried SolrQuery.set('wt','xml') and that didn't have the desired 
effect (I get 'wt=javabin&wt=javabin' in the log - i.e. the param is 
repeated, but still javabin).



Am I crazy? Is this a known issue?

Thanks for any suggestions



Re: How to delete a SOLR document if that particular data doesnt exist in DB?

2010-10-20 Thread Mike Sokolov
Since you are performing a complete reload of all of your data, I don't 
understand why you can't create a new core, load your new data, swap 
your application to look at the new core, and then erase the old one, if 
you want.


Even so, you could track the timestamps on all your documents, which 
will be updated when you update the content.  Then when you're done you 
could delete anything with a timestamp prior to the time you started the 
latest import.
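Assuming the usual default="NOW" timestamp date field from the example
schema, that cleanup is then just (a sketch):

import org.apache.solr.client.solrj.SolrServer;

public class PurgeStale {
    // delete everything not touched since the import started
    static void purge(SolrServer solr, String importStartIso) throws Exception {
        solr.deleteByQuery("timestamp:[* TO " + importStartIso + "]");
        solr.commit();
    }
}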


-Mike

On 10/20/2010 11:59 AM, bbarani wrote:

ironicnet,

Thanks for your reply.

We actually use a virtual DB modelling tool to fetch the data from various
sources at run time, hence we don't have any control over the source.

We consolidate the data from more than one source and index the consolidated
data using SOLR. We don't have any kind of update / access rights to the
source data.

Thanks.
Barani