Re: WordDelimiterFilterFactory + CamelCase query

2010-11-18 Thread Ken Stanley
On Thu, Nov 18, 2010 at 3:22 PM, Peter Karich peat...@yahoo.de wrote:

 Hi,

 Please add preserveOriginal="1" to your WDF [1] definition and reindex
 (or
 just try with the analysis page).

 but it is already there!?

 <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
                         generateWordParts="1" generateNumberParts="1"
 catenateAll="0" preserveOriginal="1"/>


 Regards,
 Peter.


Peter,

I recently had this issue, and I had to set splitOnCaseChange="0" to
keep the word delimiter filter from splitting on the case changes you
describe. Can you try that and see if it helps?
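For illustration, a sketch of the full filter line with that option
added (all other attributes kept exactly as you posted them):

    <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="0" preserveOriginal="1" splitOnCaseChange="0"/>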

- Ken


Re: Reindex Solr Using Tomcat

2010-11-18 Thread Ken Stanley
On Thu, Nov 18, 2010 at 3:33 PM, Eric Martin e...@makethembite.com wrote:
 Hi,



 I searched google and the wiki to find out how I can force a full re-index
 of all of my content and I came up with zilch. My goal is to be able to
 adjust the weight settings, re-index  my entire database and then search my
 site and view the results of my weight adjustments.



 I am using Tomcat 5.x and Solr 1.4.1. Weird how I couldn't find this info. I
 must have missed it. Anyone know where to find it?



 Eric


Eric,

Which method you use to re-index SOLR depends on how your data gets
into it. You can either POST an XML update message to the
UpdateHandler [1], or you can use the DataImportHandler (DIH) [2].
Other means exist, but these two should be sufficient to get started.
How did you import your initial index in the first place?

[1] http://wiki.apache.org/solr/UpdateXmlMessages
[2] http://wiki.apache.org/solr/DataImportHandler
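As a minimal sketch of option [1] (the file and field names here are
hypothetical), you would POST an update message and then commit:

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">An example document</field>
      </doc>
    </add>

    curl 'http://localhost:8983/solr/update?commit=true' \
         -H 'Content-Type: text/xml' --data-binary @docs.xml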


Re: Reindex Solr Using Tomcat

2010-11-18 Thread Ken Stanley
On Thu, Nov 18, 2010 at 3:42 PM, Eric Martin e...@makethembite.com wrote:
 Ah, I am using an ApacheSolr module in Drupal and used nutch to insert the
 data into the Solr index. When I was using Jetty I could just delete the
 data contents via ssh and then restart the service, forcing the reindex.

 Currently, the ApacheSolr module for Drupal allows for a 200-record re-index
 every cron run, but that is too slow for me. During implementation and testing
 I would prefer to re-index the entire database, as I have over 400k records.

 I appreciate your help. My mind was searching for a command on the CLI that 
 would just tell solr to reindex the entire dbase and be done with it.


Eric,

From what I could find, this looks to be your best bet:
http://drupal.org/node/267543.

- Ken


Re: How do I format this query with 2 search terms?

2010-11-17 Thread Ken Stanley
2010/11/17 Jón Helgi Jónsson jonjons...@gmail.com:
 I'm using index time boosting and need to specify every field I want
 to search (not use copy fields) or else the boosting wont work.

 This query with 1 search term works fine, boosts look good:

 http://localhost:8983/solr/select/?
 q=companyName:foo
 +descriptionTxt:verslun
 &fl=*%20score&rows=10&start=0

 However if I have 2 words in the query and do it like this boosting
 seems not to be working

 http://localhost:8983/solr/select/?
 q=companyName:foo+bar
 +descriptionTxt:foo+bar
 &fl=*%20score&rows=10&start=0

 It's probably using the default search field for the second word, which
 has no boosting configured. How do I go about this?

 Thanks,
 Jon


Jon,

You have a few options here, depending on what you want to achieve
with your query:

1. If you're trying to do a phrase query, you simply need to ensure
that your phrases are quoted. The default behavior in SOLR is to split
the phrase into multiple chunks. If a word is not preceded with a
field definition, then SOLR will automatically apply the word(s) as if
you had specified the default field. So for your example, SOLR would
parse your query into companyName:foo defaultField:bar
descriptionTxt:foo defaultField:bar.
2. You can use the dismax query plugin instead of the standard query
plugin. You simply configure the dismax section of your solrconfig.xml
to your liking - you define which fields to search, apply any special
boosts for your needs, etc
(http://wiki.apache.org/solr/DisMaxQParserPlugin) - and then you
simply feed the query terms without naming your fields (i.e.,
q=foo+bar), along with telling SOLR to use dismax (i.e.,
qt=whatever_you_named_your_dismax_handler).
3. If phrase queries are not important to you, you can manually prefix
each term in your query with the field you wish to search; for
example, you would do companyName:foo companyName:bar
descriptionTxt:foo descriptionTxt:bar. (Sample URLs for all three
options follow below.)
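For illustration, hedged sample URLs for the three options (the host,
handler name, and fields are from your example or hypothetical):

    1. http://localhost:8983/solr/select/?q=companyName:%22foo+bar%22+descriptionTxt:%22foo+bar%22
    2. http://localhost:8983/solr/select/?q=foo+bar&qt=dismax
    3. http://localhost:8983/solr/select/?q=companyName:foo+companyName:bar+descriptionTxt:foo+descriptionTxt:bar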

Whichever way you decide to go, the best thing that you can do to
understand SOLR and how it's working in your environment is to append
debugQuery=on to the end of your URL; this tells SOLR to output
information about how it parsed your query, how long each component
took to run, and some other useful debugging information. It's very
useful, and has come in handy several times at work when I wanted to
know why SOLR returned (or didn't return) the results that I expected.

I hope this helps.

- Ken


Re: ranged and boolean query

2010-11-17 Thread Ken Stanley
On Wed, Nov 17, 2010 at 10:39 AM, Peter Blokland pe...@desk.nl wrote:
 hi.

 i'm using solr and am trying to limit my resultset to documents
 that either have a publication date in the range * to now, or
 have no publication date set at all (field is not present).
 however, using this :

 (pubdate:[* TO NOW]) OR ( NOT pubdate:*)

 gives me only the documents in the range * to now (reversing the
 two clauses has no effect). using only

 NOT pubdate:*

 gives me the correct set of documents (those not having a pubdate).
 any reason the OR does not work in this case ?

 ps: also tried it like this :

 pubdate:([* TO NOW] OR (NOT *))

 which gives the same result.


 --
 CUL8R, Peter.

 www.desk.nl --- Sent from my NetBSD-powered Talkie Toaster™


Peter,

Instead of using NOT, try simply prefixing the field name with a minus
sign. This tells SOLR to exclude the field. Otherwise, the word NOT
would be treated as a term, and would be applied against your default
field (which may or may not affect your results). So instead of
(pubdate:[* TO NOW]) OR ( NOT pubdate:*), you would write (pubdate:[*
TO NOW]) OR ( -pubdate:*).

- Ken


Re: ranged and boolean query

2010-11-17 Thread Ken Stanley
On Wed, Nov 17, 2010 at 11:00 AM, Peter Blokland pe...@desk.nl wrote:
 hi,

 On Wed, Nov 17, 2010 at 10:54:48AM -0500, Ken Stanley wrote:

  pubdate:([* TO NOW] OR (NOT *))

 Instead of using NOT, try simply prefixing the field name with a minus
 sign. This tells SOLR to exclude the field. Otherwise, the word NOT
 would be treated as a term, and would be applied against your default
 field (which may or may not affect your results). So instead of
 (pubdate:[* TO NOW]) OR ( NOT pubdate:*), you would write (pubdate:[*
 TO NOW]) OR ( -pubdate:*).

 tried that, it gives me exactly the same result... I can't really
 figure out what's going on.

 --
 CUL8R, Peter.

 www.desk.nl --- Sent from my NetBSD-powered Talkie Toaster™


If you append your URL with debugQuery=on, it will tell you how SOLR
parsed your query. What's your schema look like? And what does the
debug query look like?
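One general note - this is standard Lucene query-parser behavior, not
something specific to Peter's setup - a purely negative clause inside
an OR often matches nothing on its own, so it usually needs to be
anchored to an explicit match-all query:

    (pubdate:[* TO NOW]) OR (*:* -pubdate:*)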


Re: DIH for multilingual index multiValued field?

2010-11-13 Thread Ken Stanley
On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan iori...@yahoo.com wrote:
 For (1) you probably need to write a custom transformer. Something like:

 public Object transformRow(Map<String, Object> row) {
     String language_code = (String) row.get("language_code");
     String text = (String) row.get("text");
     if ("en".equals(language_code))
         row.put("text_en", text);
     else if ("fr".equals(language_code))
         row.put("text_fr", text);

     return row;
 }


 For (2), it doable with regex transformer.

 <field column="mailId" splitBy="," sourceColName="emailids"/>
 The 'emailids' field in the table can be a comma separated value. So it ends
 up giving out one or more than one email ids and we expect the 'mailId' to be
 a multivalued field in Solr. [1]

 [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer


In my opinion, I think that this is a bit of overkill. Since the DIH
supports multiple entities, with no real limit on the SQL queries, I
think that the easiest (and less involved) approach would be to create
three entities for the languages the OP wishes to index:

<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
    <field name="text_en" column="text" />
    <field name="tags" column="tags" splitBy="," />
</entity>

<entity name="french" query="SELECT * FROM documents WHERE
language_code='fr'" transformer="RegexTransformer">
    <field name="text_fr" column="text" />
    <field name="tags" column="tags" splitBy="," />
</entity>

<entity name="chinese" query="SELECT * FROM documents WHERE
language_code='zh'" transformer="RegexTransformer">
    <field name="text_zh" column="text" />
    <field name="tags" column="tags" splitBy="," />
</entity>

But, I admit that depending on future growth of languages, as well as
other factors (i.e., needing more specific logic, etc), a programmatic
approach might be warranted.

I would recommend, however, that the database table be a little more
normalized. Your definition for tags is quite limiting, and could be
better served using a many-to-many relationship. Something like the
following might serve you well:

   CREATE TABLE documents (
   id INT NOT NULL AUTO_INCREMENT,
   language_code CHAR(2),
   tags CHAR(30),
   text TEXT,
   PRIMARY KEY (id)
   );

   CREATE TABLE document_tags (
   id INT NOT NULL AUTO_INCREMENT,
   tag CHAR(30),
   PRIMARY KEY (id)
   );

   CREATE TABLE document_tag_lookup (
   document_id INT NOT NULL,
   tag_id INT NOT NULL,
   PRIMARY KEY (document_id, tag_id)
   );

Then in the DIH, you simply nest a second entity to look up the zero
or more tags that might be associated with your documents; take the
english entity from above:

<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
    <field name="text_en" column="text" />

    <entity name="english_tags" query="SELECT * FROM document_tags dt
    INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND
    dtl.document_id='${english.id}')">
        <field name="tags" column="tag" />
    </entity>
</entity>

This would allow for growth, and is easy to maintain. Additionally, if
you wanted to implement a custom transformer of your own, you could.
As an aside, a sort of compromise, you could also use the
ScriptTransformer [1] to create a Javascript function that can do your
language logic and create the necessary fields, and not have to worry
about maintaining any custom Java code.

[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

- Ken


Re: DIH for multilingual index multiValued field?

2010-11-13 Thread Ken Stanley
On Sat, Nov 13, 2010 at 5:59 PM, Ken Stanley doh...@gmail.com wrote:
   CREATE TABLE documents (
       id INT NOT NULL AUTO_INCREMENT,
       language_code CHAR(2),
       tags CHAR(30),
       text TEXT,
       PRIMARY KEY (id)
   );

I apologize, but I couldn't leave the typo in my last post without a
follow up; it might cause confusion. I copied the OP's original table
definition and forgot to remove the tags field. My proposed definition
for the documents table should be:

  CREATE TABLE documents (
  id INT NOT NULL AUTO_INCREMENT,
  language_code CHAR(2),
  text TEXT,
  PRIMARY KEY (id)
  );

- Ken


Re: scheduling imports and heartbeats

2010-11-10 Thread Ken Stanley
On Tue, Nov 9, 2010 at 10:16 PM, Tri Nguyen tringuye...@yahoo.com wrote:
 Hi,

 Can I configure solr to schedule imports at a specified time (say once a day,
 once an hour, etc)?

 Also, does solr have some sort of heartbeat mechanism?

 Thanks,

 Tri

Tri,

If you use the DataImportHandler (DIH), you can set up a
dataimport.properties file that can be configured to import on
intervals.

http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example

As for heartbeat, you can use the ping handler (default is
/admin/ping) to check the status of the servlet.
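
As a rough sketch (not from the wiki page above), one simple way to
schedule imports externally is a cron entry that hits the DIH endpoint
on an interval; the URL and handler path here are hypothetical:

    0 * * * * curl -s 'http://localhost:8983/solr/dataimport?command=delta-import' > /dev/null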

- Ken


Re: Best practice for emailing this list?

2010-11-10 Thread Ken Stanley
On Wed, Nov 10, 2010 at 1:11 PM, robo - robom...@gmail.com wrote:
 How do people email this list without getting spam filter problems?


Depends on which side of the spam filter you're referring to. I've
found that the way to keep these emails out of my spam filter is to
add a rule to Gmail that says "Never send to spam". As for when I send
emails, I make sure that I send them as plain text to avoid getting
bounce backs.

- Ken


Re: dynamically create unique key

2010-11-09 Thread Ken Stanley
On Tue, Nov 9, 2010 at 10:39 AM, Christopher Gross cogr...@gmail.com wrote:
 I'm trying to use Solr to store information from a few different sources in
 one large index.  I need to create a unique key for the Solr index that will
 be unique per document.  If I have 3 systems, and they all have a document
 with id=1, then I need to create a uniqueId field in my schema that
 contains both the system name and that id, along the lines of: sysa1,
 sysb1, and sysc1.  That way, each document will have a unique id.

 I added this to my schema.xml:

  <copyField source="source" dest="uniqueId"/>
  <copyField source="id" dest="uniqueId"/>


 However, after trying to insert, I got this:
 java.lang.Exception: ERROR: multiple values encountered for non multiValued
 copy field uniqueId: sysa

 So instead of just appending to the uniqueId field, it tried to do a
 multiValued.  Does anyone have an idea on how I can make this work?

 Thanks!

 -- Chris


Chris,

How you insert your documents into SOLR determines how you create your
unique field. If you are POST'ing the data via HTTP, then you would be
responsible for building your unique id (i.e., your program/language
would use string concatenation to add the unique id to the output
before it gets to the update handler in SOLR). If you're using the
DataImportHandler, then you can use the TemplateTransformer
(http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer) to
dynamically build your unique id at document insertion time.

For example, we here at bizjournals use SOLR and the DataImportHandler
to index our documents. Like you, we run the risk of two or more ids
clashing, and thus overwriting a different type of document. As such,
we take two or three different fields and combine them together using
the TemplateTransformer to generate a more unique id for each document
we index.
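
As a rough sketch of that approach (the entity, column, and field names
here are hypothetical, not our production configuration):

    <entity name="docs" query="SELECT id, source, title FROM documents"
            transformer="TemplateTransformer">
        <field column="uniqueId" template="${docs.source}${docs.id}" />
    </entity>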

With respect to the multiValued option, that is used more for an
array-like structure within a field. For example, if you have a blog
entry with multiple tag keywords, you would probably want a field in
SOLR that can contain the various tag keywords for each blog entry;
this is where multiValued comes in handy.

I hope that this helps to clarify things for you.

- Ken Stanley


Re: dynamically create unique key

2010-11-09 Thread Ken Stanley
On Tue, Nov 9, 2010 at 10:53 AM, Christopher Gross cogr...@gmail.com wrote:
 Thanks Ken.

 I'm using a script with Java/SolrJ to copy documents from their original
 locations into the Solr Index.

 I wasn't sure if the copyField would help me, but from your answers it seems
 that I'll have to handle it on my own.  That's fine -- it is definitely not
 hard to pass a new field myself.  I was just thinking that there should be
 an easy way to have Solr build the unique field, since it was getting
 everything anyway.

 I was just confused as to why I was getting a multiValued error, since I was
 just trying to append to a field.  I wasn't sure if I was missing something.

 Thanks again!

 -- Chris


Chris,

I definitely understand your sentiment. The thing to keep in mind with
SOLR is that it really has limited logic mechanisms; in fact, unless
you're willing to use the DataImportHandler (dih) and the
ScriptTransformer, you really have no logic.

The copyField directive in schema.xml is mainly used to help you
easily copy the contents of one field into another so that it may be
indexed in multiple ways; for example, you can index a string so that
it is stored literally (i.e., "Hello World"), parsed using a
whitespace tokenizer (i.e., "Hello", "World"), or parsed for an nGram
tokenizer (i.e., "H", "He", "Hel", ...). This is beneficial because
you wouldn't have to explicitly define each possible instance in your
data stream. You just define the field once, and SOLR is smart enough
to copy it where it needs to go.
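
For illustration, a minimal sketch of that pattern (field and type
names hypothetical):

    <field name="title" type="string" indexed="true" stored="true" />
    <field name="title_text" type="text" indexed="true" stored="false" />
    <copyField source="title" dest="title_text" />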

Glad to have helped. :)

- Ken


Re: spell check vs terms component

2010-11-09 Thread Ken Stanley
On Tue, Nov 9, 2010 at 1:02 PM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Tue, Nov 9, 2010 at 8:20 AM, bbarani bbar...@gmail.com wrote:


 Hi,

 We are trying to implement auto suggest feature in our application.

 I would like to know the difference between terms vs spell check component.

 Both the handlers seems to display almost the same output, can anyone let
 me
 know the difference and also I would like to know when to go for spell
 check
 and when to go for terms component.


 SpellCheckComponent is designed to operate on whole words and not partial
 words so I don't know how well it will work for auto-suggest, if at all.

 As far as differences between SpellCheckComponent and Terms Component is
 concerned, TermsComponent is a straight prefix match whereas SCC takes edit
 distance into account. Also, SCC can deal with phrases composed of multiple
 words and also gives back a collated suggestion.

 --
 Regards,
 Shalin Shekhar Mangar.


An alternative to using the SpellCheckComponent and/or the
TermsComponent would be the (Edge)NGrams filter. Basically, this
filter breaks words down into auto-suggest-friendly tokens (i.e.,
"Hello" => "H", "He", "Hel", "Hell", "Hello") that work great for
auto-suggestion querying.

Here is an article from Lucid Imagination on using the ngram filter:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
Here is the SOLR wiki entry for the filter:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
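
For reference, a minimal sketch of a field type built around the edge
n-gram filter (the type name and gram sizes are hypothetical):

    <fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldType>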

- Ken Stanley


Re: Fixed value in dataimporthandler

2010-11-08 Thread Ken Stanley
On Mon, Nov 8, 2010 at 3:50 PM, Renato Wesenauer
renato.wesena...@gmail.com wrote:

 Hi Ahmet Arslan,

 I'm using this in schema.xml:
 <field name="secao" type="cleannormalized_text" indexed="true"
 stored="true"/>
 <field name="indativo" type="boolean" indexed="true" stored="true"/>

 I'm using this in dataimporthandler:
 <field column="secao" xpath="/ROW/NomeSecaoMix" />
 <field column="indativo" template="0" />

 The indexing process work correctly, but it's happening something wrong with
 the results of queries.

 All queries with some field with 2 words or more, plus the field
 indativo:true, it isn't returning any result.

 Example of queries:

 1º) secao:"accessories for cars" AND indativo:true
 2º) secao:"accessories for cars" AND indativo:false

 The first query returns 0 results, but there are 40.000 documents indexed
 with these fields.
 The second query returns 300.000 documents, but 300.000 is the total of
 documents for the query secao:"celular e telefonia"; the correct number
 would be 260.000.

 Another example:
 1º) secao:toys AND indativo:true
 2º) secao:toys AND indativo:false

 In this example, the two queries work correctly.

 The problem happens with values with 2 words or more, plus the indativo
 field.

 Do you know what can be happening?

 Thank you,

 Renato F. Wesenauer


Renato,

Correct me if I'm wrong, but you have an entity that you explicitly
set to a false value for the indativo field. And when you query, is
your intention to find the documents that were not indexed through that
entity? The way that I am reading your question is that you are
expecting the indativo field to be true by default, but I do not see
where you're explicitly stating that in your schema. The reason that I
bring this up is - and I could be wrong - I would think that if you do
not set a value in SOLR, then it doesn't exist (either in the schema,
or during indexing). If you are expecting the other entries where
indativo was explicitly set to false to be true, you might need to
tweak your schema so that the field definition is by default true.
Is it possible to try adding the default attribute to your field
definition and reindexing to see if that gives you what you're looking
for?
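
For illustration, a sketch of that change (assuming you do want
documents with no explicit value to be treated as true):

    <field name="indativo" type="boolean" indexed="true" stored="true" default="true"/>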

- Ken Stanley

PS. If this came through twice, I apologize; I got a bounce-back
saying my original reply was blocked, so I'm trying to re-send as
plain text.


Re: Tomcat special character problem

2010-11-07 Thread Ken Stanley
On Sun, Nov 7, 2010 at 9:11 AM, Em mailformailingli...@yahoo.de wrote:


 Hi List,

 I got an issue with my Solr-environment in Tomcat.
 First: I am not very familiar with Tomcat, so it might be my fault and not
 Solr's.

 It can not be a solr-side configuration problem, since everything worked
 fine with my local Jetty-servlet container.

 However, when I deploy into Tomcat, several special characters were shown
 in
 their utf-8 representation.

 Example:
 göteburg will be displayed as <str name="q">göteburg</str> when it
 comes to search.

 I tried the following within my server.xml-file

<Connector port="8080" protocol="HTTP/1.1"
   connectionTimeout="2"
   redirectPort="8443"
   URIEncoding="UTF-8" />

 And restarted Tomcat afterwards.

 The problem only occurs when I try to search for something.
 It is no problem to index that data.

 Thank you for any help!

 Regards,
 Em


That is definitely odd. When I tried copying göteburg and doing a manual
query in my web browser, everything worked. How are you making the request
to SOLR? When I viewed the properties/info of the results, my returned
charset was in UTF-8. Can you confirm similar for you?

When I grepped for UTF-8 in both my SOLR and Tomcat configs, nothing stood
out as a special configuration option.


Re: Tomcat special character problem

2010-11-07 Thread Ken Stanley
On Sun, Nov 7, 2010 at 9:34 AM, Em mailformailingli...@yahoo.de wrote:


 Hi Ken,

 thank you for your quick answer!

 To make sure that there occurs no mistakes at my application's side, I send
 my requests with the form that is available at solr/admin/form.jsp

 I changed almost nothing from the example-configurations within the
 example-package except some auto-commit params.

 All the special-characters within the results were displayed correctly, and
 so far they were also indexed correctly.
 The only problem is querying with special-characters.

 I can confirm that the page is encoded in UTF-8 within my browser.

 Is there a possibility that Tomcat did not use the UTF-8 URIEncoding?
 Maybe I should say that Tomcat is behind an Apache HttpdServer and is
 mounted by a jk_mount.

 Thank you!


I am not familiar with using your type of set up, but a quick Google search
suggested using a second connector on a different port. If you're using
mod_jk, you can try setting JkOptions +ForwardURICompatUnparsed to see if
that helps. (
http://markstechstuff.blogspot.com/2008/02/utf-8-problem-between-apache-and-tomcat.html).
Sorry I couldn't have been more help. :)

- Ken


Re: querying multiple fields as one

2010-11-04 Thread Ken Stanley
On Thu, Nov 4, 2010 at 8:21 AM, Tommaso Teofili
tommaso.teof...@gmail.comwrote:

 Hi all,
 having two fields named 'type' and 'cat' with identical type and options,
 but different values recorded, would it be possible to query them as they
 were one field?
 For instance
  q=type:electronics cat:electronics
 should return same results as
  q=common:electronics
 I know I could make it defining a third field 'common' with copyFields from
 'type' and 'cat' to 'common' but this wouldn't be feasible if you've
 already
 lots of documents in your index and don't want to reindex everything, isn't
 it?
 Any suggestions?
 Thanks in advance,
 Tommaso


Tommaso,

If re-indexing is not feasible/preferred, you might try looking into
creating a dismax handler that should give you what you're looking for in
your query: http://wiki.apache.org/solr/DisMaxQParserPlugin. The same
solrconfig.xml that comes with SOLR has a dismax parser that you can modify
to your needs.
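
If it helps, a rough sketch of such a handler (the handler name and
defaults are hypothetical; adjust the qf list to your schema):

    <requestHandler name="common" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="defType">dismax</str>
            <str name="qf">type cat</str>
        </lst>
    </requestHandler>

You would then query it with something like q=electronics&qt=common.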

- Ken Stanley


Re: Phrase Query Problem?

2010-11-02 Thread Ken Stanley
On Tue, Nov 2, 2010 at 8:19 AM, Erick Erickson erickerick...@gmail.comwrote:

 That's not the response I get when I try your query, so I suspect
 something's not quite right with your test...

 But you could also try putting parentheses around the words, like
 mykeywords:(Compliance+With+Conduct+Standards)

 Best
 Erick


I agree with Erick, your query string showed quotes, but your parsed query
did not. Using quotes, or parenthesis, would pretty much leave your query
alone. There is one exception that I've found: if you use a stopword
analyzer, any stop words would be converted to ? in the parsed query. So if
you absolutely need every single word to match, regardless, you cannot use a
field type that uses the stop word analyzer.

For example, I have two dynamic field definitions: df_text_* that does the
default text transformations (including stop words), and df_text_exact_*
that does nothing (field type is string). When I run the
query df_text_exact_company_name:"Bank of America" OR
df_text_company_name:"Bank of America", the following is shown as my
query/parsed query when debugQuery is on:

<str name="rawquerystring">
df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank
of America"
</str>
<str name="querystring">
df_text_exact_company_name:"Bank of America" OR df_text_company_name:"Bank
of America"
</str>
<str name="parsedquery">
df_text_exact_company_name:Bank of America
PhraseQuery(df_text_company_name:"bank ? america")
</str>
<str name="parsedquery_toString">
df_text_exact_company_name:Bank of America df_text_company_name:"bank ?
america"
</str>

The difference is subtle, but important. If I were to do
df_text_company_name:"Bank and America", I would still match "Bank of
America". These are things that you should keep in mind when you are
creating fields for your indices.

A useful tool for seeing what SOLR does to your query terms is the Analysis
tool found in the admin panel. You can do an analysis on either a specific
field, or by a field type, and you will see a breakdown by Analyzer for
either the index, query, or both of any query that you put in. This would
definitely be useful when trying to determine why SOLR might return what it
does.

- Ken


Highlighting and maxBooleanClauses limit

2010-11-02 Thread Ken Stanley
[...] Is there a way to determine how exactly the highlighter is
building its query (i.e., some sort of highlighting debug setting)? Is
the behavior of highlighting in SOLR intended to be held to the same
restrictions (maxBooleanClauses) as the query parser (even though the
highlighting query is built internally)?

I am not a SOLR expert by any measure of the word, and as such, I just don't
understand how two words on one field (as noted by the use of
hl.fl=df_text_content + hl.requireFieldMatch=true +
hl.usePhraseHighlighter=true) could somehow exceed the limits of both 1024
and 2048. I am concerned that even if I continue increasing
maxBooleanClauses, I am not actually solving anything; in fact, my concern
is that if I were to keep increasing this limit, I am in fact begging for
problems later on down the road.

For the sake of completeness, here are the definitions of the field I'm
highlighting on (schema.xml):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory"
                language="English" protected="protwords.txt" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms/synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory"
                language="English" protected="protwords.txt" />
    </analyzer>
</fieldType>

<dynamicField name="df_text_*" type="text" indexed="true" stored="true" />

<solrQueryParser defaultOperator="OR" />

And here is my highlighter definition (solrconfig.xml):

<highlighting>
    <!-- Configure the standard fragmenter -->
    <!-- This could most likely be commented out in the default case -->
    <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter"
                default="true">
        <lst name="defaults">
            <int name="hl.fragsize">255</int>
        </lst>
    </fragmenter>

    <!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
    <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
        <lst name="defaults">
            <!-- slightly smaller fragsizes work better because of slop -->
            <int name="hl.fragsize">70</int>
            <!-- allow 50% slop on fragment sizes -->
            <float name="hl.regex.slop">0.5</float>
            <!-- a basic sentence pattern -->
            <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
        </lst>
    </fragmenter>

    <!-- Configure the standard formatter -->
    <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter"
               default="true">
        <lst name="defaults">
            <str name="hl.simple.pre"><![CDATA[<em>]]></str>
            <str name="hl.simple.post"><![CDATA[</em>]]></str>
        </lst>
    </formatter>
</highlighting>

It is worth noting that I have not done anything (except formatting) to the
highlighting configuration in solrconfig.xml. Any help, assistance, and/or
guidance that can be provided would be greatly appreciated.

Thank you,

Ken Stanley

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


Re: Highlighting and maxBooleanClauses limit

2010-11-02 Thread Ken Stanley
On Tue, Nov 2, 2010 at 11:26 AM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 (10/11/02 23:14), Ken Stanley wrote:

 I've noticed in the stack trace that this exception occurs when trying to
 build the query for the highlighting; I've confirmed this by copying the
 params and changing hl=true to hl=false. Unfortunately, when using
 debugQuery=on, I do not see any details on what is going on with the
 highlighting portion of the query (after artificially increasing the
 maxBooleanClauses so the query will run).

 With all of that said, my question(s) to the list are: Is there a way to
 determine how exactly the highlighter is building its query (i.e., some
 sort
 of highlighting debug setting)?


 Basically I think the highlighter uses the main query, but tries to
 rewrite it before highlighting.


  Is the behavior of highlighting in SOLR
 intended to be held to the same restrictions (maxBooleanClauses) as the
 query parser (even though the highlighting query is built internally)?


 I think so because maxBooleanClauses is a static variable.

 I saw your stack trace and glanced at the highlighter source;
 my assumption is - the highlighter tried to rewrite (expand) your
 range queries to boolean queries, even if you set requireFieldMatch to true.

 Can you try to query without the range query? If the problem goes away,
 I think it is highlighter bug. Highlighter should skip the range query
 when user set requireFieldMatch to true, because your range query is for
 another field. If so, please open a jira issue.

 Koji
 --
 http://www.rondhuit.com/en/


Koji, that is most excellent. Thank you for pointing out that the range
queries were causing the highlighter to exceed the maxBooleanClauses. Once I
removed them from my main query (and moved them into separate filter
queries), SOLR and highlighting worked as I expected them to work.

Per your suggestion, I have opened a JIRA ticket (SOLR-2216) for this
problem. I am somewhat a novice at Java, and I have not yet had the pleasure
of getting the SOLR sources in my working environment, but I would be more
than eager to potentially assist in finding a solution - with maybe some
mentoring from a more experienced developer.

Anyway, thank you again, I am very excited to have a suitable work around
for the time being.

- Ken Stanley


Re: Phrase Query Problem?

2010-11-01 Thread Ken Stanley
On Mon, Nov 1, 2010 at 10:26 PM, Tod listac...@gmail.com wrote:

 I have a number of fields I need to do an exact match on.  I've defined
 them as 'string' in my schema.xml.  I've noticed that I get back query
 results that don't have all of the words I'm using to search with.

 For example:


 q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))&start=0&indent=true&wt=json

 Should, with an exact match, return only one entry but it returns five some
 of which don't have any of the fields I've specified.  I've tried this both
 with and without quotes.

 What could I be doing wrong?


 Thanks - Tod



Tod,

Without knowing your exact field definition, my first guess would be your
first boolean query; because it is not quoted, what SOLR typically does is
to transform that type of query into something like (assuming your default
search field is id): (mykeywords:Compliance id:With id:Conduct
id:Standards). If you do (mykeywords:"Compliance+With+Conduct+Standards")
you might see different (better?) results. Otherwise, append debugQuery=on
to your URL and you can see exactly how SOLR is parsing your query. If none
of that helps, what is your field definition in your schema.xml?

- Ken


Re: indexing '-

2010-10-31 Thread Ken Stanley
On Sun, Oct 31, 2010 at 12:12 PM, PeterKerk vettepa...@hotmail.com wrote:


 I have a city named 's-Hertogenbosch

 I want it to be indexed exactly like that, so 's-Hertogenbosch (without
 quotes)

 But now I get:
 <lst name="city">
    <int name="hertogenbosch">1</int>
    <int name="s">1</int>
    <int name="shertogenbosch">1</int>
 </lst>

 What filter should I add/remove from my field definition?

 I already tried a new fieldtype with just this, but no luck:
 <fieldType name="exacttext" class="solr.TextField"
 positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
 ignoreCase="true" expand="false"/>
   </analyzer>
 </fieldType>


 My schema.xml

 <fieldType name="textTight" class="solr.TextField"
 positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
 ignoreCase="true" expand="false"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords_dutch.txt" />
     <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="0" generateNumberParts="0" catenateWords="1"
 catenateNumbers="1" catenateAll="0"/>
     <filter class="solr.ISOLatin1AccentFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="Dutch"
 protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>

 <field name="city" type="textTight" indexed="true" stored="true"/>








For exact text, you should try using either the string type, or a type that
only uses the KeywordTokenizer. Other field types may perform
transformations on the text similar to what you are seeing.
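
A minimal sketch of such a type (the name is hypothetical):

    <fieldType name="exact" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
        </analyzer>
    </fieldType>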

- Ken


Re: If I want to move a core from one physical machine to another....

2010-10-28 Thread Ken Stanley
On Wed, Oct 27, 2010 at 6:12 PM, Ron Mayer r...@0ape.com wrote:

 If I want to move a core from one physical machine to another,
 is it as simple as just
   scp -r core5 otherserver:/path/on/other/server/
 and then adding
    <core name="core5name" instanceDir="core5" />
 on that other server's solr.xml file and restarting the server there?



 PS: Should I have been able to figure out the answer to that
    by RTFM somewhere?


Ron,

In our current environment I index all of our data on one machine, and to
save time with replication, I use scp to copy the data directory over to
our other servers. On the server that I copy from, I don't turn SOLR off,
but on the servers that I copy to, I shutdown tomcat; remove the data
directory; mv the data directory I scp'd from the source; turn tomcat back
on. I do it this way (especially with mv, versus cp) because it is the
fastest way to get the data on the other servers. And, as Gora pointed out,
you need to make sure that your configuration files match (specifically the
schema.xml) the source.
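
A rough sketch of the procedure I described, with hypothetical paths
and service names:

    # on the source server (SOLR left running)
    scp -r /opt/solr/data otherserver:/tmp/solr-data

    # on the destination server
    service tomcat stop
    rm -rf /opt/solr/data
    mv /tmp/solr-data /opt/solr/data
    service tomcat start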

- Ken


Re: If I want to move a core from one physical machine to another....

2010-10-28 Thread Ken Stanley
On Thu, Oct 28, 2010 at 8:07 AM, Ephraim Ofir ephra...@icq.com wrote:

 How is this better than replication?

 Ephraim Ofir


It's not; for our needs here, we have not set up replication through SOLR.
We are working through OOM problems/performance tuning first, then best
practices second. I just wanted the OP to know that it can be done, and how
we do it. :)


Re: Looking for Developers

2010-10-28 Thread Ken Stanley
On Thu, Oct 28, 2010 at 2:57 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 I don't think we should do this until it becomes a real problem.

 The number of job offers is tiny compared to dev emails, so far, as
 far as I can tell.

 Mike


By the time that it becomes a real problem, it would be too late to get
people to stop spamming the -user mailing list; no?

- Ken


Re: How do I this in Solr?

2010-10-26 Thread Ken Stanley
On Tue, Oct 26, 2010 at 9:15 AM, Savvas-Andreas Moysidis 
savvas.andreas.moysi...@googlemail.com wrote:

 If I get your question right, you probably want to use the AND binary
 operator as in samsung AND andriod AND GPS or +samsung +andriod +GPS


N.b. For these queries you can also pass the q.op parameter in the request
to temporarily change the default operator to AND; this has the same effect
without having to build the query; i.e., you can just pass
http://host:port/solr/select?q=samsung+android+gps&q.op=AND
as the query string (along with any other params you need).


Re: ClassCastException Issue

2010-10-26 Thread Ken Stanley
On Mon, Oct 25, 2010 at 2:45 AM, Alex Matviychuk alex...@gmail.com wrote:

 Getting this when deploying to tomcat:

 [INFO][http-4443-exec-3][solr.schema.IndexSchema] readSchema():394
 Reading Solr Schema
 [INFO][http-4443-exec-3][solr.schema.IndexSchema] readSchema():408
 Schema name=tsadmin
 [ERROR][http-4443-exec-3][util.plugin.AbstractPluginLoader] log():139
 java.lang.ClassCastException: org.apache.solr.schema.StrField cannot
 be cast to org.apache.solr.schema.FieldType
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:419)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:447)
at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141)
at
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:456)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:520)
at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)


 solr schema:

 <?xml version="1.0" encoding="UTF-8" ?>
 <schema name="tsadmin" version="1.2">
    <types>
        <fieldType name="string" class="solr.StrField"
 sortMissingLast="true" omitNorms="true"/>
        ...
    </types>
    <fields>
        <field name="type" type="string" required="true"/>
        ...
    </fields>
 </schema>


 Any ideas?

 Thanks,
 Alex Matviychuk



Alex,

I've run into this issue myself, and it was because I tried to create a
fieldType called "string" (like you). Rename "string" to something else and
the exception should go away.

- Ken


Re: DataImporter using pure solr add XML

2010-10-25 Thread Ken Stanley
On Mon, Oct 25, 2010 at 10:12 AM, Dario Rigolin
dario.rigo...@comperio.itwrote:

 Looking at DataImporter I'm not sure if it's possible to import using a
 standard <add><doc>... xml document representing a document add operation.
 Generating <add><doc> is quite expensive in my application and I have
 cached all those documents into a text column in a MySQL database.
 It will be easier for me to push all updated documents directly from the
 database instead of passing via multiple xml files posted in stream mode
 to Solr.

 Thank you.

 Dario.



Dario,

Technically nothing is stopping you from using the DIH to import your XML
document(s). However, note that the <add><doc>...</doc></add> structure is
not required. In fact, you can make up your own structure for the documents,
so long as you configure the DIH to recognize them. At minimum, you should
be able to use something to the effect of:

<dataSource type="FileDataSource" encoding="UTF-8" />

<document>
    <entity
        name="some_unique_name_for_the_entity"
        rootEntity="false"
        dataSource="null"
        processor="FileListEntityProcessor"
        fileName="some_regex_matching_your_files.*\.xml$"
        baseDir="/path/to/xml/files"
        newerThan="${dataimporter.some_unique_name_for_the_entity.last_index_time}"
    >
        <entity
            name="another_unique_entity_name"
            dataSource="some_unique_name_for_the_entity"
            processor="XPathEntityProcessor"
            url="${some_unique_name_for_the_entity.fileAbsolutePath}"
            forEach="/XMLROOT/CHILD_NODE"
            stream="true"
        >
            <!-- An optional list of <field /> definitions if your XML
                 schema does not match that of SOLR -->
        </entity>
    </entity>
</document>

The break down is as follows:

The <dataSource /> defines the document encoding that SOLR should use for
your XML files.

The top-level <entity /> creates the list of files to parse (hence why the
fileName attribute supports regex expressions). The dataSource attribute
needs to be set to null here (I'm using 1.4.1, and AFAIK this is the same
as 1.3 as well). The rootEntity="false" is important to tell SOLR that it
should not try to define fields from this entity.

The second-level <entity /> is where the documents found in the file list
are processed and parsed. The dataSource attribute needs to be the name of
the top-level <entity />. The url attribute is defined as the absolute path
to the file generated by the top-level entity. The forEach is the key
component here; this is the minimum xPath needed to iterate over your
document structure. So, if for example you had:

<XMLROOT>
    <CHILD_NODE>
        <field1>data</field1>
        <field2>more data</field2>
        ...
    </CHILD_NODE>
</XMLROOT>

Also note that, in my experience, case sensitivity matters when parsing your
xpath instructions.

I hope this helps!

- Ken Stanley


Re: xpath processing

2010-10-23 Thread Ken Stanley
On Fri, Oct 22, 2010 at 11:52 PM, pghorp...@ucla.edu wrote:



 <dataConfig>
 <dataSource name="myfilereader" type="FileDataSource"/>
 <document>
 <entity name="f" rootEntity="false" dataSource="null"
 processor="FileListEntityProcessor" fileName=".*xml" recursive="true"
 baseDir="C:\data\sample_records\mods\starr">
 <entity name="x" dataSource="myfilereader" processor="XPathEntityProcessor"
 url="${f.fileAbsolutePath}" stream="false" forEach="/mods"
 transformer="DateFormatTransformer,RegexTransformer,TemplateTransformer">
 <field column="id" template="${f.file}"/>
 <field column="collectionKey" template="starr"/>
 <field column="collectionName" template="starr"/>
 <field column="fileAbsolutePath" template="${f.fileAbsolutePath}"/>
 <field column="fileName" template="${f.file}"/>
 <field column="fileSize" template="${f.fileSize}"/>
 <field column="fileLastModified" template="${f.fileLastModified}"/>
 <field column="classification_keyword" xpath="/mods/classification"/>
 <field column="accessCondition_keyword" xpath="/mods/accessCondition"/>
 <field column="nameNamePart_s" xpath="/mods/name/namePart[@type = 'date']"/>
 </entity>
 </entity>
 </document>
 </dataConfig>


The documentation says you don't need a dataSource for your
XPathEntityProcessor entity; in my configuration, I have mine set to the
name of the top-level FileListEntityProcessor. Everything else looks fine.
Can you provide one record from your data? Also, are you getting any errors
in your log?

- Ken


Re: xpath processing

2010-10-22 Thread Ken Stanley
Parinita,

In its simplest form, what does your entity definition for DIH look like;
also, what does one record from your xml look like? We need more information
before we can really be of any help. :)

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


On Fri, Oct 22, 2010 at 8:00 PM, pghorp...@ucla.edu wrote:

 Quoting pghorp...@ucla.edu:
 Can someone help me please?


 I am trying to import mods xml data in solr using  the xml/http datasource

 This does not work with XPathEntityProcessor of the data import handler:
 xpath="/mods/name/namePart[@type = 'date']"

 I actually have 143 records with type attribute as 'date' for element
 namePart.

 Thank you
 Parinita






Re: boosting injection

2010-10-19 Thread Ken Stanley
Andrea,

Using the SOLR dismax query handler, you could set up queries like this to
boost on fields of your choice. Basically, the q parameter would be the
query terms (without the field definitions, and a qf (Query Fields)
parameter that you use to define your boost(s):
http://wiki.apache.org/solr/DisMaxQParserPlugin. A non-SOLR alternative
would be to parse the query in whatever application is sending the queries
to the SOLR instance to make the necessary transformations.
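
For illustration, a hedged example of such a dismax request (the host
and parameter values are hypothetical):

    http://localhost:8983/solr/select?defType=dismax&q=history+joyce&qf=title^10+author^5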

Regards,

Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


On Tue, Oct 19, 2010 at 8:48 AM, Andrea Gazzarini 
andrea.gazzar...@atcult.it wrote:

  Hi all,
 I have a client that is sending this query

 q=title:history AND author:joyce

 is it possible to transform at runtime this query in this way:

 q=title:history^10 AND author:joyce^5

 ?

 Best regards,
 Andrea





Re: **SPAM** Re: boosting injection

2010-10-19 Thread Ken Stanley
Andrea,

Another approach, aside of Markus' suggestion, would be to create your own
handler that could intercept the query and perform whatever necessary
transformations that you need at query time. However, that would require
having Java knowledge (which I make no assumption).

Regards,

Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


On Tue, Oct 19, 2010 at 10:23 AM, Andrea Gazzarini 
andrea.gazzar...@atcult.it wrote:

  Hi Ken,
 thanks for your response...unfortunately it doesn't solve my problem.

 I cannot change the client behaviour, so the query must be a full query and
 not only the query terms.
 In this scenario, it would be great, for example, if I could declare the
 boost in the schema field definition, but I think it's not possible, isn't
 it?

 Regards
 Andrea

 --
 *From:* Ken Stanley [mailto:doh...@gmail.com]
 *To:* solr-user@lucene.apache.org
 *Sent:* Tue, 19 Oct 2010 15:05:31 +0200
 *Subject:* **SPAM** Re: boosting injection

 Andrea,

 Using the SOLR dismax query handler, you could set up queries like this to
 boost on fields of your choice. Basically, the q parameter would be the
 query terms (without the field definitions, and a qf (Query Fields)
 parameter that you use to define your boost(s):
 http://wiki.apache.org/solr/DisMaxQParserPlugin. A non-SOLR alternative
 would be to parse the query in whatever application is sending the queries
 to the SOLR instance to make the necessary transformations.

 Regards,

 Ken

 It looked like something resembling white marble, which was
 probably what it was: something resembling white marble.
 -- Douglas Adams, The Hitchhikers Guide to the Galaxy


 On Tue, Oct 19, 2010 at 8:48 AM, Andrea Gazzarini 
 andrea.gazzar...@atcult.it wrote:

  Hi all,
  I have a client that is sending this query
 
  q=title:history AND author:joyce
 
  is it possible to transform at runtime this query in this way:
 
  q=title:history^10 AND author:joyce^5
 
  ?
 
  Best regards,
  Andrea
 
 
 




Re: Documents and Cores, take 2

2010-10-19 Thread Ken Stanley
Ron,

In the past I've worked with SOLR for a product that required the ability to
search - separately - for companies, people, business lists, and a
combination of the previous three. In designing this in SOLR, I found that
using a combination of explicit field definitions and dynamic fields (
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields) gave me the best
possible solution for the problem.

In essence, I created explicit fields that would be shared among all
document types: a unique id, a document type, an indexed date, a modified
date, and maybe a couple of other fields that share traits with all document
types (i.e., name, a market specific to our business, etc). The unique id
was built as a string, and was prefixed with the document type, and it ended
with the unique id from the database.

The dynamic fields can be configured to be as flexible as you need, and in
my experience I would strongly recommend documenting each type of dynamic
field for each of your document types as a reference for your developers
(and yourself). :)

This allows us to build queries that can be focused on specific document
types, or that combine all of the types into a super search. For example,
you could do something to the effect of: (docType:people) AND
(df_firstName:John AND df_lastName:Hancock), (docType:companies) AND
(df_BusinessName:Acme+Inc), or even ((df_firstName:John AND
df_lastName:Hancock) OR (df_BusinessName:Acme+Inc)).
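
As a rough sketch of the shared-plus-dynamic schema described above
(field and type names are hypothetical):

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="docType" type="string" indexed="true" stored="true"/>
    <field name="indexedDate" type="date" indexed="true" stored="true"/>
    <dynamicField name="df_*" type="text" indexed="true" stored="true"/>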

I hope this helps!

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


On Tue, Oct 19, 2010 at 4:57 PM, Olson, Ron rol...@lbpc.com wrote:

 Hi all-

 I have a newbie design question about documents, especially with SQL
 databases. I am trying to set up Solr to go against a database that, for
 example, has items and people. The way I see it, and I don't know if
 this is right or not (thus the question), is that I see both as separate
 documents as an item may contain a list of parts, which the user may want to
 search, and, as part of the item, view the list of people who have ordered
 the item.

 Then there's the actual people, who the user might want to search to find
 a name and, consequently, what items they ordered. To me they are both top
 level things, with some overlap of fields. If I'm searching for people,
 I'm likely not going to be interested in the parts of the item, while if I'm
 searching for items the likelihood is that I may want to search for
 42532 which is, in this instance, a SKU, and not get hits on the zip code
 section of the people.

 Does it make sense, then, to separate these two out as separate documents?
 I believe so because the documentation I've read suggests that a document
 should be analogous to a row in a table (in this case, very de-normalized).
 What is tripping me up is, as far as I can tell, you can have only one
 document type per index, and thus one document per core. So in this example,
 I have two cores, items and people. Is this correct? Should I embrace
 the idea of having many cores or am I supposed to have a single, unified
 index with all documents (which doesn't seem like Solr supports).

 The ultimate question comes down to the search interface. I don't
 necessarily want to have the user explicitly state which document they want
 to search; I'd like them to simply type 42532 and get documents from both
 cores, and then possibly allow for filtering results after the fact, not
 before. As I've only used the admin site so far (which is core-specific),
 does the client API allow for unified searching across all cores? Assuming
 it does, I'd think my idea of multiple-documents is okay, but I'd love to
 hear from people who actually know what they're doing. :)

 Thanks,

 Ron

 BTW: Sorry about the problem with the previous message; I didn't know about
 thread hijacking.




Re: SOLR DateTime and SortableLongField field type problems

2010-10-18 Thread Ken Stanley
Just following up to see if anybody might have some words of wisdom on the
issue?

Thank you,

Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


On Fri, Oct 15, 2010 at 6:42 PM, Ken Stanley doh...@gmail.com wrote:

 Hello all,

 I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to follow
 the advice from
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html about
 converting date fields to SortableLong fields for better memory
 efficiency. However, whenever I try to do this using the DateFormater, I get
 exceptions when indexing for every row that tries to create my sortable
 fields.

 In my schema.xml, I have the following definitions for the fieldType and
 dynamicField:

 <fieldType name="sdate" class="solr.SortableLongField" indexed="true"
 stored="false" sortMissingLast="true" omitNorms="true" />
 <dynamicField name="sort_date_*" type="sdate" stored="false" indexed="true"
 />

 In my dih.xml, I have the following definitions:

 <dataConfig>
 <dataSource type="FileDataSource" encoding="UTF-8" />
 <document>
 <entity
 name="xml_stories"
 rootEntity="false"
 dataSource="null"
 processor="FileListEntityProcessor"
 fileName="legacy_stories.*\.xml$"
 recursive="false"
 baseDir="/usr/local/extracts"
 newerThan="${dataimporter.xml_stories.last_index_time}"
 >
 <entity
 name="stories"
 pk="id"
 dataSource="xml_stories"
 processor="XPathEntityProcessor"
 url="${xml_stories.fileAbsolutePath}"
 forEach="/RECORDS/RECORD"
 stream="true"
 transformer="DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer"
 onError="continue"
 >
 <field column="_modified_date"
 xpath="/RECORDS/RECORD/PROP[@NAME='R_ModifiedTime']/PVAL" />
 <field column="modified_date"
 sourceColName="_modified_date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

 <field column="_df_date_published"
 xpath="/RECORDS/RECORD/PROP[@NAME='R_StoryDate']/PVAL" />
 <field column="df_date_published"
 sourceColName="_df_date_published" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

 <field column="sort_date_modified"
 sourceColName="modified_date" dateTimeFormat="yyyyMMddhhmmss" />
 <field column="sort_date_published"
 sourceColName="df_date_published" dateTimeFormat="yyyyMMddhhmmss" />
 </entity>
 </entity>
 </document>
 </dataConfig>

 The fields in question are in the formats:

 <RECORDS>
 <RECORD>
 <PROP NAME="R_StoryDate">
 <PVAL>2001-12-04T00:00:00Z</PVAL>
 </PROP>
 <PROP NAME="R_ModifiedTime">
 <PVAL>2001-12-04T19:38:01Z</PVAL>
 </PROP>
 </RECORD>
 </RECORDS>

 The exception that I am receiving is:

 Oct 15, 2010 6:23:24 PM
 org.apache.solr.handler.dataimport.DateFormatTransformer transformRow
 WARNING: Could not parse a Date field
 java.text.ParseException: Unparseable date: Wed Nov 28 21:39:05 EST 2007
 at java.text.DateFormat.parse(DateFormat.java:337)
 at
 org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
 at
 org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
 at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
 at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

 I know that it has to be the SortableLong fields, because if I remove just
 those two lines from my dih.xml, everything imports as I expect it to. Am I
 doing something wrong? Mis-using the SortableLong and/or DateTransformer? Is
 this not supported in my version of SOLR? I'm not very experienced with
 Java, so digging into the code would be a lost cause for me right now. I was
 hoping that somebody here might be able to help point me in the
 right/correct direction.

 It should be noted that the modified_date and df_date_published fields
 index just fine (so long as I do it as I've defined above).

 Thank you,

 - Ken

 It looked like something resembling white marble, which was
 probably what it was: something resembling white marble.
 -- Douglas Adams, The Hitchhikers Guide to the Galaxy

Re: SOLR DateTime and SortableLongField field type problems

2010-10-18 Thread Ken Stanley
On Mon, Oct 18, 2010 at 7:52 AM, Michael Sokolov soko...@ifactory.com wrote:

 I think if you look closely you'll find the date quoted in the Exception
 report doesn't match any of the declared formats in the schema.  I would
 suggest, as a first step, hunting through your data to see where that date
 is coming from.

 -Mike


[Note: RE-sending this because apparently, in my sleepy stupor, I clicked
the wrong Reply button and never sent this to the list (it's a Monday) :)]

I've noticed that date anomaly as well, and I've discovered that it is one
of the gotchas of DIH: the transformer itself produces that format. All of
the dates in the data are in the correct yyyy-MM-dd'T'hh:mm:ss'Z' format.
Once a field is run through dateTimeFormat, I assume it is converted into a
Date object; trying to use that Date object in any other transformer (i.e.,
via a template, or even another dateTimeFormat) results in the exception
I've described, which shows the date in Date's default toString() form.
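
For anyone else who hits this, the workaround I am leaning toward - strictly
an untested sketch, and the function name makeSortDates is my own invention -
is to skip date parsing for the sortable copies entirely and build them from
the raw string columns with a ScriptTransformer (which requires Java 6):

<script><![CDATA[
  // hypothetical helper; enable it by appending "script:makeSortDates"
  // to the stories entity's transformer list
  function makeSortDates(row) {
    var raw = row.get('_modified_date');
    if (raw != null) {
      // e.g. 2001-12-04T19:38:01Z -> 20011204193801
      row.put('sort_date_modified', ('' + raw).replace(/[^0-9]/g, ''));
    }
    raw = row.get('_df_date_published');
    if (raw != null) {
      row.put('sort_date_published', ('' + raw).replace(/[^0-9]/g, ''));
    }
    return row;
  }
]]></script>

Because the raw values are already zero-padded ISO-8601 strings, stripping
the punctuation yields a stable yyyyMMddhhmmss number without ever touching
a Date object.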

Thanks,

Ken Stanley


Re: problem on running fullimport

2010-10-15 Thread Ken Stanley
On Fri, Oct 15, 2010 at 7:42 AM, swapnil dubey swapnil.du...@gmail.com wrote:

 Hi,

 I am using the full import option with the data-config file as mentioned
 below

 <dataConfig>
   <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
     url="jdbc:mysql:///xxx" user="xxx" password="xx" />
   <document>
     <entity name="yyy" query="select studentName from test1">
       <field column="studentName" name="studentName" />
     </entity>
   </document>
 </dataConfig>


 On running the full-import option, I am getting the error mentioned below.
 I had already included the dataimport.properties file in my conf directory.
 Please help me to get the issue resolved.

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">334</int>
   </lst>
   <lst name="initArgs">
     <lst name="defaults">
       <str name="config">data-config.xml</str>
     </lst>
   </lst>
   <str name="command">full-import</str>
   <str name="mode">debug</str>
   <null name="documents"/>
   <lst name="verbose-output">
     <lst name="entity:test1">
       <lst name="document#1">
         <str name="query">select studentName from test1</str>
         <str name="EXCEPTION">
 org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
 execute query: select studentName from test1 Processing Document # 1
 ...

 --
 Regards
 Swapnil Dubey


Swapnil,

Everything looks fine, except that your entity definition never says which
data source it should use. If you give the data source a name and add a
matching dataSource attribute to the entity, that should get rid of your
exception. As a reminder, the DataImportHandler wiki
(http://wiki.apache.org/solr/DataImportHandler) on Apache's website is very
helpful for learning how to use the DIH properly; keeping a printed copy
beside me for easy and quick reference has served me well.
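
Here is a minimal sketch of what I mean (the data source name
"JdbcDataSource" below is just a label I picked; any name works so long as
the entity references the same one):

<dataConfig>
  <dataSource name="JdbcDataSource" type="JdbcDataSource"
    driver="com.mysql.jdbc.Driver"
    url="jdbc:mysql:///xxx" user="xxx" password="xx" />
  <document>
    <entity name="yyy" dataSource="JdbcDataSource"
      query="select studentName from test1">
      <field column="studentName" name="studentName" />
    </entity>
  </document>
</dataConfig>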

- Ken


SOLR DateTime and SortableLongField field type problems

2010-10-15 Thread Ken Stanley
Hello all,

I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to follow
the advice from
http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html about
converting date fields to SortableLong fields for better memory efficiency.
However, whenever I try to do this using the DateFormatTransformer, I get
exceptions when indexing for every row that tries to create my sortable
fields.

In my schema.xml, I have the following definitions for the fieldType and
dynamicField:

<fieldType name="sdate" class="solr.SortableLongField" indexed="true"
  stored="false" sortMissingLast="true" omitNorms="true" />
<dynamicField name="sort_date_*" type="sdate" stored="false" indexed="true" />

In my dih.xml, I have the following definitions:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity
      name="xml_stories"
      rootEntity="false"
      dataSource="null"
      processor="FileListEntityProcessor"
      fileName="legacy_stories.*\.xml$"
      recursive="false"
      baseDir="/usr/local/extracts"
      newerThan="${dataimporter.xml_stories.last_index_time}"
    >
      <entity
        name="stories"
        pk="id"
        dataSource="xml_stories"
        processor="XPathEntityProcessor"
        url="${xml_stories.fileAbsolutePath}"
        forEach="/RECORDS/RECORD"
        stream="true"
        transformer="DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer"
        onError="continue"
      >
        <field column="_modified_date"
          xpath="/RECORDS/RECORD/PROP[@NAME='R_ModifiedTime']/PVAL" />
        <field column="modified_date" sourceColName="_modified_date"
          dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

        <field column="_df_date_published"
          xpath="/RECORDS/RECORD/PROP[@NAME='R_StoryDate']/PVAL" />
        <field column="df_date_published" sourceColName="_df_date_published"
          dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

        <field column="sort_date_modified" sourceColName="modified_date"
          dateTimeFormat="yyyyMMddhhmmss" />
        <field column="sort_date_published" sourceColName="df_date_published"
          dateTimeFormat="yyyyMMddhhmmss" />
      </entity>
    </entity>
  </document>
</dataConfig>

The fields in question are in the following format:

<RECORDS>
  <RECORD>
    <PROP NAME="R_StoryDate">
      <PVAL>2001-12-04T00:00:00Z</PVAL>
    </PROP>
    <PROP NAME="R_ModifiedTime">
      <PVAL>2001-12-04T19:38:01Z</PVAL>
    </PROP>
  </RECORD>
</RECORDS>

The exception that I am receiving is:

Oct 15, 2010 6:23:24 PM
org.apache.solr.handler.dataimport.DateFormatTransformer transformRow
WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: Wed Nov 28 21:39:05 EST 2007
at java.text.DateFormat.parse(DateFormat.java:337)
at
org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
at
org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

I know that it has to be the SortableLong fields, because if I remove just
those two lines from my dih.xml, everything imports as I expect it to. Am I
doing something wrong? Mis-using the SortableLong and/or DateTransformer? Is
this not supported in my version of SOLR? I'm not very experienced with
Java, so digging into the code would be a lost cause for me right now. I was
hoping that somebody here might be able to help point me in the
right/correct direction.

It should be noted that the modified_date and df_date_published fields index
just fine (so long as I do it as I've defined above).

Thank you,

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


Re: Searching Across Multiple Cores

2010-10-14 Thread Ken Stanley
Steve,

Using shards is actually quite simple; it's just a matter of setting up your
shards (via multiple cores, or multiple instances of SOLR) and then passing
the shards parameter in the query string. The shards parameter is a
comma-separated list of the servers/cores you wish to use together.

So, let's try this using a fictitious example. You have two cores: one
called main for your main set of metadata, and one called favorites for your
user-favorites metadata. You set up each schema accordingly, and you've indexed
your data. When you want to do a query on both sets of data you would build
your query appropriately, and then use the following URL (the host is
assumed to be localhost for simplicity):

http://localhost/solr/main/select?q=id:[*+TO+*]&shards=localhost/solr/main,localhost/solr/favorites&rows=100&start=0

I am personally investigating using this technique to tie together two cores
that utilize different schemas; one schema will contain news articles,
blogs, and similar types of data, while another schema will contain
company-specific information, such as addresses, etc. If you're still having
trouble after trying this, let me know and I'd be more than happy to share
any findings that I come across.

I hope that this helps to clear things up for you. :)

- Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, The Hitchhikers Guide to the Galaxy


On Thu, Oct 14, 2010 at 4:25 AM, Lohrenz, Steven
steven.lohr...@hmhpub.com wrote:

 Ken,

 I have been through that page many times. I could use Distributed search
 for what? The first scenario or the second?

 The question is: can I merge a set of results from the two cores/shards and
 only return results that exist in both (determined by the resourceId, which
 exists on both)?

 Cheers,
 Steve

 -Original Message-
 From: Ken Stanley [mailto:doh...@gmail.com]
 Sent: 13 October 2010 20:08
 To: solr-user@lucene.apache.org
 Subject: Re: Searching Across Multiple Cores

 On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven
 steven.lohr...@hmhpub.com wrote:

  Hi,
 
  I am trying to figure out how I can accomplish the following:
 
  I have a fairly static and large set of resources I need to have indexed
  and searchable. Solr seems to be a perfect fit for that. In addition I
  need to have the ability for my users to add resources from the main data
  set to a 'Favourites' folder (which can include a few more tags added by
  them). The Favourites needs to be searchable in the same manner as the
  main data set, across all the same fields.
 
  My first thought was to have two separate schemas
  - the first  for the main data set and its metadata
  - the second for the Favourites folder with all of the metadata from the
  main set copied over and then adding the additional fields.
 
  Then I thought that would probably waste quite a bit of space (the number
  of users is much larger than the number of main resources).
 
  So then I thought I could have the main data set with its metadata. Then
  there would be a second one for the Favourites folder with the unique id
  from the first and the additional fields it needs (userId, grade, folder,
  tag). In addition, I would create another schema/core with all the fields
  from the other two and have a request handler defined on it that searches
  across the other 2 cores and returns the results through this core.
 
  This third core would have searches run against it, where results would
  be returned for only a single user. For example, a user searches their
  Favourites folder for all the items with Foo. The result is only those
  items the user has added to their Favourites with Foo somewhere in their
  main data set metadata.
 
  Could this be made to work? What would the consequences be? Any
  alternative suggestions?
 
  Thanks,
  Steve
 
 
 Steve,

 From your description, it really sounds like you could reap the benefits of
 using Distributed Search in SOLR:

 http://wiki.apache.org/solr/DistributedSearch

 I hope that this helps.

 - Ken



Re: Searching Across Multiple Cores

2010-10-13 Thread Ken Stanley
On Wed, Oct 13, 2010 at 2:11 PM, Lohrenz, Steven
steven.lohr...@hmhpub.com wrote:

 Hi,

 I am trying to figure out how I can accomplish the following:

 I have a fairly static and large set of resources I need to have indexed
 and searchable. Solr seems to be a perfect fit for that. In addition I need
 to have the ability for my users to add resources from the main data set to
 a 'Favourites' folder (which can include a few more tags added by them). The
 Favourites needs to be searchable in the same manner as the main data set,
 across all the same fields.

 My first thought was to have two separate schemas
 - the first  for the main data set and its metadata
 - the second for the Favourites folder with all of the metadata from the
 main set copied over and then adding the additional fields.

 Then I thought that would probably waste quite a bit of space (the number
 of users is much larger than the number of main resources).

 So then I thought I could have the main data set with its metadata. Then
 there would be a second one for the Favourites folder with the unique id from
 the first and the additional fields it needs (userId, grade, folder, tag).
 In addition, I would create another schema/core with all the fields from the
 other two and have a request handler defined on it that searches across the
 other 2 cores and returns the results through this core.

 This third core would have searches run against it, where results would be
 returned for only a single user. For example, a user searches
 their Favourites folder for all the items with Foo. The result is only those
 items the user has added to their Favourites with Foo somewhere in their
 main data set metadata.

 Could this be made to work? What would the consequences be? Any alternative
 suggestions?

 Thanks,
 Steve


Steve,

From your description, it really sounds like you could reap the benefits of
using Distributed Search in SOLR:

http://wiki.apache.org/solr/DistributedSearch

I hope that this helps.

- Ken


Re: searching while importing

2010-10-13 Thread Ken Stanley
On Wed, Oct 13, 2010 at 6:38 PM, Shawn Heisey s...@elyograg.org wrote:

  If you are using the DataImportHandler, you will not be able to search new
 data until the full-import or delta-import is complete and the update is
 committed.  When I do a full reindex, it takes about 5 hours, and until it
 is finished, I cannot search it.

This is not true; when I use the DIH to do a full-import, my team and I are
still able to search the already-indexed data.

 I have not tried to issue a manual commit in the middle of an import to see
 whether that makes data inserted up to that point searchable, but I would
 not expect that to work.

If you set the autoCommit properties maxDocs and maxTime to reasonable
values, then once those limits are reached, I suspect that SOLR would commit
and continue indexing; however, I have not had the chance to use those
features in solrconfig.xml.
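
For reference, the stanza would look something like this in solrconfig.xml;
the thresholds below are placeholders I made up, not recommendations:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit after this many pending documents... -->
    <maxDocs>10000</maxDocs>
    <!-- ...or after this many milliseconds, whichever comes first -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>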


 If you need this kind of functionality, you may need to change your build
 system so that a full import clears the index manually and then does a
 series of delta-import batches.

The only time I've had an issue with being able to search while indexing is
when a misconfiguration in my DIH caused the import to finish without
indexing anything, thus wiping out my data. Aside from that, I continually
index and search at the same time almost every day (using 1.4.1).
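
For what it's worth, a manual commit can be issued at any time against the
update handler; I have not verified whether it makes mid-import DIH
documents visible, but the command itself is just (host and port are the
example defaults):

curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'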




 On 10/13/2010 3:51 PM, Tri Nguyen wrote:

 Hi,

 Can I perform searches against the index while it is being imported?
 Does importing add one document at a time, or will Solr make a temporary
 index and switch to that index when indexing is done?

 Thanks,
 Tri





Re: Solr PHP PECL Extension going to Stable Release - Wishing for Any New Features?

2010-10-12 Thread Ken Stanley

   If you are using Solr via PHP and would like to see any new features in
  the
   extension please feel free to send me a note.


I'm new to this list, but in seeing this thread - and as a user of the Solr
PHP extension - I wanted to make a suggestion that, while minor, I think
would greatly improve the quality of the extension.

(I'm basing this mostly off of SolrQuery since that's where I've encountered
the issue, but this might be true elsewhere)

Whenever a method is supposed to return an array (e.g.,
SolrQuery::getFields(), SolrQuery::getFacets(), etc.), if there is no data to
return, a null is returned. I think that this should be normalized across
the board to return an empty array. First, the documentation is
contradictory (http://us.php.net/manual/en/solrquery.getfields.php) in that
the method signature says that it returns an array (not mixed), while the
Return Values section says that it returns either an array or null.
Secondly, returning an array under all circumstances provides more
consistency and requires less guard logic; for example, here is what looking
up the fields takes in its current state:

<?php
// .. assume a proper set up
if ($solrquery->getFields() !== null) {
    foreach ($solrquery->getFields() as $field) {
        // Do something
    }
}
?>
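
And here is the same lookup if the extension guaranteed an array (possibly
empty) instead of null - the guard simply disappears:

<?php
// assumes getFields() always returns an array, per the proposal above
foreach ($solrquery->getFields() as $field) {
    // Do something
}
?>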

This is a minor request, I know. But, I feel that it would go a long way
toward polishing the extension up for general consumption.

Thank you,

Ken Stanley

PS. I apologize if this request has come through the pipes already; as I've
stated, I am new to this list; I have yet to find any reference to my
request. :)