Specifying multiple documents in DataImportHandler dataConfig

2009-09-08 Thread Rupert Fiasco
I am using the DataImportHandler with a JDBC datasource. From my
understanding of DIH, for each of my content types, e.g. Blog posts,
Mesh Categories, etc., I would construct a series of document/entity
sets, like:

<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://" />

<!-- BLOG ENTRIES -->
<document name="blog_entries">
  <entity name="blog_entries" query="select
id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
from blog_entries">
    <field column="id" name="pk_i" />
    <field column="id" name="id" />
    <field column="title" name="text_t" />
    <field column="data" name="text_t" />
  </entity>
</document>

<!-- MESH CATEGORIES -->
<document name="mesh_category">
  <entity name="mesh_categories" query="select
id,name,node_key,name as name_fc,'MeshCategory' as type from
mesh_categories">
    <field column="id" name="pk_i" />
    <field column="id" name="id" />
    <field column="name" name="text_t" />
    <field column="node_key" name="string" />
    <field column="name_fc" name="facet_value" />
    <field column="type" name="type_t" />
  </entity>
</document>
</datasource>
</dataConfig>


Solr parses this just fine and allows me to issue a
/dataimport?command=full-import and it runs, but it only runs against
the first document (blog_entries). It doesn't run against the 2nd
document (mesh_categories).

If I remove the 2 document elements and wrap both entity sets in just
one document tag, then both sets get indexed, which seemingly achieves
my goal. This just doesn't make sense from my understanding of how DIH
works. My 2 content types are indeed separate, so they logically
represent two document types, not one.
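For reference, the single-document layout that does get both sets
indexed looks roughly like this (same dataSource; the queries and
field mappings are the ones above, trimmed here):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://" />
  <document>
    <entity name="blog_entries" query="...">
      <!-- field mappings as above -->
    </entity>
    <entity name="mesh_categories" query="...">
      <!-- field mappings as above -->
    </entity>
  </document>
</dataConfig>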

Is this correct? What am I missing here?

Thanks
-Rupert


Re: Specifying multiple documents in DataImportHandler dataConfig

2009-09-08 Thread Rupert Fiasco
Maybe I should be more clear: I have multiple tables in my DB that I
need to save to my Solr index. In my app code I have logic to persist
each table (each maps to an application model) to Solr. This is fine.
I am just trying to speed up indexing time by using DIH instead of
going through my application. From what I understand of DIH, I can
specify one dataSource element and then a series of document/entity
sets, one for each of my models. But like I said before, DIH only appears
to want to index the first document declared under the dataSource tag.

-Rupert

On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
 I am using the DataImportHandler with a JDBC datasource. From my
 understanding of DIH, for each of my content types e.g. Blog posts,
 Mesh Categories, etc I would construct a series of document/entity
 sets, like

 <dataConfig>
 <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://" />

    <!-- BLOG ENTRIES -->
    <document name="blog_entries">
      <entity name="blog_entries" query="select
 id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
 from blog_entries">
        <field column="id" name="pk_i" />
        <field column="id" name="id" />
        <field column="title" name="text_t" />
        <field column="data" name="text_t" />
      </entity>
    </document>

    <!-- MESH CATEGORIES -->
    <document name="mesh_category">
      <entity name="mesh_categories" query="select
 id,name,node_key,name as name_fc,'MeshCategory' as type from
 mesh_categories">
        <field column="id" name="pk_i" />
        <field column="id" name="id" />
        <field column="name" name="text_t" />
        <field column="node_key" name="string" />
        <field column="name_fc" name="facet_value" />
        <field column="type" name="type_t" />
      </entity>
    </document>
 </datasource>
 </dataConfig>


 Solr parses this just fine and allows me to issue a
 /dataimport?command=full-import and it runs, but it only runs against
 the first document (blog_entries). It doesnt run against the 2nd
 document (mesh_categories).

 If I remove the 2 document elements and wrap both entity sets in just
 one document tag, then both sets get indexed, which seemingly achieves
 my goal. This just doesnt make sense from my understanding of how DIH
 works. My 2 content types are indeed separate so they logically
 represent two document types, not one.

 Is this correct? What am I missing here?

 Thanks
 -Rupert



Re: Responses getting truncated

2009-09-03 Thread Rupert Fiasco
So we have been running LucidWorks for Solr for about a week now and
have seen no problems - so I believe it was due to that buffering
issue in Jetty 6.1.3, as suggested here:

 It really looks like you're hitting a lower-level IO buffering bug
 (esp when you see a response starting off with the tail of another
 response).  That doesn't look like it could be a Solr bug... but
 rather smells like a thread safety bug in the servlet container.

Thanks for everyone's help and input. LucidWorks For The Win.

-Rupert

On Fri, Aug 28, 2009 at 4:07 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
 I deployed LucidWorks with my existing solrconfig / schema and
 re-indexed my data into it and pushed it out to production, we'll see
 how it stacks up over the weekend. Already queries that were breaking
 on the prior Jetty/stock Solr setup are now working - but I have seen
 it before where upon an initial re-index things work OK then a couple
 of days later they break.

 Keep y'all posted.

 Thanks
 -Rupert

 On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
 Yes, I am hitting the Solr server directly (medsolr1.colo:9007)

 Versions / architectures:

 Jetty(6.1.3)

 o...@medsolr1 ~ $ uname -a
 Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
 x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

 o...@medsolr1 ~ $ java -version
 java version 1.6.0_11
 Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
 Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)


 I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

 -Rupert

  On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley <ysee...@gmail.com> wrote:
  On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
 If I run these through curl on the command its
 truncated and if I run the search through the web-based admin panel
 then I get an XML parse error.

 Are you running curl directly against the solr server, or going
 through a load balancer?  Cutting out the middle-men using curl was a
 great idea - just make sure to go all the way.

 At first I thought it could possibly be a FastWriter bug (internal
 Solr class), but that's only used on the TextWriter (JSON, Python,
 Ruby) based formats, not on the original XML format.

 It really looks like you're hitting a lower-level IO buffering bug
 (esp when you see a response starting off with the tail of another
 response).  That doesn't look like it could be a Solr bug... but
 rather smells like a thread safety bug in the servlet container.

 What type of machine are you running on?  What JVM?
 You could try upgrading your version of Jetty, the JVM, or try
 switching to Tomcat.

 -Yonik
 http://www.lucidimagination.com


 This appears to have just started recently and the only thing we have
 done is change our indexer from a PHP one to a Java one, but
 functionally they are identical.

 Any thoughts? Thanks in advance.

 - Rupert






Re: Responses getting truncated

2009-08-28 Thread Rupert Fiasco
Firstly, to everyone who has been helping me, thank you very much. All
this feedback is helping me narrow down these issues.

I deleted the index and re-indexed all the data from scratch and for a
couple of days we were OK, but now it seems to be failing again.

It happens on different input documents so what was broken before now
works (documents that were having issues before are OK now, after a
fresh re-index).

An issue we are seeing now is that an XML response from Solr will
contain the tail of an earlier response. For example:

http://brockwine.com/solr2.txt

That is a response we are getting from Solr. Using the web interface
for Solr in Firefox, Firefox freaks out because it tries to parse
that and, of course, it's invalid XML, but I can retrieve it via
curl.

Has anyone seen this before?

In regards to earlier questions:

 i assume you are correct, but you listed several steps of transformation
 above, are you certain they all work correctly and produce valid UTF-8?

Yes, I have looked at the source and contacted the author of the
conversion library we are using and have verified that if UTF8 goes in
then UTF8 will come out and UTF8 is definitely going in.

I don't think sending over an actual input document would help because
it seems to change. Plus, this latest issue appears to be more of an
issue of the last response buffer not clearing, or something.

What's strange is that if I wait a few minutes and reload, then the
buffer is cleared and I get back a valid response. It's intermittent,
but appears to be happening frequently.

If it matters, we started using LucidGaze for Solr about 10 days ago,
approximately when these issues started happening (but it's hard to say
if that's the cause, because at the same time we switched from a PHP to
a Java indexing client).

Thanks for your patience

-Rupert

On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

 : We are running an instance of MediaWiki so the text goes through a
 : couple of transformations: wiki markup -> html -> plain text.
 : It's at this last step that I take a snippet and insert that into Solr.
        ...
 : doc.addField("text_snippet_t", article.getSnippet(1000));

 ok, well first off: that's not the field where you are having problems,
 is it?  if i remember correctly from your previous posts, wasn't the
 response getting aborted in the middle of the Contents field?

 : and a maximum of 1K chars if its bigger. I initialized this String
 : from the DB by using the String constructor where I pass in the
 : charset/collation
 :
 : text = new String(textFromDB, "UTF-8");
 :
 : So to the best of my knowledge, accessing a substring of a UTF-8
 : encoded string should not break up the UTF-8 code point. Is that an

 i assume you are correct, but you listed several steps of transformation
 above, are you certain they all work correctly and produce valid UTF-8?

 this leads back to my suggestion before

 :  Can you put the original (pre solr, pre solrj, raw untouched, etc...)
 :  file that this solr doc came from online somewhere?
 : 
 :  What does your *indexing* code look like? ... Can you add some debugging to
 :  the SolrJ client when you *add* this doc to print out exactly what those
 :  1000 characters are?


 -Hoss



Re: Responses getting truncated

2009-08-28 Thread Rupert Fiasco
I know in my last message I said I was having issues with extra
content at the start of a response, resulting in an invalid document.
I still am having issues with documents getting truncated (yes, I have
problems galore).

I will elaborate on why it's so difficult to track down an actual
document that is causing the failure - i.e. one that produces an
invalid / truncated response (if I could find such a document I could
post it to the group).

I will just document the steps:

1) I have a query which results in a bogus truncated document. This
query pulls back all fields. If I take that same query and remove the
text_t field from the returned field list, then all is well. This
indicates to me that it's a problem with the text_t field. This query
uses the default returned rows of 10.

2) So far so good. Then my next step is to find the document. So I
take my original query, remove the text_t from the field list to get
my result set.

3) I run a new query that JUST selects that document based on its Doc
ID (which I have from the first query). My thinking is that my
broken document HAS to be in that set, so I can just select it by ID
and then validate the response.

This is where it breaks down: I know one or more broken documents are
in my set, but if I iterate over each doc id and pull it out
individually, its response is valid. It's only broken when I pull it
out in the first query. It's NOT broken when I pull it out by ID, even
though I am also pulling out the same broken field.

If you can read Ruby my script is here:

http://brockwine.com/solr_fetch.txt

In the first net/http call, if I include the text_t field in the
fl list then it breaks. If I remove it, get the doc ids and then
iterate over each one and get it back from Solr (including the
supposed broken field text_t) then it works just fine - the
exception is never raised. But it is raised in the first call if I
include it.

To me this makes absolutely no sense.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 2:14 PM, Joe Calderon <calderon@gmail.com> wrote:
 i had a similar issue with text from past requests showing up, this was on
 1.3 nightly, i switched to using the lucid build of 1.3 and the problem went
 away, i'm using a nightly of 1.4 right now also without probs, then again
 your mileage may vary as i also made a bunch of schema changes that might
 have had some effect, it wouldn't hurt to try though


 On 08/28/2009 02:04 PM, Rupert Fiasco wrote:

 Firstly, to everyone who has been helping me, thank you very much. All
 this feedback is helping me narrow down these issues.

 I deleted the index and re-indexed all the data from scratch and for a
 couple of days we were OK, but now it seems to be erring again.

 It happens on different input documents so what was broken before now
 works (documents that were having issues before are OK now, after a
 fresh re-index).

 An issue we are seeing now is that an XML response from Solr will
 contain the tail of an earlier response, for an example:

 http://brockwine.com/solr2.txt

 That is a response we are getting from Solr - using the web interface
 for Solr in Firefox, Firefox freaks out because it tries to parse
 that, and of course, its invalid XML, but I can retrieve that via
 curl.

 Anyone seeing this before?

 In regards to earlier questions:



 i assume you are correct, but you listed several steps of transformation
  above, are you certain they all work correctly and produce valid UTF-8?


 Yes, I have looked at the source and contacted the author of the
 conversion library we are using and have verified that if UTF8 goes in
 then UTF8 will come out and UTF8 is definitely going in.

 I dont think sending over an actual input document would help because
 it seems to change. Plus, this latest issue appears to be more an
 issue of the last response buffer not clearing or something.

 Whats strange is that if I wait a few minutes and reload, then the
 buffer is cleared and I get back a valid response, its intermittent,
 but appears to be happening frequently.

 If it matters, we started using LucidGaze for Solr about 10 days ago,
 approximately when these issues started happening (but its hard to say
 if thats an issue because at this same time we switched from a PHP to
 Java indexing client).

 Thanks for your patience

 -Rupert

 On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:


 : We are running an instance of MediaWiki so the text goes through a
 : couple of transformations: wiki markup ->  html ->  plain text.
 : It's at this last step that I take a snippet and insert that into
 Solr.
        ...
 : doc.addField("text_snippet_t", article.getSnippet(1000));

 ok, well first off: that's not the field where you are having
 problems, is it?  if i remember correctly from your previous posts, wasn't the
 response getting aborted in the middle of the Contents field?

 : and a maximum of 1K chars if its bigger. I initialized this String
 : from the DB by using the String constructor where I

Re: Responses getting truncated

2009-08-28 Thread Rupert Fiasco
Yes, I am hitting the Solr server directly (medsolr1.colo:9007)

Versions / architectures:

Jetty(6.1.3)

o...@medsolr1 ~ $ uname -a
Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

o...@medsolr1 ~ $ java -version
java version 1.6.0_11
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)


I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

-Rupert

On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley <ysee...@gmail.com> wrote:
 On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
 If I run these through curl on the command its
 truncated and if I run the search through the web-based admin panel
 then I get an XML parse error.

 Are you running curl directly against the solr server, or going
 through a load balancer?  Cutting out the middle-men using curl was a
 great idea - just make sure to go all the way.

 At first I thought it could possibly be a FastWriter bug (internal
 Solr class), but that's only used on the TextWriter (JSON, Python,
 Ruby) based formats, not on the original XML format.

 It really looks like you're hitting a lower-level IO buffering bug
 (esp when you see a response starting off with the tail of another
 response).  That doesn't look like it could be a Solr bug... but
 rather smells like a thread safety bug in the servlet container.

 What type of machine are you running on?  What JVM?
 You could try upgrading your version of Jetty, the JVM, or try
 switching to Tomcat.

 -Yonik
 http://www.lucidimagination.com


 This appears to have just started recently and the only thing we have
 done is change our indexer from a PHP one to a Java one, but
 functionally they are identical.

 Any thoughts? Thanks in advance.

 - Rupert




Re: Responses getting truncated

2009-08-28 Thread Rupert Fiasco
I deployed LucidWorks with my existing solrconfig / schema and
re-indexed my data into it and pushed it out to production, we'll see
how it stacks up over the weekend. Already queries that were breaking
on the prior Jetty/stock Solr setup are now working - but I have seen
it before where upon an initial re-index things work OK then a couple
of days later they break.

Keep y'all posted.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
 Yes, I am hitting the Solr server directly (medsolr1.colo:9007)

 Versions / architectures:

 Jetty(6.1.3)

 o...@medsolr1 ~ $ uname -a
 Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
 x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

 o...@medsolr1 ~ $ java -version
 java version 1.6.0_11
 Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
 Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)


 I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

 -Rupert

 On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley <ysee...@gmail.com> wrote:
 On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
 If I run these through curl on the command its
 truncated and if I run the search through the web-based admin panel
 then I get an XML parse error.

 Are you running curl directly against the solr server, or going
 through a load balancer?  Cutting out the middle-men using curl was a
 great idea - just make sure to go all the way.

 At first I thought it could possibly be a FastWriter bug (internal
 Solr class), but that's only used on the TextWriter (JSON, Python,
 Ruby) based formats, not on the original XML format.

 It really looks like you're hitting a lower-level IO buffering bug
 (esp when you see a response starting off with the tail of another
 response).  That doesn't look like it could be a Solr bug... but
 rather smells like a thread safety bug in the servlet container.

 What type of machine are you running on?  What JVM?
 You could try upgrading your version of Jetty, the JVM, or try
 switching to Tomcat.

 -Yonik
 http://www.lucidimagination.com


 This appears to have just started recently and the only thing we have
 done is change our indexer from a PHP one to a Java one, but
 functionally they are identical.

 Any thoughts? Thanks in advance.

 - Rupert





Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
Using wt=json also yields an invalid document. So after more
investigation it appears that I can always break the response by
pulling back a specific field via the fl parameter. If I leave off a
field then the response is valid, if I include it then Solr yields an
invalid document - a truncated document. This happens in any response
format (xml, json, ruby).

I am using the SolrJ client to add documents to my index. My field
is a normal text field type and the text itself is the first 1000
characters of an article.

 It can very well be an issue with the data itself. For example, if the data
 contains un-escaped characters which invalidates the response

When I look at the document using wt=xml, all XML entities are
escaped. When I look at it under wt=ruby, all single quotes are
escaped, and the same goes for json, so it appears that all escaping is
taking place. The core problem seems to be that the document is just
truncated - it just plain ends. Jetty's log says it's sending back an
HTTP 200, so as far as it is concerned all is well.

Any ideas on how I can dig deeper?

Thanks
-Rupert


On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness <ubon...@gmail.com> wrote:
 It can very well be an issue with the data itself. For example, if the data
 contains un-escaped characters which invalidates the response. I don't know
 much about ruby, but what do you get with wt=json?

 Rupert Fiasco wrote:

 I am seeing our responses getting truncated if and only if I search on
 our main text field.

 E.g. I just do some basic like

 title_t:arthritis

 Then I get a valid document back. But if I add in our larger text field:

 title_t:arthritis OR text_t:arthritis

 then the resultant document is NOT valid XML (if using wt=xml) or Ruby
 (using wt=ruby). If I run these through curl on the command its
 truncated and if I run the search through the web-based admin panel
 then I get an XML parse error.

 This appears to have just started recently and the only thing we have
 done is change our indexer from a PHP one to a Java one, but
 functionally they are identical.

 Any thoughts? Thanks in advance.

 - Rupert





Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
The text file at:

http://brockwine.com/solr.txt

Represents one of these truncated responses (this one in XML). It
starts out great, then look at the bottom, boom, game over. :)

I found this document by first running our bigger search, which breaks,
and then zeroing in on a specific broken document by using the rows/start
parameters. But there are an unknown number of these broken
documents - a lot, I presume.

-Rupert

On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh <avl...@gmail.com> wrote:
 Can you copy-paste the source data indexed in this field which causes the
 error?

 Cheers
 Avlesh

 On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco rufia...@gmail.com wrote:

 Using wt=json also yields an invalid document. So after more
 investigation it appears that I can always break the response by
 pulling back a specific field via the fl parameter. If I leave off a
 field then the response is valid, if I include it then Solr yields an
 invalid document - a truncated document. This happens in any response
 format (xml, json, ruby).

 I am using the SolrJ client to add documents to in my index. My field
 is a normal text field type and the text itself is the first 1000
 characters of an article.

  It can very well be an issue with the data itself. For example, if the
 data
  contains un-escaped characters which invalidates the response

 When I look at the document in using wt=xml then all XML entities are
 escaped. When I look at it under wt=ruby then all single quotes are
 escaped, same for json, so it appears that all escaping it taking
 place. The core problem seems to be that the document is just
 truncated - it just plain end of files. Jetty's log says its sending
 back an HTTP 200 so all is well.

 Any ideas on how I can dig deeper?

 Thanks
 -Rupert


  On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness <ubon...@gmail.com> wrote:
  It can very well be an issue with the data itself. For example, if the
 data
  contains un-escaped characters which invalidates the response. I don't
 know
  much about ruby, but what do you get with wt=json?
 
  Rupert Fiasco wrote:
 
  I am seeing our responses getting truncated if and only if I search on
  our main text field.
 
  E.g. I just do some basic like
 
  title_t:arthritis
 
  Then I get a valid document back. But if I add in our larger text field:
 
  title_t:arthritis OR text_t:arthritis
 
  then the resultant document is NOT valid XML (if using wt=xml) or Ruby
  (using wt=ruby). If I run these through curl on the command its
  truncated and if I run the search through the web-based admin panel
  then I get an XML parse error.
 
  This appears to have just started recently and the only thing we have
  done is change our indexer from a PHP one to a Java one, but
  functionally they are identical.
 
  Any thoughts? Thanks in advance.
 
  - Rupert
 
 
 




Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
So I whipped up a quick SolrJ client and ran it against the document
that I referenced earlier. When I retrieve the doc and just print its
field/value pairs to stdout it ends like this:

http://brockwine.com/images/output1.png

It appears to be some kind of garbage characters.
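The check itself is only a few lines of SolrJ; a rough sketch of the
kind of field dump described (the server URL and doc id are
placeholders, and exception handling is omitted):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

SolrServer server = new CommonsHttpSolrServer("http://medsolr1.colo:9007/solr");
QueryResponse rsp = server.query(new SolrQuery("id:SOME_DOC_ID"));
for (SolrDocument doc : rsp.getResults()) {
    // print every field/value pair, the way the screenshot above shows
    for (String field : doc.getFieldNames()) {
        System.out.println(field + " = " + doc.getFieldValue(field));
    }
}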

-Rupert

On Tue, Aug 25, 2009 at 12:19 PM, Uri Boness <ubon...@gmail.com> wrote:
 Hi,

 This is a very strange behavior and the fact that it is caused by one
 specific field, again, leads me to believe it's still a data issue. Did you
 try using SolrJ to query the data as well? If the same thing happens when
 using the binary protocol, then it's probably not a data issue. On the other
 hand, if it works fine, then at least you can inspect the data to see where
 things go wrong. Sorry for insisting on that, but I cannot think of anything
 else that can cause this problem.

 If anyone else has a better idea, I'm actually very curious to hear about
 it.

 Uri

 Rupert Fiasco wrote:

 The text file at:

 http://brockwine.com/solr.txt

 Represents one of these truncated responses (this one in XML). It
 starts out great, then look at the bottom, boom, game over. :)

 I found this document by first running our bigger search which breaks
 and then zeroing in a specific broken document by using the rows/start
 parameters. But there are any unknown number of these broken
 documents - a lot I presume.

 -Rupert

 On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh <avl...@gmail.com> wrote:


 Can you copy-paste the source data indexed in this field which causes the
 error?

 Cheers
 Avlesh

 On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco rufia...@gmail.com
 wrote:



 Using wt=json also yields an invalid document. So after more
 investigation it appears that I can always break the response by
 pulling back a specific field via the fl parameter. If I leave off a
 field then the response is valid, if I include it then Solr yields an
 invalid document - a truncated document. This happens in any response
 format (xml, json, ruby).

 I am using the SolrJ client to add documents to in my index. My field
 is a normal text field type and the text itself is the first 1000
 characters of an article.



 It can very well be an issue with the data itself. For example, if the


 data


 contains un-escaped characters which invalidates the response


 When I look at the document in using wt=xml then all XML entities are
 escaped. When I look at it under wt=ruby then all single quotes are
 escaped, same for json, so it appears that all escaping it taking
 place. The core problem seems to be that the document is just
 truncated - it just plain end of files. Jetty's log says its sending
 back an HTTP 200 so all is well.

 Any ideas on how I can dig deeper?

 Thanks
 -Rupert


 On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness <ubon...@gmail.com> wrote:


 It can very well be an issue with the data itself. For example, if the


 data


 contains un-escaped characters which invalidates the response. I don't


 know


 much about ruby, but what do you get with wt=json?

 Rupert Fiasco wrote:


 I am seeing our responses getting truncated if and only if I search on
 our main text field.

 E.g. I just do some basic like

 title_t:arthritis

 Then I get a valid document back. But if I add in our larger text
 field:

 title_t:arthritis OR text_t:arthritis

 then the resultant document is NOT valid XML (if using wt=xml) or Ruby
 (using wt=ruby). If I run these through curl on the command its
 truncated and if I run the search through the web-based admin panel
 then I get an XML parse error.

 This appears to have just started recently and the only thing we have
 done is change our indexer from a PHP one to a Java one, but
 functionally they are identical.

 Any thoughts? Thanks in advance.

 - Rupert








Re: Responses getting truncated

2009-08-25 Thread Rupert Fiasco
 1.  Exactly which version of Solr / SolrJ are you using?

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
Latest SolrJ that I downloaded a couple of days ago.

 Can you put the original (pre solr, pre solrj, raw untouched, etc...)
 file that this solr doc came from online somewhere?

We are running an instance of MediaWiki so the text goes through a
couple of transformations: wiki markup -> html -> plain text.
It's at this last step that I take a snippet and insert that into Solr.

My snippet code is:

// article.java
public String getSnippet(int maxlen) {
  int length = getPlainText().length() >= maxlen ? maxlen : getPlainText().length();
  return getPlainText().substring(0, length);
}
// ... later on, add to Solr
doc.addField("text_snippet_t", article.getSnippet(1000));

So in theory, I am getting the whole article if it's less than 1K chars
and a maximum of 1K chars if it's bigger. I initialized this String
from the DB by using the String constructor where I pass in the
charset/collation:

text = new String(textFromDB, "UTF-8");

So to the best of my knowledge, accessing a substring of a UTF-8
encoded string should not break up a UTF-8 code point. Is that an
incorrect assumption? If so, what is the best way to break up a UTF-8
encoded string and get approximately that many characters? Exactness
is not a requirement.
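A note on that question: Java's String.substring works on UTF-16 code
units, so the thing that can actually get split is a surrogate pair (a
character outside the BMP), which then turns into garbage when the
string is re-encoded as UTF-8. A minimal guard looks roughly like this
(a sketch; the method name is illustrative):

public static String safeSnippet(String text, int maxLen) {
  if (text.length() <= maxLen) {
    return text;
  }
  int end = maxLen;
  // don't cut between a high and a low surrogate
  if (Character.isHighSurrogate(text.charAt(end - 1))) {
    end--;
  }
  return text.substring(0, end);
}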

-Rupert

On Tue, Aug 25, 2009 at 5:37 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

 1.  Exactly which version of Solr / SolrJ are you using?

 2. ...

 :  I am using the SolrJ client to add documents to in my index. My field
 :  is a normal text field type and the text itself is the first 1000
 :  characters of an article.

  Can you put the original (pre solr, pre solrj, raw untouched, etc...)
  file that this solr doc came from online somewhere?

  What does your *indexing* code look like? ... Can you add some debugging to
  the SolrJ client when you *add* this doc to print out exactly what those
  1000 characters are?

 My hunch: when you are extracting the first 1000 characters, you're
 getting only the first half of a character ...or... you are getting docs
 with fewer than 1000 characters and winding up with a buffer (char[]?) that
 has garbage at the end; SolrJ isn't complaining on the way in, but
 something farther down (maybe before indexing, maybe after) is seeing that
 garbage and cutting the field off at that point.



 -Hoss




Responses getting truncated

2009-08-24 Thread Rupert Fiasco
I am seeing our responses getting truncated if and only if I search on
our main text field.

E.g. I just do a basic query like

title_t:arthritis

Then I get a valid document back. But if I add in our larger text field:

title_t:arthritis OR text_t:arthritis

then the resultant document is NOT valid XML (if using wt=xml) or Ruby
(using wt=ruby). If I run these through curl on the command line it's
truncated, and if I run the search through the web-based admin panel
then I get an XML parse error.
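For concreteness, the sort of curl request I mean is along these lines
(host, port and core path being whatever the install uses):

curl 'http://localhost:8983/solr/select?q=title_t:arthritis+OR+text_t:arthritis&wt=xml'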

This appears to have just started recently and the only thing we have
done is change our indexer from a PHP one to a Java one, but
functionally they are identical.

Any thoughts? Thanks in advance.

- Rupert


Question mark glyphs in indexed content

2009-08-10 Thread Rupert Fiasco
Hello, I am using the latest SolrJ to index content. When I look at
that content in the Solr Admin web utility I see weird characters like
this:

http://brockwine.com/images/solrglyphs.png

When I look at the text in the MySQL DB those chars appear to just be
plain hyphens. The MySQL table character set is utf8 and the collation
is utf8.

Environment:
OS X 10.5.8
java version 1.5.0_19
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02-304)
Java HotSpot(TM) Client VM (build 1.5.0_19-137, mixed mode, sharing)

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
Lucene Specification Version: 2.4-dev
Lucene Implementation Version: 2.4-dev 691741 - 2008-09-03 15:25:16

Jetty 6.1.3

Any thoughts?

Thanks
/Rupert


Indexing issue with XML control characters

2009-07-20 Thread Rupert Fiasco
During indexing I will often get this error:

SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal
character ((CTRL-CHAR, code 3))
 at [row,col {unknown-source}]: [2,1]
at 
com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)


By looking at this list and elsewhere I know that I need to filter out
most control characters so I have been employing this regex:

/[\x00-\x08\x0B\x0C\x0E-\x1F]/
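In the Java indexer the equivalent stripping step is roughly the
following (a sketch; the pattern mirrors the regex above and the
method name is made up):

import java.util.regex.Pattern;

// C0 control chars except tab (0x09), LF (0x0A) and CR (0x0D), same range as above
private static final Pattern INVALID_XML_CHARS =
    Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

static String stripControlChars(String s) {
  return INVALID_XML_CHARS.matcher(s).replaceAll("");
}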

But I still get the error. What is strange is that if I re-run my
indexing process after a failure it will work on the previously failed
node and then error out on another node some time later. That is, it
is not deterministic. If I look at the text that is being indexed, it's
about as clean as you can get (a bunch of medical keywords like
"leg bones" and "nose").

Any ideas would be greatly appreciated.

The platform is:

Solr implementation version: 1.3.0 694707
Lucene implementation version: 2.4-dev 691741
Mac OS X 10.5.7
JVM 1.5.0_19-b02-304


Thanks
/Rupert


search returns matches for non-starting wildcard prefix queries

2009-02-09 Thread Rupert Fiasco
(I think I have a horrible subject line but I wasn't sure how to
properly explain myself.)

I have a text field that I store last names in (and everything is
lowercased prior to insertion, not sure if that matters).

The field is described as:

   <field name="last_name" type="text" indexed="true" stored="false"
multiValued="true"/>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>



When running a query such as

last_name:m*

I get data back like:

Pashman, Md
Maldonado
Manolidis
Fleisher, M.D., D.Ht., D.A.B.F.M.
Merino
Monroe
McLay
Maltsberger
McMurtray
Murphy Md
Loeb Md


As you can see most are perfect matches, but there are some that
*don't* start with the letter M but do have an M at the beginning of
another word in the field.

Wouldn't the query m* just query for matches where the first letter
is M in the whole field, and not within another word in that field?

Do I need to make another field to store last names and not perform
any analysis on that field (akin to a spell check field)?
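A rough sketch of that second-field approach (the field name is made
up): an unanalyzed string field populated via copyField, so the whole
last name is indexed as a single term and last_name_exact:m* only
matches values that start with m:

<field name="last_name_exact" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="last_name" dest="last_name_exact"/>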

Thanks in advance.

-Rupert


Issuing just a spell check query

2009-02-06 Thread Rupert Fiasco
The docs for the SpellCheckComponent say

The SpellCheckComponent is designed to provide inline spell checking
of queries without having to issue separate requests.

I would like to issue just a spell check query; I don't care about it
being inline and piggy-backing off a normal search query.

How would I achieve this?

I tried monkeying with making a new requestHandler, but using
class="solr.SearchHandler" always tries to do a normal search.

I succeeded in adding inline spell checking to the default request
handler by *adding*

 <arr name="last-components">
   <str>spellcheck</str>
 </arr>

to its requestHandler config - I would like to *remove* the default
search component - maybe by making a new request handler which just
does spell checking?

Is something like this possible?



  <requestHandler name="/spellcheck" class="solr.SearchHandler">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="spellcheck.count">5</str>
    </lst>

    <arr name="first-components">
      <str>spellcheck</str>
    </arr>

    <!-- remove default search component -->
    <arr name="remove-components">
      <str>default</str>
    </arr>

  </requestHandler>





Now, I can sort of achieve what I want by running a normal search
with a dummy value for my q parameter (for me "00" works): I get no
search docs back, but I do get the spell suggestions I want, driven by
the spellcheck.q parameter.
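As an actual request, that workaround is along the lines of (host and
core path being whatever the install uses):

http://localhost:8983/solr/select?q=00&spellcheck=true&spellcheck.q=diabtes&spellcheck.count=5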

But this seems very hacky and Solr is still having to run a search
against my dummy value.

A roundabout way of asking: how can I fire off *just* a spell check query?

Thanks in advance
-Rupert


Re: Issuing just a spell check query

2009-02-06 Thread Rupert Fiasco
But it's deprecated (??)

-Rupert

On Fri, Feb 6, 2009 at 11:51 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Rupert,

 You could use the SpellCheck*Handler* to achieve this.


 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




 
 From: Rupert Fiasco rufia...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, February 6, 2009 2:47:19 PM
 Subject: Issuing just a spell check query

 The docs for the SpellCheckComponent say

 The SpellCheckComponent is designed to provide inline spell checking
 of queries without having to issue separate requests.

 I would like to issue just a spell check query, I dont care about it
 being inline and piggy-backing off a normal search query.

 How would I achieve this?

 I tried monkeying with making a new requestHandler but using class =
 solr.SearchHandler always tries to do a normal search.

 I succeeded in adding inline spell checking to the default request
 handler by *adding*

 <arr name="last-components">
   <str>spellcheck</str>
 </arr>

 to its requestHandler config - I would like to *remove* the default
 search component - maybe by making a new request handler which just
 does spell checking?

 Is something like this possible?



   <requestHandler name="/spellcheck" class="solr.SearchHandler">
     <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="spellcheck.count">5</str>
     </lst>

     <arr name="first-components">
       <str>spellcheck</str>
     </arr>

     <!-- remove default search component -->
     <arr name="remove-components">
       <str>default</str>
     </arr>

   </requestHandler>





 Now, I can sort of achieve what I want by in fact a normal search but
 then using a dummy value for my q parameter (for me 00
 works) and then I get no search docs back, but I do get the spell
 suggestions I want, driven by the spellcheck.q parameter.

 But this seems very hacky and Solr is still having to run a search
 against my dummy value.

 A roundabout way of asking: how can I fire off *just* a spell check query?

 Thanks in advance
 -Rupert



Spell checking not returning full terms

2009-02-04 Thread Rupert Fiasco
We are using Solr 1.3 and trying to get spell checking functionality.

FYI, our index contains a lot of medical terms (which might or might
not make a difference as they are not English-y words, if that makes
any sense?)

If I specify a spellcheck query of spellcheck.q=diabtes

I get suggestions of:

<str>diabet</str>
<str>diabetogen</str>
<str>dilat</str>
<str>diamet</str>
<str>diatom</str>
<str>diastol</str>
<str>diactin</str>
<str>dialect</str>

If I re-mis-spell Diabetes to q=diabets then I get no suggestions.

So first off two things:

1) Why would leaving out one e rather than the other affect the spelling
suggestions so substantially?
2) In the former list of suggestions, notice the first suggestion is
diabet, which isn't all that helpful; it should return something like
diabetes or maybe even diabetic.

Note that if I do a normal search against diabetes then I get a ton
of results, in other words, our index is filled with terms of
diabetes.

My relevant solrconfig is:


<str name="queryAnalyzerFieldType">text</str>

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text_t</str>
  <str name="spellcheckIndexDir">./spellchecker1</str>
  <str name="accuracy">0.1</str>
</lst>

<lst name="spellchecker">
  <str name="name">jarowinkler</str>
  <str name="field">text_t</str>
  <!-- Use a different Distance Measure -->
  <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
  <str name="spellcheckIndexDir">./spellchecker2</str>
  <str name="accuracy">0.1</str>
</lst>

and I have

spellcheck.count = 8

Notice that I severely bumped down the accuracy setting to get more
results. Bumping it up higher yields fewer results (I am not sure what
the setting really means, so I don't know in which direction I want to
change that value - I am guessing that a lower value allows for more
mis-spellings, e.g. it's more promiscuous).

Our text and text_t fields are defined in schema.xml as:

<field name="text" type="text" indexed="true" stored="false"
multiValued="true"/>
and
<dynamicField name="*_t" type="text" indexed="true"
stored="true" multiValued="true" />

Any help would be appreciated.

Thanks
-Rupert


Re: Spell checking not returning full terms

2009-02-04 Thread Rupert Fiasco
Awesome! After reading up on the links you sent me I got it all working. Thanks!

FYI - I did previously come across one of the links you sent over:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

But what threw me off is that when I started reading about it
yesterday, the first paragraph says that this handler is deprecated
and to use SpellCheckComponent - so at that point I stopped reading
and went over to the component page. If I had kept reading I would
have encountered all of the gritty details that I in fact needed to
get it to work. The wiki entry makes it seem old, deprecated, and no
longer relevant, but it certainly is still relevant.
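Grant's suggestion (quoted below) boils down to spell checking against
a field with only tokenization and lowercasing; a rough sketch of such
a setup (type and field names are illustrative) is:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="text_t" dest="spell"/>

with the spellchecker's <str name="field"> pointed at spell instead of text_t.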

-Rupert

On Wed, Feb 4, 2009 at 11:57 AM, Grant Ingersoll gsing...@apache.org wrote:
 I'm guessing the field you are checking against is being stemmed.  The field
 you spell check against should have minimal analysis done to it, i.e.
 tokenization and probably downcasing.  See
 http://wiki.apache.org/solr/SpellCheckComponent and
 http://wiki.apache.org/solr/SpellCheckerRequestHandler for tips on how to
 handle analysis for spelling.

 On Feb 4, 2009, at 2:33 PM, Rupert Fiasco wrote:

 We are using Solr 1.3 and trying to get spell checking functionality.

 FYI, our index contains a lot of medical terms (which might or might
 not make a difference as they are not English-y words, if that makes
 any sense?)

 If I specify a spellcheck query of spellcheck.q=diabtes

 I get suggestions of:

 <str>diabet</str>
 <str>diabetogen</str>
 <str>dilat</str>
 <str>diamet</str>
 <str>diatom</str>
 <str>diastol</str>
 <str>diactin</str>
 <str>dialect</str>

 If I re-mis-spell Diabetes to q=diabets then I go no suggestions.

 So first off two things:

 1) Why would leaving out one e over the other affect the spelling
 suggestions so substantially?
 2) In the former list of suggestions, notice the first suggestion is
 diabet, which isnt all that helpful, it should return something like
 diabetes or maybe even diabetic.

 Note that if I do a normal search against diabetes then I get a ton
 of results, in other words, our index is filled with terms of
 diabetes.

 My relevant solrconfig is:


    <str name="queryAnalyzerFieldType">text</str>

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">text_t</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
      <str name="accuracy">0.1</str>
    </lst>

    <lst name="spellchecker">
      <str name="name">jarowinkler</str>
      <str name="field">text_t</str>
      <!-- Use a different Distance Measure -->
      <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      <str name="spellcheckIndexDir">./spellchecker2</str>
      <str name="accuracy">0.1</str>
    </lst>

 and I have

 spellcheck.count = 8

 Notice that I severely bumped down the accuracy setting to get more
 results. Bumping it up higher yields less results (not sure what
 setting really meant so I dont know in what direction I want to change
 that value - I am guessing that a lower value allows for more
 mis-spellings, e.g. its more promiscuous).

 Our text and text_t fields are defined in schema.xml as:

 <field name="text" type="text" indexed="true" stored="false"
 multiValued="true"/>
 and
 <dynamicField name="*_t" type="text" indexed="true"
 stored="true" multiValued="true" />

 Any help would be appreciated.

 Thanks
 -Rupert

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ














Understanding prefix query searching

2008-10-21 Thread Rupert Fiasco
So I tried to look on Google for an answer to this before I posted
here. Basically I am trying to understand how prefix searching works.

I have a dynamic text field (indexed and stored) full_name_t

I have some data in my index, specifically a record with full_name_t =
"Robert P Page".

A search on:

full_name_t:Robert

yields that document, however a search on

full_name_t:Robert*

yields nothing.

Why?

To get around this I am doing something like

(full_name_t:Robert OR full_name_t:Robert*)

But I would like to understand why the wildcard doesn't work - shouldn't
it match anything that starts with Robert?

Thanks

-Rupert