Distributed search component.

2011-04-04 Thread Rok Rejc
Hi all,

I am trying to create a distributed search component in solr which is quite
difficult (at least for me, because I am new in solr and java). Anyway I
have looked into solr source (FacetComponent, TermsComponent...) and created
my own search component (it extends SearchComponent) but I still have two
questions (for now):

1.) In the prepare method I have the following code:

String shards = params.get(ShardParams.SHARDS);
if (shards != null) {
ListString lst = StrUtils.splitSmart(shards, ,, true);
rb.shards = lst.toArray(new String[lst.size()]);
rb.isDistrib = true;
}

If I remove rb.isDistrib = true; line the distributed methods are not
called. But to set the isDistrib my code must be in the
org.apache.solr.handler.component package (because it is not visible from
the outside). Is this  correct procedure/behaviour/design?

2.) Functions (process, distributedProcess, handleResponses...) are all
called properly. I can read partial responses in the handleResponses but I
don't know how to build final response. I see that for example
TermsComponent has a helper in the ResponseBuilder which collects all the
terms. Is this the only way (to edit the ResponseBuilder source), or can I
achive that without editing the solr's source?

Many thanks,

Rok


Re: Faceting on multivalued field

2011-04-04 Thread Kaushik Chakraborty
Are you implying to change the DB query of the nested entity which fetches
the comments (query is in my post) or something can be done during the index
like using Transformers etc. ?

Thanks,
Kaushik


On Mon, Apr 4, 2011 at 8:07 AM, Erick Erickson erickerick...@gmail.comwrote:

 Why not count them on the way in and just store that number along
 with the original e-mail?

 Best
 Erick

 On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty kaych...@gmail.com
 wrote:

  Ok. My expectation was since comment_post_id is a MultiValued field
 hence
  it would appear multiple times (i.e. for each comment). And hence when I
  would facet with that field it would also give me the count of those many
  documents where comment_post_id appears.
 
  My requirement is getting total for every document i.e. finding number of
  comments per post in the whole corpus. To explain it more clearly, I'm
  getting a result xml something like this
 
  str name=post_id46/str
  str name=post_textHello World/str
  str name=person_id20/str
  arr name=comment_id
 str9/str
 str10/str
  /arr
  arr name=comment_person_id
str19/str
str2/str
  /arr
  arr name=comment_post_id
   str46/str
   str46/str
  /arr
  arr name=comment_text
strHello - from World/str
strHi/str
  /arr
 
  lst name=facet_fields
   lst name=comment_post_id
  *int name=461/int*
 
  I need the count to be 2 as the post 46 has 2 comments.
 
   What other way can I approach?
 
  Thanks,
  Kaushik
 
 
  On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   Hmmm, I think you're misunderstanding faceting. It's counting the
   number of documents that have a particular value. So if you're
   faceting on comment_post_id, there is one and only one document
   with that value (assuming that the comment_post_ids are unique).
   Which is what's being reported This will be quite expensive on a
   large corpus, BTW.
  
   Is your task to show the totals for *every* document in your corpus or
   just the ones in a display page? Because if the latter, your app could
   just count up the number of elements in the XML returned for the
   multiValued comments field.
  
   If that's not relevant, could you explain a bit more why you need this
   count?
  
   Best
   Erick
  
   On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty 
 kaych...@gmail.com
   wrote:
  
Hi,
   
My index contains a root entity Post and a child entity Comments.
   Each
post can have multiple comments. data-config.xml:
   
document
   entity name=posts transformer=TemplateTransformer
dataSource=jdbc query=
   
   field column=post_id /
   field column=post_text/
   field column=person_id/
   entity name=comments dataSource=jdbc
 query=select
  *
from comments where post_id = ${posts.post_id} 
   field column=comment_id /
   field column=comment_text /
   field column=comment_person_id /
   field column=comment_post_id /
  /entity
   /entity
/document
   
The schema has all columns of comment entity as MultiValued
 fields
   and
all fields are indexed  stored. My requirement is to count the
 number
  of
comments for each post. Approach I'm taking is to query on *:* and
faceting the result on comment_post_id so that it gives the count
 of
comment occurred for that post.
   
But I'm getting incorrect result e.g. if a post has 2 comments, the
multivalued fields are populated alright but the facet count is
 coming
  as
   1
(for that post_id). What else do I need to do?
   
   
Thanks,
Kaushik
   
  
 



Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi,

 

I would like to hear your opinion about the MLT feature and if it's a
good solution to what I need to implement.

 

My index has fields like: headline, body and medianame.

What I need to do is, before adding a new doc, verify if a similar doc
exists for this media.

 

My idea is to use the MorelikeThisHandler
(http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:

 

For each new doc, perform a MLT search with q= medianame and
stream.body=headline+bodytext.

If no similar docs are found than I can safely add the doc.

 

Is this feasible using the MLT handler? Is it a good approach? Are there
a better way to perform this comparison?

 

Thank you for your help.

 

Best regards,



Frederico Azeiteiro

 



Re: Using MLT feature

2011-04-04 Thread Chris Fauerbach
Do you want to not index if something similar? Or don't index if exact.   I 
would look into a hash code of the document if you don't want to index exact.   
 Similar though, I think has to be based off a document in the index.   

On Apr 4, 2011, at 5:16, Frederico Azeiteiro frederico.azeite...@cision.com 
wrote:

 Hi,
 
 
 
 I would like to hear your opinion about the MLT feature and if it's a
 good solution to what I need to implement.
 
 
 
 My index has fields like: headline, body and medianame.
 
 What I need to do is, before adding a new doc, verify if a similar doc
 exists for this media.
 
 
 
 My idea is to use the MorelikeThisHandler
 (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
 
 
 
 For each new doc, perform a MLT search with q= medianame and
 stream.body=headline+bodytext.
 
 If no similar docs are found than I can safely add the doc.
 
 
 
 Is this feasible using the MLT handler? Is it a good approach? Are there
 a better way to perform this comparison?
 
 
 
 Thank you for your help.
 
 
 
 Best regards,
 
 
 
 Frederico Azeiteiro
 
 
 


Mongo REST interface and full data import

2011-04-04 Thread andrew_s
Hi everyone,

I'm trying to make a simple data import from MongoDB into Solr using REST
interface.

As an test example I've created schecma.xml like:
?xml version=1.0 ?

  
   
  

 
  
  
  
  
 

 
 isbn

 
 title

 
 



and data-import.xml as:












Unfortunately it's not working and I'm stuck  on this place.

Could you please advise how correctly parser JSON format data?


Data format looks like:
{
  offset : 0,
  rows: [
{ _id : { $oid : 4d9829412c8bd1064400 }, isbn : 716739356,
title : Proteins, description :  } ,
{ _id : { $oid : 4d9829412c8bd1064401 }, isbn :
144433056X, title : How to Assess Doctors and Health Professionals,
description :  } ,
{ _id : { $oid : 4d9829412c8bd1064402 }, isbn :
1406208159, title : Freestyle: Time Travel Guides: Pack B,
description : Takes you on a trip through history to visit the great
ancient civilisations. } ,
  total_rows : 3 ,
  query : {} ,
  millis : 0
}


Thank you.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2774479.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi,

The ideia is don't index if something similar (headline+bodytext) for
the same exact medianame.

Do you mean I would need to index the doc first (maybe in a temp index)
and then use the MLT feature to find similar docs before adding to final
index?

Thanks,
Frederico


-Original Message-
From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com] 
Sent: segunda-feira, 4 de Abril de 2011 10:22
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

Do you want to not index if something similar? Or don't index if exact.
I would look into a hash code of the document if you don't want to index
exact.Similar though, I think has to be based off a document in the
index.   

On Apr 4, 2011, at 5:16, Frederico Azeiteiro
frederico.azeite...@cision.com wrote:

 Hi,
 
 
 
 I would like to hear your opinion about the MLT feature and if it's a
 good solution to what I need to implement.
 
 
 
 My index has fields like: headline, body and medianame.
 
 What I need to do is, before adding a new doc, verify if a similar doc
 exists for this media.
 
 
 
 My idea is to use the MorelikeThisHandler
 (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
way:
 
 
 
 For each new doc, perform a MLT search with q= medianame and
 stream.body=headline+bodytext.
 
 If no similar docs are found than I can safely add the doc.
 
 
 
 Is this feasible using the MLT handler? Is it a good approach? Are
there
 a better way to perform this comparison?
 
 
 
 Thank you for your help.
 
 
 
 Best regards,
 
 
 
 Frederico Azeiteiro
 
 
 


Re: Spellchecking Escaped Queries

2011-04-04 Thread Colin Vipurs
Thanks Chris, 

The field used for indexing and spellcheck is the same and is configured
like this:..


fieldType name=title stored=true indexed=true multiValued=false 
class=solr.TextField 
   analyzer
  tokenizer class=solr.WhitespaceTokenizerFactory/
 filter class=solr.SynonymFilterFactory synonyms=synonyms.txt 
ignoreCase=true expand=true/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.PatternReplaceFilterFactory
pattern=^([^!]+)\!([^!]+)$
replacement=$1i$2
replace=all/ 
 filter class=solr.WordDelimiterFilterFactory generateWordParts=1 
generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=1 
splitOnCaseChange=1 preserveOriginal=1/
 filter class=solr.ASCIIFoldingFilterFactory/
   /analyzer
/fieldType


I use the pattern replace filter to swap all instances of ! within a
word to i.  I know this part is working correctly as performing a
search works correctly.

The spellcheck is initialized like this:


searchComponent name=spellcheck class=solr.SpellCheckComponent
   str name=queryAnalyzerFieldTypetitle/str
   lst name=spellchecker
  str name=namedefault/str
  str name=fieldsearchfield/str
  str name=spellcheckIndexDir./spellchecker/str
  str name=buildOnCommitfalse/str
   /lst
/searchComponent

And is attached to as a component to my search handler.

Thanks,

Colin


 : I'm having an issue performing a spellcheck on some information and
 : search of the archive isn't helping.
 
 For this type of quesiton, there's not much feedback anyone can offer w/o 
 knowing exactly what analyzers you have configured for hte various 
 fieldtypes (both the field you index/search and the fieldtype used for 
 spellchecking)
 
 it's also fairly critical to know how you have the spellcheck component 
 configured.
 
 off the cuff: i'd guess that maybe WordDelimiterFilter is being used in a 
 wonky way given your usecase -- but like i said: would need to see the 
 configs to make a guess.
 
 
 -Hoss
 
 __
 This email has been scanned by the MessageLabs Email Security System.
 For more information please visit http://www.messagelabs.com/email 
 __


-- 


Colin Vipurs
Server Team Lead

Shazam Entertainment Ltd   
26-28 Hammersmith Grove, London W6 7HA
m:   +44 (0)  000 000   t: +44 (0) 20 8742 6820
w:www.shazam.com

Please consider the environment before printing this document

This e-mail and its contents are strictly private and confidential. It
must not be disclosed, distributed or copied without our prior consent.
If you have received this transmission in error, please notify Shazam
Entertainment immediately on: +44 (0) 020 8742 6820 and then delete it
from your system. Please note that the information contained herein
shall additionally constitute Confidential Information for the purposes
of any NDA between the recipient/s and Shazam Entertainment. Shazam
Entertainment Limited is incorporated in England and Wales under company
number 3998831 and its registered office is at 26-28 Hammersmith Grove,
London W6 7HA. 




__
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
__

Re: Spellchecking Escaped Queries

2011-04-04 Thread Colin Vipurs
Thanks Chris, 

The field used for indexing and spellcheck is the same and is configured
like this:..


fieldType name=title stored=true indexed=true multiValued=false 
class=solr.TextField 
   analyzer
  tokenizer class=solr.WhitespaceTokenizerFactory/
 filter class=solr.SynonymFilterFactory synonyms=synonyms.txt 
ignoreCase=true expand=true/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.PatternReplaceFilterFactory
pattern=^([^!]+)\!([^!]+)$
replacement=$1i$2
replace=all/ 
 filter class=solr.WordDelimiterFilterFactory generateWordParts=1 
generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=1 
splitOnCaseChange=1 preserveOriginal=1/
 filter class=solr.ASCIIFoldingFilterFactory/
   /analyzer
/fieldType


I use the pattern replace filter to swap all instances of ! within a
word to i.  I know this part is working correctly as performing a
search works correctly.

The spellcheck is initialized like this:


searchComponent name=spellcheck class=solr.SpellCheckComponent
   str name=queryAnalyzerFieldTypetitle/str
   lst name=spellchecker
  str name=namedefault/str
  str name=fieldsearchfield/str
  str name=spellcheckIndexDir./spellchecker/str
  str name=buildOnCommitfalse/str
   /lst
/searchComponent


This is attached as a component to my search handler and spellchecking
is done inline with the queries.

Thanks,

Colin



 : I'm having an issue performing a spellcheck on some information and
 : search of the archive isn't helping.
 
 For this type of quesiton, there's not much feedback anyone can offer w/o 
 knowing exactly what analyzers you have configured for hte various 
 fieldtypes (both the field you index/search and the fieldtype used for 
 spellchecking)
 
 it's also fairly critical to know how you have the spellcheck component 
 configured.
 
 off the cuff: i'd guess that maybe WordDelimiterFilter is being used in a 
 wonky way given your usecase -- but like i said: would need to see the 
 configs to make a guess.
 
 
 -Hoss
 
 __
 This email has been scanned by the MessageLabs Email Security System.
 For more information please visit http://www.messagelabs.com/email 
 __


-- 


Colin Vipurs
Server Team Lead

Shazam Entertainment Ltd   
26-28 Hammersmith Grove, London W6 7HA
m:   +44 (0)  000 000   t: +44 (0) 20 8742 6820
w:www.shazam.com

Please consider the environment before printing this document

This e-mail and its contents are strictly private and confidential. It
must not be disclosed, distributed or copied without our prior consent.
If you have received this transmission in error, please notify Shazam
Entertainment immediately on: +44 (0) 020 8742 6820 and then delete it
from your system. Please note that the information contained herein
shall additionally constitute Confidential Information for the purposes
of any NDA between the recipient/s and Shazam Entertainment. Shazam
Entertainment Limited is incorporated in England and Wales under company
number 3998831 and its registered office is at 26-28 Hammersmith Grove,
London W6 7HA. 






__
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
__

Re: Spellchecking Escaped Queries

2011-04-04 Thread Colin Vipurs
Apologies for the duplicate post.  I'm having Evolution problems


 Thanks Chris, 
 
 The field used for indexing and spellcheck is the same and is
 configured like this:..
 
 
 fieldType name=title stored=true indexed=true multiValued=false 
 class=solr.TextField 
analyzer
   tokenizer class=solr.WhitespaceTokenizerFactory/
  filter class=solr.SynonymFilterFactory synonyms=synonyms.txt 
 ignoreCase=true expand=true/
  filter class=solr.LowerCaseFilterFactory/
  filter class=solr.PatternReplaceFilterFactory
   pattern=^([^!]+)\!([^!]+)$
   replacement=$1i$2
   replace=all/ 
  filter class=solr.WordDelimiterFilterFactory 
 generateWordParts=1 generateNumberParts=1 catenateWords=1 
 catenateNumbers=0 catenateAll=1 splitOnCaseChange=1 
 preserveOriginal=1/
  filter class=solr.ASCIIFoldingFilterFactory/
/analyzer
 /fieldType
 
 
 I use the pattern replace filter to swap all instances of ! within a
 word to i.  I know this part is working correctly as performing a
 search works correctly.
 
 The spellcheck is initialized like this:
 
 
 searchComponent name=spellcheck class=solr.SpellCheckComponent
str name=queryAnalyzerFieldTypetitle/str
lst name=spellchecker
   str name=namedefault/str
   str name=fieldsearchfield/str
   str name=spellcheckIndexDir./spellchecker/str
   str name=buildOnCommitfalse/str
/lst
 /searchComponent
 
 And is attached to as a component to my search handler.
 
 Thanks,
 
 Colin
 
 
  : I'm having an issue performing a spellcheck on some information and
  : search of the archive isn't helping.
  
  For this type of quesiton, there's not much feedback anyone can offer w/o 
  knowing exactly what analyzers you have configured for hte various 
  fieldtypes (both the field you index/search and the fieldtype used for 
  spellchecking)
  
  it's also fairly critical to know how you have the spellcheck component 
  configured.
  
  off the cuff: i'd guess that maybe WordDelimiterFilter is being used in a 
  wonky way given your usecase -- but like i said: would need to see the 
  configs to make a guess.
  
  
  -Hoss
  
  __
  This email has been scanned by the MessageLabs Email Security System.
  For more information please visit http://www.messagelabs.com/email 
  __
 
 
 -- 
 
 
 Colin Vipurs
 Server Team Lead
 
 Shazam Entertainment Ltd   
 26-28 Hammersmith Grove, London W6 7HA
 m:   +44 (0)  000 000   t: +44 (0) 20 8742 6820
 w:www.shazam.com
 
 Please consider the environment before printing this document
 
 This e-mail and its contents are strictly private and confidential. It
 must not be disclosed, distributed or copied without our prior
 consent. If you have received this transmission in error, please
 notify Shazam Entertainment immediately on: +44 (0) 020 8742 6820 and
 then delete it from your system. Please note that the information
 contained herein shall additionally constitute Confidential
 Information for the purposes of any NDA between the recipient/s and
 Shazam Entertainment. Shazam Entertainment Limited is incorporated in
 England and Wales under company number 3998831 and its registered
 office is at 26-28 Hammersmith Grove, London W6 7HA. 
 
 
 
 
 __
 This email has been scanned by the MessageLabs Email Security System.
 For more information please visit http://www.messagelabs.com/email 
 __
 
 __
 This email has been scanned by the MessageLabs Email Security System.
 For more information please visit http://www.messagelabs.com/email 
 __


-- 


Colin Vipurs
Server Team Lead

Shazam Entertainment Ltd   
26-28 Hammersmith Grove, London W6 7HA
m:   +44 (0)  000 000   t: +44 (0) 20 8742 6820
w:www.shazam.com

Please consider the environment before printing this document

This e-mail and its contents are strictly private and confidential. It
must not be disclosed, distributed or copied without our prior consent.
If you have received this transmission in error, please notify Shazam
Entertainment immediately on: +44 (0) 020 8742 6820 and then delete it
from your system. Please note that the information contained herein
shall additionally constitute Confidential Information for the purposes
of any NDA between the recipient/s and Shazam Entertainment. Shazam
Entertainment Limited is incorporated in England and Wales under company
number 3998831 and its registered office is at 26-28 Hammersmith Grove,
London W6 7HA. 




__
This email has been scanned by the MessageLabs Email 

Re: Using MLT feature

2011-04-04 Thread Markus Jelsma
http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
 Hi,
 
 The ideia is don't index if something similar (headline+bodytext) for
 the same exact medianame.
 
 Do you mean I would need to index the doc first (maybe in a temp index)
 and then use the MLT feature to find similar docs before adding to final
 index?
 
 Thanks,
 Frederico
 
 
 -Original Message-
 From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
 Sent: segunda-feira, 4 de Abril de 2011 10:22
 To: solr-user@lucene.apache.org
 Subject: Re: Using MLT feature
 
 Do you want to not index if something similar? Or don't index if exact.
 I would look into a hash code of the document if you don't want to index
 exact.Similar though, I think has to be based off a document in the
 index.
 
 On Apr 4, 2011, at 5:16, Frederico Azeiteiro
 
 frederico.azeite...@cision.com wrote:
  Hi,
  
  
  
  I would like to hear your opinion about the MLT feature and if it's a
  good solution to what I need to implement.
  
  
  
  My index has fields like: headline, body and medianame.
  
  What I need to do is, before adding a new doc, verify if a similar doc
  exists for this media.
  
  
  
  My idea is to use the MorelikeThisHandler
  (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
 
 way:
  For each new doc, perform a MLT search with q= medianame and
  stream.body=headline+bodytext.
  
  If no similar docs are found than I can safely add the doc.
  
  
  
  Is this feasible using the MLT handler? Is it a good approach? Are
 
 there
 
  a better way to perform this comparison?
  
  
  
  Thank you for your help.
  
  
  
  Best regards,
  
  
  
  Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


help with Jetty log message

2011-04-04 Thread Matthieu Huin

Greetings all,

I am currently using solr as the backend behind a log aggregation and 
search system my team is developing. All was well and good until I 
noticed a test server crashing quite unexpectedly. We'd like to dig more 
into the incident but none of us has much experience with Jetty crash 
logs - not to mention that our Java is very rusty.


The crash log is joined as an attachment.

Could anyone help us with understanding what went wrong there ?

Also, would it be possible and/or wise to automatically restart the 
server in case of such a crash ?



Thanks for your help. If you need any extra info about that case, do not 
hesitate to ask !



Matthieu Huin


#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f051a618105, pid=5033, tid=1092958544
#
# JRE version: 6.0_18-b18
# Java VM: OpenJDK 64-Bit Server VM (16.0-b13 mixed mode linux-amd64 )
# Derivative: IcedTea6 1.8.3
# Distribution: Debian GNU/Linux 5.0.8 (lenny), package 6b18-1.8.3-2~lenny1
# Problematic frame:
# V  [libjvm.so+0x5dc105]
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
#

---  T H R E A D  ---

Current thread (0x0207d800):  GCTaskThread [stack: 0x41153000,0x41254000] [id=5036]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=128 (), si_addr=0x

Registers:
RAX=0x, RBX=0x7f04acba89a8, RCX=0x020d85d8, RDX=0x0030002e00300031
RSP=0x41252eb0, RBP=0x41252f20, RSI=0x, RDI=0x0030002e00300041
R8 =0x04a3523e2a33, R9 =0x7f051aae7188, R10=0x0001, R11=0x41252da0
R12=0x7f04f15b4368, R13=0x0035003000360034, R14=0x41252f50, R15=0x020d8070
RIP=0x7f051a618105, EFL=0x00010246, CSGSFS=0x0033, ERR=0x
  TRAPNO=0x000d

Top of Stack: (sp=0x41252eb0)
0x41252eb0:   04a3523e2a01 7f051aae7188
0x41252ec0:   04a00c960001e082 0004
0x41252ed0:   04a3523e2a33 0400
0x41252ee0:   04a3523e2a32 
0x41252ef0:   4097fb58 7f04acba89a8
0x41252f00:   020d8020 
0x41252f10:   41252f50 41252f5c
0x41252f20:   41252f90 7f051a61cb78
0x41252f30:   02196810 020d8070
0x41252f40:   0207d800 7f051a5a6f3b
0x41252f50:   7f04acba89a8 7b6e9b2f0207cf00
0x41252f60:   41252f90 02196810
0x41252f70:   0207d800 7f051a75254f
0x41252f80:    0207da90
0x41252f90:   41253070 7f051a3b4a10
0x41252fa0:   0207d800 41252fd0
0x41252fb0:   41253030 0207dac0
0x41252fc0:   0207dad0 0207dea8
0x41252fd0:   0207d800 0207deb0
0x41252fe0:   0207dee0 0207def0
0x41252ff0:   0207e2c8 41253000
0x41253000:   0207d800 0207deb0
0x41253010:   0207dee0 0207def0
0x41253020:   0207e2c8 0207e2d0
0x41253030:    
0x41253040:   0207ec30 
0x41253050:   0207ec30 0207eb50
0x41253060:   0207d800 1000
0x41253070:   41253140 7f051a5ce090
0x41253080:    
0x41253090:    
0x412530a0:     

Instructions: (pc=0x7f051a618105)
0x7f051a6180f5:   f6 0f 85 d4 00 00 00 49 8b 54 24 08 48 8d 7a 10
0x7f051a618105:   8b 4f 08 83 f9 00 0f 8e e4 00 00 00 89 c8 c1 f8 

Stack: [0x41153000,0x41254000],  sp=0x41252eb0,  free space=3ff0018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5dc105]
V  [libjvm.so+0x5e0b78]
V  [libjvm.so+0x378a10]
V  [libjvm.so+0x592090]


---  P R O C E S S  ---

Java Threads: ( = current thread )
  0x0540f000 JavaThread btpool0-12 [_thread_blocked, id=6839, stack(0x42623000,0x42724000)]
  0x0234a800 JavaThread btpool0-11 [_thread_blocked, id=6796, stack(0x42522000,0x42623000)]
  0x02754000 JavaThread btpool0-10 [_thread_blocked, id=6761, stack(0x42421000,0x42522000)]
  0x0246e800 JavaThread TimeLimitedCollector timer thread daemon [_thread_blocked, id=5307, stack(0x4232,0x42421000)]
  0x02317800 JavaThread MultiThreadedHttpConnectionManager cleanup daemon [_thread_blocked, id=5306, 

Re: help with Jetty log message

2011-04-04 Thread Upayavira
This is not Solr crashing, per se, it is your JVM. I personally haven't
generally had much success debugging these kinds of failure - see
whether it happens again, and if it does, try updating your
JVM/switching to another/etc.

Anyone have better advice?

Upayavira

On Mon, 04 Apr 2011 11:59 +0200, Matthieu Huin
matthieu.h...@wallix.com wrote:
 Greetings all,
 
 I am currently using solr as the backend behind a log aggregation and 
 search system my team is developing. All was well and good until I 
 noticed a test server crashing quite unexpectedly. We'd like to dig more 
 into the incident but none of us has much experience with Jetty crash 
 logs - not to mention that our Java is very rusty.
 
 The crash log is joined as an attachment.
 
 Could anyone help us with understanding what went wrong there ?
 
 Also, would it be possible and/or wise to automatically restart the 
 server in case of such a crash ?
 
 
 Thanks for your help. If you need any extra info about that case, do not 
 hesitate to ask !
 
 
 Matthieu Huin
 
 
 
 Email had 1 attachment:
 + hs_err_pid5033.log
   26k (text/x-log)
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Thank you Markus it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:
updateRequestProcessorChain name=dedupe
processor 
class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory
  bool name=enabledtrue/bool
  bool name=overwriteDupesfalse/bool
  str name=signatureFieldsignature/str
  str name=fieldsheadline,body,medianame/str
  str 
name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str
/processor
processor class=solr.LogUpdateProcessorFactory /
processor class=solr.RunUpdateProcessorFactory /
  /updateRequestProcessorChain

2. Add the request as the default update request 
3. Add a signature indexed field to my schema.

Then,
When adding a new doc to my index, it is only added of not considered a 
duplicate using a Lookup3Signature on the field defined?
All duplicates are ignored and not added to my index? 
Is it so simple as that?

Does it works even if the medianame should be an exact match (not similar match 
as the headline and bodytext are)?

Thank you for your help,


Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: segunda-feira, 4 de Abril de 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
 Hi,
 
 The ideia is don't index if something similar (headline+bodytext) for
 the same exact medianame.
 
 Do you mean I would need to index the doc first (maybe in a temp index)
 and then use the MLT feature to find similar docs before adding to final
 index?
 
 Thanks,
 Frederico
 
 
 -Original Message-
 From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
 Sent: segunda-feira, 4 de Abril de 2011 10:22
 To: solr-user@lucene.apache.org
 Subject: Re: Using MLT feature
 
 Do you want to not index if something similar? Or don't index if exact.
 I would look into a hash code of the document if you don't want to index
 exact.Similar though, I think has to be based off a document in the
 index.
 
 On Apr 4, 2011, at 5:16, Frederico Azeiteiro
 
 frederico.azeite...@cision.com wrote:
  Hi,
  
  
  
  I would like to hear your opinion about the MLT feature and if it's a
  good solution to what I need to implement.
  
  
  
  My index has fields like: headline, body and medianame.
  
  What I need to do is, before adding a new doc, verify if a similar doc
  exists for this media.
  
  
  
  My idea is to use the MorelikeThisHandler
  (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
 
 way:
  For each new doc, perform a MLT search with q= medianame and
  stream.body=headline+bodytext.
  
  If no similar docs are found than I can safely add the doc.
  
  
  
  Is this feasible using the MLT handler? Is it a good approach? Are
 
 there
 
  a better way to perform this comparison?
  
  
  
  Thank you for your help.
  
  
  
  Best regards,
  
  
  
  Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Mongo REST interface and full data import

2011-04-04 Thread Erick Erickson
I'm having trouble seeing your schema files, etc. I don't
know if gmail is stripping this on my end or whether
your e-mail is stripping it on upload, anyone else seeing this?

But to your question, what version are you using? From
Solr3.1 http://wiki.apache.org/solr/Solr3.1 is the first version with JSON
support for updates.

See: http://wiki.apache.org/solr/UpdateJSON

http://wiki.apache.org/solr/UpdateJSONBest
Erick

On Mon, Apr 4, 2011 at 5:31 AM, andrew_s sharov1...@gmail.com wrote:

 Hi everyone,

 I'm trying to make a simple data import from MongoDB into Solr using REST
 interface.

 As an test example I've created schecma.xml like:
 ?xml version=1.0 ?













  isbn


  title






 and data-import.xml as:












 Unfortunately it's not working and I'm stuck  on this place.

 Could you please advise how correctly parser JSON format data?


 Data format looks like:
 {
  offset : 0,
  rows: [
{ _id : { $oid : 4d9829412c8bd1064400 }, isbn : 716739356,
 title : Proteins, description :  } ,
{ _id : { $oid : 4d9829412c8bd1064401 }, isbn :
 144433056X, title : How to Assess Doctors and Health Professionals,
 description :  } ,
{ _id : { $oid : 4d9829412c8bd1064402 }, isbn :
 1406208159, title : Freestyle: Time Travel Guides: Pack B,
 description : Takes you on a trip through history to visit the great
 ancient civilisations. } ,
  total_rows : 3 ,
  query : {} ,
  millis : 0
 }


 Thank you.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2774479.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: Faceting on multivalued field

2011-04-04 Thread Jonathan Rochkind
Is there a kind of function query that can count number of values in a 
multi-valued field on a given document?  I do not know. 

From: Erick Erickson [erickerick...@gmail.com]
Sent: Sunday, April 03, 2011 10:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Faceting on multivalued field

Why not count them on the way in and just store that number along
with the original e-mail?

Best
Erick

On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty kaych...@gmail.comwrote:

 Ok. My expectation was since comment_post_id is a MultiValued field hence
 it would appear multiple times (i.e. for each comment). And hence when I
 would facet with that field it would also give me the count of those many
 documents where comment_post_id appears.

 My requirement is getting total for every document i.e. finding number of
 comments per post in the whole corpus. To explain it more clearly, I'm
 getting a result xml something like this

 str name=post_id46/str
 str name=post_textHello World/str
 str name=person_id20/str
 arr name=comment_id
str9/str
str10/str
 /arr
 arr name=comment_person_id
   str19/str
   str2/str
 /arr
 arr name=comment_post_id
  str46/str
  str46/str
 /arr
 arr name=comment_text
   strHello - from World/str
   strHi/str
 /arr

 lst name=facet_fields
  lst name=comment_post_id
 *int name=461/int*

 I need the count to be 2 as the post 46 has 2 comments.

  What other way can I approach?

 Thanks,
 Kaushik


 On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  Hmmm, I think you're misunderstanding faceting. It's counting the
  number of documents that have a particular value. So if you're
  faceting on comment_post_id, there is one and only one document
  with that value (assuming that the comment_post_ids are unique).
  Which is what's being reported This will be quite expensive on a
  large corpus, BTW.
 
  Is your task to show the totals for *every* document in your corpus or
  just the ones in a display page? Because if the latter, your app could
  just count up the number of elements in the XML returned for the
  multiValued comments field.
 
  If that's not relevant, could you explain a bit more why you need this
  count?
 
  Best
  Erick
 
  On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty kaych...@gmail.com
  wrote:
 
   Hi,
  
   My index contains a root entity Post and a child entity Comments.
  Each
   post can have multiple comments. data-config.xml:
  
   document
  entity name=posts transformer=TemplateTransformer
   dataSource=jdbc query=
  
  field column=post_id /
  field column=post_text/
  field column=person_id/
  entity name=comments dataSource=jdbc query=select
 *
   from comments where post_id = ${posts.post_id} 
  field column=comment_id /
  field column=comment_text /
  field column=comment_person_id /
  field column=comment_post_id /
 /entity
  /entity
   /document
  
   The schema has all columns of comment entity as MultiValued fields
  and
   all fields are indexed  stored. My requirement is to count the number
 of
   comments for each post. Approach I'm taking is to query on *:* and
   faceting the result on comment_post_id so that it gives the count of
   comment occurred for that post.
  
   But I'm getting incorrect result e.g. if a post has 2 comments, the
   multivalued fields are populated alright but the facet count is coming
 as
  1
   (for that post_id). What else do I need to do?
  
  
   Thanks,
   Kaushik
  
 



Re: Solrj performance bottleneck

2011-04-04 Thread rahul
Hi All,

I just to want to share some findings which clearly identified the reason
for our performance bottleneck. we had looked into several areas for
optimization mostly directed at Solr configurations, stored fields,
highlighting, JVM, OS cache etc. But it turned out that the main culprit
was elsewhere. We were using the terms component for auto suggestion and
while examining the firebug outputs for time taken during the searches, we
detected that multiple requests were being spawned for autosuggestion as we
typed in the keyword to search (1 request per each character typed) and this
in turn cost us great delay in getting the search results. Once we turned
auto suggestion off, the performance was remarkably better and came down to
a second or so (compared to 8-10 seconds registered earlier).

if anybody has some suggestions/experience on how to leverage autosuggestion
without affecting search performance much, please do share them.

Once again, thanks for your inputs in analyzing our issues.

Thanks,

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrj-performance-bottleneck-tp2682797p2775245.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi again,
I guess I was wrong on my early post... There's no automated way to avoid the 
indexation of the duplicate doc.

I guess I have 2 options: 

1. Create a temp index with signatures and then have an app that for each new 
doc verifies if sig exists on my primary index. 
If not, add the article.

2. Before adding the doc, create a signature (using the same algorithm that 
SOLR uses) on my indexing app and then verify if signature exists before adding.

I'm way thinking the right way here? :)

Thank you,
Frederico 
 


-Original Message-
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: segunda-feira, 4 de Abril de 2011 11:59
To: solr-user@lucene.apache.org
Subject: RE: Using MLT feature

Thank you Markus it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:
updateRequestProcessorChain name=dedupe
processor 
class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory
  bool name=enabledtrue/bool
  bool name=overwriteDupesfalse/bool
  str name=signatureFieldsignature/str
  str name=fieldsheadline,body,medianame/str
  str 
name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str
/processor
processor class=solr.LogUpdateProcessorFactory /
processor class=solr.RunUpdateProcessorFactory /
  /updateRequestProcessorChain

2. Add the request as the default update request 
3. Add a signature indexed field to my schema.

Then,
When adding a new doc to my index, it is only added of not considered a 
duplicate using a Lookup3Signature on the field defined?
All duplicates are ignored and not added to my index? 
Is it so simple as that?

Does it works even if the medianame should be an exact match (not similar match 
as the headline and bodytext are)?

Thank you for your help,


Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: segunda-feira, 4 de Abril de 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
 Hi,
 
 The ideia is don't index if something similar (headline+bodytext) for
 the same exact medianame.
 
 Do you mean I would need to index the doc first (maybe in a temp index)
 and then use the MLT feature to find similar docs before adding to final
 index?
 
 Thanks,
 Frederico
 
 
 -Original Message-
 From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
 Sent: segunda-feira, 4 de Abril de 2011 10:22
 To: solr-user@lucene.apache.org
 Subject: Re: Using MLT feature
 
 Do you want to not index if something similar? Or don't index if exact.
 I would look into a hash code of the document if you don't want to index
 exact.Similar though, I think has to be based off a document in the
 index.
 
 On Apr 4, 2011, at 5:16, Frederico Azeiteiro
 
 frederico.azeite...@cision.com wrote:
  Hi,
  
  
  
  I would like to hear your opinion about the MLT feature and if it's a
  good solution to what I need to implement.
  
  
  
  My index has fields like: headline, body and medianame.
  
  What I need to do is, before adding a new doc, verify if a similar doc
  exists for this media.
  
  
  
  My idea is to use the MorelikeThisHandler
  (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
 
 way:
  For each new doc, perform a MLT search with q= medianame and
  stream.body=headline+bodytext.
  
  If no similar docs are found than I can safely add the doc.
  
  
  
  Is this feasible using the MLT handler? Is it a good approach? Are
 
 there
 
  a better way to perform this comparison?
  
  
  
  Thank you for your help.
  
  
  
  Best regards,
  
  
  
  Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Using MLT feature

2011-04-04 Thread Markus Jelsma

 Hi again,
 I guess I was wrong on my early post... There's no automated way to avoid
 the indexation of the duplicate doc.

Yes there is, try set overwriteDupes to true and documents yielding the same 
signature will be overwritten. If you have need both fuzzy and exact matching 
then add a second update processor inside the chain and create another 
signature field.

 
 I guess I have 2 options:
 
 1. Create a temp index with signatures and then have an app that for each
 new doc verifies if sig exists on my primary index. If not, add the
 article.
 
 2. Before adding the doc, create a signature (using the same algorithm that
 SOLR uses) on my indexing app and then verify if signature exists before
 adding.
 
 I'm way thinking the right way here? :)
 
 Thank you,
 Frederico
  
 
 
 -Original Message-
 From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
 Sent: segunda-feira, 4 de Abril de 2011 11:59
 To: solr-user@lucene.apache.org
 Subject: RE: Using MLT feature
 
 Thank you Markus it looks great.
 
 But the wiki is not very detailed on this.
 Do you mean if I:
 
 1. Create:
 updateRequestProcessorChain name=dedupe
 processor
 class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory
 bool name=enabledtrue/bool
   bool name=overwriteDupesfalse/bool
   str name=signatureFieldsignature/str
   str name=fieldsheadline,body,medianame/str
   str
 name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/s
 tr /processor
 processor class=solr.LogUpdateProcessorFactory /
 processor class=solr.RunUpdateProcessorFactory /
   /updateRequestProcessorChain
 
 2. Add the request as the default update request
 3. Add a signature indexed field to my schema.
 
 Then,
 When adding a new doc to my index, it is only added of not considered a
 duplicate using a Lookup3Signature on the field defined? All duplicates
 are ignored and not added to my index?
 Is it so simple as that?
 
 Does it works even if the medianame should be an exact match (not similar
 match as the headline and bodytext are)?
 
 Thank you for your help,
 
 
 Frederico Azeiteiro
 Developer
  
 
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: segunda-feira, 4 de Abril de 2011 10:48
 To: solr-user@lucene.apache.org
 Subject: Re: Using MLT feature
 
 http://wiki.apache.org/solr/Deduplication
 
 On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
  Hi,
  
  The ideia is don't index if something similar (headline+bodytext) for
  the same exact medianame.
  
  Do you mean I would need to index the doc first (maybe in a temp index)
  and then use the MLT feature to find similar docs before adding to final
  index?
  
  Thanks,
  Frederico
  
  
  -Original Message-
  From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
  Sent: segunda-feira, 4 de Abril de 2011 10:22
  To: solr-user@lucene.apache.org
  Subject: Re: Using MLT feature
  
  Do you want to not index if something similar? Or don't index if exact.
  I would look into a hash code of the document if you don't want to index
  exact.Similar though, I think has to be based off a document in the
  index.
  
  On Apr 4, 2011, at 5:16, Frederico Azeiteiro
  
  frederico.azeite...@cision.com wrote:
   Hi,
   
   
   
   I would like to hear your opinion about the MLT feature and if it's a
   good solution to what I need to implement.
   
   
   
   My index has fields like: headline, body and medianame.
   
   What I need to do is, before adding a new doc, verify if a similar doc
   exists for this media.
   
   
   
   My idea is to use the MorelikeThisHandler
   (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
  
  way:
   For each new doc, perform a MLT search with q= medianame and
   stream.body=headline+bodytext.
   
   If no similar docs are found than I can safely add the doc.
   
   
   
   Is this feasible using the MLT handler? Is it a good approach? Are
  
  there
  
   a better way to perform this comparison?
   
   
   
   Thank you for your help.
   
   
   
   Best regards,
   
   
   
   Frederico Azeiteiro


Re: Solrj performance bottleneck

2011-04-04 Thread openvictor Open
Dear Rahul,

Stefan has the right solution. the autosuggest must be checked both from
Javascript and your backend. For javascript there are some really nice tools
to do that such as Jquery which implements a auto-suggest with a tunable
delay. It has also highlighting, you can add additional information etc...
It is actually quite impressive. Here is the address :
http://jqueryui.com/demos/autocomplete/#remote-jsonp. It's open source so
you can just copy what they have done or see the method they used.
For backend limit the number of request / second per ip or session and / or
cache result. As for cache normally solr caches the common request but I
don't know for term components.

Hope this helps you !

Victor

2011/4/4 Stefan Matheis matheis.ste...@googlemail.com

 rahul,

 On Mon, Apr 4, 2011 at 4:18 PM, rahul asharud...@gmail.com wrote:
  if anybody has some suggestions/experience on how to leverage
 autosuggestion
  without affecting search performance much, please do share them.

 we use javascript intervals for autosuggestion. regularly check the
 value of the monitored input field and if changed, trigger a new
 request. this will cover both cases, slow-typing users and also
 ten-finger-guys (which will type much faster). a new request for every
 added character is indeed too much, even if your backend is responding
 within a few ms.

 Regards
 Stefan



dismax boost query not useful?

2011-04-04 Thread Smiley, David W.
As I was reviewing the boosting capabilities of the dismax  edismax query 
parsers, it's not clear to me that the boost query has much use.  The value 
of boost functions, particularly with a multiplied boost that edismax supports, 
is very clear -- there are a variety of uses.  But I can't think of a useful 
case when I want to both *add* a component to the ultimate score, and for that 
component to be a non-function query (i.e. use the lucene query parser).

Also, you can basically get the same affect as a boost query via boost 
functions: bf=query(mybq)mybq=...  and note you will probably multiply 
this via product(10,query(mybq)) to boost it to an appropriate number.

~ David Smiley

Problems indexing very large set of documents

2011-04-04 Thread Brandon Waterloo
 Hey everybody,

I've been running into some issues indexing a very large set of documents.  
There's about 4000 PDF files, ranging in size from 160MB to 10KB.  Obviously 
this is a big task for Solr.  I have a PHP script that iterates over the 
directory and uses PHP cURL to query Solr to index the files.  For now, commit 
is set to false to speed up the indexing, and I'm assuming that Solr should be 
auto-committing as necessary.  I'm using the default solrconfig.xml file 
included in apache-solr-1.4.1\example\solr\conf.  Once all the documents have 
been finished the PHP script queries Solr to commit.

The main problem is that after a few thousand documents (around 2000 last time 
I tried), nearly every document begins causing Java exceptions in Solr:

Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.pdf.PDFParser@11d329d
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' 
secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
... 25 more

As far as I know there's nothing special about these documents so I'm wondering 
if it's not properly autocommitting.  What would be appropriate settings in 
solrconfig.xml for this particular application?  I'd like it to autocommit as 
soon as it needs to but no more often than that for the sake of efficiency.  
Obviously it takes long enough to index 4000 documents and there's no reason to 
make it take longer.  Thanks for your help!

~Brandon Waterloo


Re: Problems indexing very large set of documents

2011-04-04 Thread Anuj Kumar
This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.htmlHope
it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo 
brandon.water...@matrix.msu.edu wrote:

  Hey everybody,

 I've been running into some issues indexing a very large set of documents.
  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
  Obviously this is a big task for Solr.  I have a PHP script that iterates
 over the directory and uses PHP cURL to query Solr to index the files.  For
 now, commit is set to false to speed up the indexing, and I'm assuming that
 Solr should be auto-committing as necessary.  I'm using the default
 solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
 all the documents have been finished the PHP script queries Solr to commit.

 The main problem is that after a few thousand documents (around 2000 last
 time I tried), nearly every document begins causing Java exceptions in Solr:

 Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException:
 org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
 org.apache.tika.parser.pdf.PDFParser@11d329d
at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
 IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
at
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
at
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
... 23 more
 Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
 secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
at
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
... 25 more

 As far as I know there's nothing special about these documents so I'm
 wondering if it's not properly autocommitting.  What would be appropriate
 settings in solrconfig.xml for this particular application?  I'd like it to
 autocommit as soon as it needs to but no more often than that for the sake
 of efficiency.  Obviously it takes long enough to index 4000 documents and
 there's no reason to make it take longer.  Thanks for your help!

 ~Brandon Waterloo



RE: Problems indexing very large set of documents

2011-04-04 Thread Brandon Waterloo
Looks like I'm using Tika 0.4:
apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo


From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.htmlHope
it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo 
brandon.water...@matrix.msu.edu wrote:

  Hey everybody,

 I've been running into some issues indexing a very large set of documents.
  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
  Obviously this is a big task for Solr.  I have a PHP script that iterates
 over the directory and uses PHP cURL to query Solr to index the files.  For
 now, commit is set to false to speed up the indexing, and I'm assuming that
 Solr should be auto-committing as necessary.  I'm using the default
 solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
 all the documents have been finished the PHP script queries Solr to commit.

 The main problem is that after a few thousand documents (around 2000 last
 time I tried), nearly every document begins causing Java exceptions in Solr:

 Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException:
 org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
 org.apache.tika.parser.pdf.PDFParser@11d329d
at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
 IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
at
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
at
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
... 23 more
 Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
 secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
at
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
... 25 more

 As far as I know there's nothing special about these documents so I'm
 wondering if it's not properly 

Re: Problems indexing very large set of documents

2011-04-04 Thread Anuj Kumar
In the log messages are you able to locate the file at which it fails? Looks
like TIKA is unable to parse one of your PDF files for the details. We need
to hunt that one out.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo 
brandon.water...@matrix.msu.edu wrote:

 Looks like I'm using Tika 0.4:
 apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
 .../tika-parsers-0.4.jar

 ~Brandon Waterloo

 
 From: Anuj Kumar [anujs...@gmail.com]
 Sent: Monday, April 04, 2011 2:12 PM
 To: solr-user@lucene.apache.org
 Cc: Brandon Waterloo
 Subject: Re: Problems indexing very large set of documents

 This is related to Apache TIKA. Which version are you using?
 Please see this thread for more details-
 http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

 http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
 Hope
 it helps.

 Regards,
 Anuj

 On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo 
 brandon.water...@matrix.msu.edu wrote:

   Hey everybody,
 
  I've been running into some issues indexing a very large set of
 documents.
   There's about 4000 PDF files, ranging in size from 160MB to 10KB.
   Obviously this is a big task for Solr.  I have a PHP script that
 iterates
  over the directory and uses PHP cURL to query Solr to index the files.
  For
  now, commit is set to false to speed up the indexing, and I'm assuming
 that
  Solr should be auto-committing as necessary.  I'm using the default
  solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.
  Once
  all the documents have been finished the PHP script queries Solr to
 commit.
 
  The main problem is that after a few thousand documents (around 2000 last
  time I tried), nearly every document begins causing Java exceptions in
 Solr:
 
  Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.solr.common.SolrException:
  org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
 from
  org.apache.tika.parser.pdf.PDFParser@11d329d
 at
 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
 at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at
 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at
 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
 at
  org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
 at
 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at
  org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at
  org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
 at
  org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
 at
 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
 at
 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at
  org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
 at org.mortbay.jetty.Server.handle(Server.java:285)
 at
  org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
 at
 
 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
 at
 org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
 at
 org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
 at
 
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
 at
 
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
  Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
  IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
 at
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
 at
  org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
 at
 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
 ... 23 more
  Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
  secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
 at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
 at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
 at 

Re: Matching the beginning of a word within a term

2011-04-04 Thread Brian Lamb
Thank you both for your replies. It looks like EdgeNGramFilter will do the
job nicely. Time to reindex...again.

On Fri, Apr 1, 2011 at 8:31 AM, Jan Høydahl jan@cominvent.com wrote:

 Check out
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
 Don't know if it works with phrases though

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 31. mars 2011, at 16.49, Brian Lamb wrote:

  No, I don't really want to break down the words into subwords. In the
  example I provided, I would not want kind to match either record
 because
  it is not at the beginning of the word even though kind appears in both
  records as part of a word.
 
  On Wed, Mar 30, 2011 at 4:42 PM, lboutros boutr...@gmail.com wrote:
 
  Do you want to tokenize subwords based on dictionaries ? A bit like
  disagglutination of german words ?
 
  If so, something like this could help :
 DictionaryCompoundWordTokenFilter
 
  http://search.lucidimagination.com/search/document/CDRG_ch05_5.8.8
 
  Ludovic
 
 
 
 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
 
  2011/3/30 Brian Lamb [via Lucene] 
  ml-node+2754668-300063934-383...@n3.nabble.com
 
  Hi all,
 
  I have a field set up like this:
 
  field name=common_names multiValued=true type=text
 indexed=true
  stored=true required=false /
 
  And I have some records:
 
  RECORD1
  arr name=common_names
  strcompanion to mankind/str
  strpooch/str
  /arr
 
  RECORD2
  arr name=common_names
  strcompanion to womankind/str
  strman's worst enemy/str
  /arr
 
  I would like to write a query that will match the beginning of a word
  within
  the term. Here is the query I would use as it exists now:
 
 
 
 http://localhost:8983/solr/search/?q=*:*fq={!q.op=AND%20df=common_names}
  companion
 
  man~10
 
  In the above example. I would want to return only RECORD1.
 
  The query as it exists right now is designed to only match records
 where
  both words are present in the same term. So if I changed man to mankind
  in
  the query, RECORD1 will be returned.
 
  Even though the phrases companion and man exist in the same term in
  RECORD2,
  I do not want RECORD2 to be returned because 'man' is not at the
  beginning
  of the word.
 
  How can I achieve this?
 
  Thanks,
 
  Brian Lamb
 
 
  --
  If you reply to this email, your message will be added to the
 discussion
  below:
 
 
 
 http://lucene.472066.n3.nabble.com/Matching-the-beginning-of-a-word-within-a-term-tp2754668p2754668.html
  To start a new topic under Solr - User, email
  ml-node+472068-1765922688-383...@n3.nabble.com
  To unsubscribe from Solr - User, click here
 
 http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=472068code=Ym91dHJvc2xAZ21haWwuY29tfDQ3MjA2OHw0Mzk2MDUxNjE=
  .
 
 
 
 
  -
  Jouve
  France.
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Matching-the-beginning-of-a-word-within-a-term-tp2754668p2755561.html
  Sent from the Solr - User mailing list archive at Nabble.com.




Re: Matching on a multi valued field

2011-04-04 Thread Brian Lamb
I just noticed Juan's response and I find that I am encountering that very
issue in a few cases. Boosting is a good way to put the more relevant
results to the top but it is possible to only have the correct results
returned?

On Wed, Mar 30, 2011 at 11:51 AM, Brian Lamb
brian.l...@journalexperts.comwrote:

 Thank you all for your responses. The field had already been set up with
 positionIncrementGap=100 so I just needed to add in the slop.


 On Tue, Mar 29, 2011 at 6:32 PM, Juan Pablo Mora jua...@informa.eswrote:

  A multiValued field
  is actually a single field with all data separated with
 positionIncrement.
  Try setting that value high enough and use a PhraseQuery.


 That is true but you cannot do things like:

 q=bar* foo*~10 with default query search.

 and if you use dismax you will have the same problems with multivalued
 fields. Imagine the situation:

 Doc1:
field A: [foo bar,dooh] 2 values

 Doc2:
field A: [bar dooh, whatever] Another 2 values

 the query:
qt=dismax  qf= fieldA  q = ( bar dooh )

 will return both Doc1 and Doc2. The only thing you can do in this
 situation is boost phrase query in Doc2 with parameter pf in order to get
 Doc2 in the first position of the results:

 pf = fieldA^1


 Thanks,
 JP.


 El 29/03/2011, a las 23:14, Markus Jelsma escribió:

  orly, all replies came in while sending =)
 
  Hi,
 
  Your filter query is looking for a match of man's friend in a single
  field. Regardless of analysis of the common_names field, all terms are
  present in the common_names field of both documents. A multiValued
 field
  is actually a single field with all data separated with
 positionIncrement.
  Try setting that value high enough and use a PhraseQuery.
 
  That should work
 
  Cheers,
 
  Hi all,
 
  I have a field set up like this:
 
  field name=common_names multiValued=true type=text
 indexed=true
  stored=true required=false /
 
  And I have some records:
 
  RECORD1
  arr name=common_names
 
   strman's best friend/str
   strpooch/str
 
  /arr
 
  RECORD2
  arr name=common_names
 
   strman's worst enemy/str
   strfriend to no one/str
 
  /arr
 
  Now if I do a search such as:
  http://localhost:8983/solr/search/?q=*:*fq={!q.op=AND
  df=common_names}man's friend
 
  Both records are returned. However, I only want RECORD1 returned. I
  understand why RECORD2 is returned but how can I structure my query so
  that only RECORD1 is returned?
 
  Thanks,
 
  Brian Lamb





Re: Matching on a multi valued field

2011-04-04 Thread Juan Pablo Mora
I have not find any solution to this. The only thing is to denormalize your 
multivalue field into several docs with a single value field.

Try ComplexPhraseQueryParser (https://issues.apache.org/jira/browse/SOLR-1604) 
if you are using solr 1.4 version.


El 04/04/2011, a las 21:21, Brian Lamb escribió:

I just noticed Juan's response and I find that I am encountering that very 
issue in a few cases. Boosting is a good way to put the more relevant results 
to the top but it is possible to only have the correct results returned?

On Wed, Mar 30, 2011 at 11:51 AM, Brian Lamb 
brian.l...@journalexperts.commailto:brian.l...@journalexperts.com wrote:
Thank you all for your responses. The field had already been set up with 
positionIncrementGap=100 so I just needed to add in the slop.


On Tue, Mar 29, 2011 at 6:32 PM, Juan Pablo Mora 
jua...@informa.esmailto:jua...@informa.es wrote:
 A multiValued field
 is actually a single field with all data separated with positionIncrement.
 Try setting that value high enough and use a PhraseQuery.


That is true but you cannot do things like:

q=bar* foo*~10 with default query search.

and if you use dismax you will have the same problems with multivalued fields. 
Imagine the situation:

Doc1:
   field A: [foo bar,dooh] 2 values

Doc2:
   field A: [bar dooh, whatever] Another 2 values

the query:
   qt=dismax  qf= fieldA  q = ( bar dooh )

will return both Doc1 and Doc2. The only thing you can do in this situation is 
boost phrase query in Doc2 with parameter pf in order to get Doc2 in the first 
position of the results:

pf = fieldA^1


Thanks,
JP.


El 29/03/2011, a las 23:14, Markus Jelsma escribió:

 orly, all replies came in while sending =)

 Hi,

 Your filter query is looking for a match of man's friend in a single
 field. Regardless of analysis of the common_names field, all terms are
 present in the common_names field of both documents. A multiValued field
 is actually a single field with all data separated with positionIncrement.
 Try setting that value high enough and use a PhraseQuery.

 That should work

 Cheers,

 Hi all,

 I have a field set up like this:

 field name=common_names multiValued=true type=text indexed=true
 stored=true required=false /

 And I have some records:

 RECORD1
 arr name=common_names

  strman's best friend/str
  strpooch/str

 /arr

 RECORD2
 arr name=common_names

  strman's worst enemy/str
  strfriend to no one/str

 /arr

 Now if I do a search such as:
 http://localhost:8983/solr/search/?q=*:*fq={!q.op=ANDhttp://localhost:8983/solr/search/?q=*:*fq=%7B!q.op=AND
 df=common_names}man's friend

 Both records are returned. However, I only want RECORD1 returned. I
 understand why RECORD2 is returned but how can I structure my query so
 that only RECORD1 is returned?

 Thanks,

 Brian Lamb






RE: Using the Data Import Handler with SQLite

2011-04-04 Thread Zac Smith
I was able to resolve this issue by using a different jdbc driver: 
http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC


-Original Message-
From: Zac Smith [mailto:z...@trinkit.com] 
Sent: Friday, April 01, 2011 5:56 PM
To: solr-user@lucene.apache.org
Subject: Using the Data Import Handler with SQLite

I hope this question is being directed to the right place ...

I am trying to use SQLite (v3) as a source for the Data Import Handler. I am 
using a sqllite jdbc driver (link below) and this works when using with only 
one entity. As soon as I add a sub-entity it falls over with a locked DB error: 
java.sql.SQLException: database is locked.
Now I realize that you can only have one connection open to SQLite at a time. 
So I assume that the first query is leaving a connection open before it moves 
onto the sub-query. I am not sure if the issue would be in the jdbc driver or 
the DIH. It works fine with SQL Server.

Is this a bug? Or something that just isn't possible with SQLite?

Here is a sample of my data config file:
dataConfig
  dataSource type=JdbcDataSource 
  driver=org.sqlite.JDBC
  url=jdbc:sqlite:SolrImportTest.db /
  document
entity name=locations
pk=id
query=select * from locations
field column=Id name=Id /
field column=Name name=Name / 
field column=RegionId name=RegionId /
entity name=regions
pk=id
query=select * from regions where id = 
'${locations.RegionId}'
field column=Name name=RegionName /
/entity
/entity
  /document
/dataConfig

sqllite jdbc driver : http://www.zentus.com/sqlitejdbc/


Re: does overwrite=false work with json

2011-04-04 Thread David Murphy
I tried it with the example json documents, and even if I add overwrite=false 
to the URL, it still overwrites.

Do this twice:
curl 'http://localhost:8983/solr/update/json?commit=trueoverwrite=false' 
--data-binary @books.json -H 'Content-type:application/json'

Then do this query:
curl 'http://localhost:8983/solr/select?q=title:monsterswt=jsonindent=true'

--Dave


Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-04 Thread eks dev
Thanks Hoss,

Externanlizing this part is exactly the path we are exploring now, not
only for this reason.

We already started testing Hadoop SequenceFile for write ahead log for
updates/deletes.
SequenceFile supports append now (simply great!). It was a a pain to
have to add hadoop into mix  for mortal collection
sizes 200 Mio, but on the other side, having hadoop around  offers
huge flexibility.
Write ahead log catches update commands (all solr slaves, fronting
clients accept updates but only to forward them to WAL). Solr master
is trying to catch up with update stream indexing in async fashion,
and finally solr slaves are chasing master index with standard solr
replication.
Overnight we run simple map reduce jobs to consolidate, normalize and
sort update stream and reindex at the end.
Deduplication and collection sorting is for us only an optimization,
if done reasonably offten, like  once per day/week, but if we do not
do it, it doubles HW resorces.

Imo, native WAL support on solr would be definitly one nice nice to
have (for HA, update scalability...). Charming with WAL  is that
updates never wait/disapear, if too much traffic, we only have
slightly higher update latency, but updates get definitley processed.
Some basic primitives on WAL (consolidation, replaying update stream
on solr etc...)  should be supported in this case, sort of smallish
hadoop features subset for solr clusters, but nothing oversized.

Cheers,
eks









On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Is it possible in solr to have multivalued id? Or I need to make my
 : own mv_ID for this? Any ideas how to achieve this efficiently?

 This isn't something the SignatureUpdateProcessor is going to be able to
 hel pyou with -- it does the deduplication be changing hte low level
 update (implemented as a delete then add) so that the key used to delete
 the older documents is based on the signature field instead of the id
 field.

 in order to do what you are describing, you would need to query the index
 for matching signatures, then add the resulting ids to your document
 before doing that update

 You could posibly do this in a custom UpdateProcessor, but you'd have to
 do something tricky to ensure you didn't overlook docs that had been addd
 but not yet committed when checking for dups.

 I don't have a good suggestion for how to do this internally in Slr -- it
 seems like the type of bulk processing logic that would be better suited
 for an external process before you ever start indexing (much like link
 analysis for back refrences)

 -Hoss



Re: Mongo REST interface and full data import

2011-04-04 Thread andrew_s
Sorry for mistake with Solr version ... I'm using Solr 3.1

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2777319.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Matching on a multi valued field

2011-04-04 Thread Jonathan Rochkind

On 4/4/2011 3:21 PM, Brian Lamb wrote:

I just noticed Juan's response and I find that I am encountering that very
issue in a few cases. Boosting is a good way to put the more relevant
results to the top but it is possible to only have the correct results
returned?


Only what's already been said in the thread.  You can simulate a 
non-phrase non-wildcard search, forced to match all within the same 
value of a multi-valued, by using phrase queries with slop.  And it will 
only return hits that have all terms within the same value -- it's not a 
boosting solution.


But if you need wildcards, or you need to find an actual phrase in the 
same value as additional term(s) or phrase(s), no, you are out of luck 
in Solr.


That is, exactly what Juan said, he already said exactly this.

If someone can think of a clever way to write some Java to do this in a 
new query component, that would be useful.  I am not entirely sure how 
possible that is.  I guess you'd have to make sure that ALL matching 
tokens or phrases are within the positionIncrementGap of each other, not 
sure how feasible that is, I'm not too familiar with Solr/Lucene 
source.   But at any rate, there's no way to do it out of the box with 
Solr, no.




Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-04 Thread Jens Mueller
Hello Experts,



I am a Solr newbie but read quite a lot of docs. I still do not understand
what would be the best way to setup very large scale deployments:



Goal (threoretical):

 A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)

 B) Queries: 10 Queries/ per Second

 C) Updates: 10 Updates / per Second




Solr offers:

1.)Replication = Scales Well for B)  BUT  A) and C) are not satisfied


2.)Sharding = Scales well for A) BUT B) and C) are not satisfied (= As
I understand the Sharding approach all goes through a central server, that
dispatches the updates and assembles the quries retrieved from the different
shards. But this central server has also some capacity limits...)




What is the right approach to handle such large deployments? I would be
thankfull for just a rough sketch of the concepts so I can experiment/search
further…


Maybe I am missing something very trivial as I think some of the “Solr
Users/Use Cases” on the homepage are that kind of large deployments. How are
they implemented?



Thanky very much!!!

Jens