Re: Problem with suggest search

2010-03-16 Thread David Rühr

Thank you.

This works well as a workaround. Yesterday I got the tip to look for a broken
solrconfig.xml, and that was right.

When uploading our files, the solrconfig.xml was LOST ;-)

Is it possible to start Java in debug mode to get more information?

David

On 16.03.2010 02:02, Tom Hill wrote:

You need a query string with the standard request handler. (dismax has
q.alt)

Try q=*:*, if you are trying to get facets for all documents.

And yes, a friendlier error message would be a good thing.

Tom

On Mon, Mar 15, 2010 at 9:03 AM, David Rühr d...@marketing-factory.de wrote:

Hi List.

We have two Servers dev and live.
Dev is not our problem, but on live we see the following error with the
facet.prefix parameter - when there is no q param - for the suggest search:

HTTP Status 500 - null java.lang.NullPointerException at
java.io.StringReader.&lt;init&gt;(StringReader.java:54) at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197) at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78) at
org.apache.solr.search.QParser.getQuery(QParser.java:137) at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:85)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
at java.lang.Thread.run(Thread.java:811)

The query looks like:
facet=on&facet.mincount=1&facet.limit=10&json.nl=map&wt=json&rows=0&version=1.2&omitHeader=true&fl=content&start=0&q=&facet.prefix=mate&facet.field=content&fq=group:0+OR+group:-2+OR+group:1+OR+group:11+-group:-1&fq=language:0
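
Per Tom's reply above, the fix amounts to sending a match-all query when no user
query is wanted; only the q parameter changes, e.g.:

  q=*:*&facet.prefix=mate&facet.field=content

with the remaining parameters as in the request above.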

When we add the q param, e.g. q=material, there is no error.
Has anyone seen the same error, or can anyone help?

Thanks to all.
David



Kind regards,

David Rühr
PHP Programmierer

--
Marketing Factory Consulting GmbH*   mailto:d...@marketing-factory.de
Stephanienstraße 36  *  Tel.: +49 211-361176-58
D-40211 Düsseldorf, Germany  *  Fax:  +49 211-361176-99
Amtsgericht Düsseldorf HRB 53971  *  http://www.marketing-factory.de/

Geschäftsführer: Peter Faisst  |  Katja Faisst  |  Karoline Steinfatt  |
Christoph Allefeld  |  Markus M. Kimmel



Re: AutoSuggest

2010-03-16 Thread Suram



Shalin Shekhar Mangar wrote:
 
 On Sat, Mar 13, 2010 at 9:30 AM, Suram reactive...@yahoo.com wrote:
 

 Erick Erickson wrote:
 
  Did you commit your changes?
 
  Erick
 
  On Fri, Mar 12, 2010 at 7:38 AM, Suram reactive...@yahoo.com wrote:
 
 
  I can set my index fields for auto-suggestion, but sometimes a newly indexed
  field is not found for auto-suggestion and index search.
  --
  View this message in context:
  http://old.nabble.com/AutoSuggest-tp27874542p27874542.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 

 Yes, obviously I committed the changes, but it won't suggest.


 How are you trying to do the auto-suggest? Paste your field's and type's
 schema definition as well as the Solr URL you are hitting.
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

Hi shalin,

 here attached my schema.xml 
http://old.nabble.com/file/p27916777/schema.xml schema.xml 

The query I am hitting is
http://localhost:8080/solr/core0/terms?terms.fl=name&terms.prefix=b&omitHeader=true
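
For reference, the /terms URL above assumes a TermsComponent handler is
registered in solrconfig.xml; a minimal Solr 1.4-style sketch (names chosen to
match the URL, adjust to your config):

  <searchComponent name="terms" class="solr.TermsComponent"/>

  <requestHandler name="/terms" class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>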

-- 
View this message in context: 
http://old.nabble.com/AutoSuggest-tp27874542p27916777.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to get Term Positions?

2010-03-16 Thread Grant Ingersoll
If you're going to spend time mucking w/ TermPositions, you should just spend
your time working with SpanQuery, as that is what I understand you to be asking
about.  AIUI, you want to be able to get at the positions in the document where
the query matched.  This is exactly what a SpanQuery and its derivatives do.
They do all the work that you would otherwise have to do yourself with the
TermPositions class.
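
As a rough illustration of the SpanQuery route, a minimal Lucene 3.0-style
sketch (the field and term are hypothetical; assumes an already-open
IndexReader):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanTermQuery;
  import org.apache.lucene.search.spans.Spans;

  public class SpanPositions {
      // Prints every position at which "solr" occurs in the "content" field.
      public static void dump(IndexReader reader) throws IOException {
          SpanTermQuery query = new SpanTermQuery(new Term("content", "solr"));
          Spans spans = query.getSpans(reader);
          while (spans.next()) {
              System.out.println("doc=" + spans.doc()
                  + " start=" + spans.start() + " end=" + spans.end());
          }
      }
  }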


On Mar 12, 2010, at 6:38 PM, MitchK wrote:

 
 Thank you both for your responses.
 
 However, I am not familiar enough with Solr, and even less with Lucene. So, at
 the moment, I have no real idea of what payloads are (I can't even translate
 this word...).
 The manual says something about metadata - but there is nothing said about
 what metadata they mean.
 I think that - given my limited experience with Lucene and Solr - it
 would be a better idea to first read some material like Lucene in Action
 before trying to customize (or contribute to) Lucene/Solr at such a level.
 
 Are the tickets currently being worked on? It seems like there was no more time
 to do so.
 
 Last but not least: I want to add something productive to my question:
 The paper that maybe describes the solution for my problem... 
 
 http://lucene.apache.org/java/3_0_1/fileformats.html#Positions
 
 To quote:
 PositionDelta is, if payloads are disabled for the term's field, the
 difference between the position of the current occurrence in the document
 and the previous occurrence (or zero, if this is the first occurrence in
 this document). 
 
 If I could retrieve the given information, this would be great - even if it
 forces me to iterate over the document where the term occurs. Lucene's
 TermPositions class seems to be a good place to start, doesn't it? What do
 you think? [1]
 
 Integrating some Lucene-based work into Solr is another question... I think one
 needs a map showing which class is usually called by
 which class, but that is really another topic :).
 
 [1]
 http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/store/instantiated/InstantiatedTermPositions.html
 
 Thank you!
 - Mitch
 -- 
 View this message in context: 
 http://old.nabble.com/How-to-get-Term-Positions--tp27880551p27884130.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Spatial search in Solr 1.5

2010-03-16 Thread Grant Ingersoll

On Mar 15, 2010, at 11:36 AM, Jean-Sebastien Vachon wrote:

 Hi All,
 
 I'm trying to figure out how to perform spatial searches using Solr 1.5 (from 
 the trunk).
 
 Is the support for spatial search built-in?

Almost.  The main thing missing right now is filtering.  There are still ways to
do spatial filtering, but it isn't complete yet.  In the meantime, range queries
and/or frange might help.
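
For example, a filter along these lines (borrowing the store field and
coordinates from the query below; the bounds are hypothetical) would restrict
results to documents within a distance range:

  fq={!frange l=0 u=5}dist(2, store, vector(34.0232,-81.0664))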

 because none of the patches I tried could be applied to the source tree.
 If this is the case, can someone tell me how to configure it?

http://wiki.apache.org/solr/SpatialSearch has most of the docs, but they aren't 
complete yet.

Here's what I would do:
Check out latest Solr
Build the example: ant clean example
Start the example: cd example; java -jar start.jar
Rebuild the index: cd exampledocs; java -jar post.jar *.xml
Run a query:  http://localhost:8983/solr/select/?q=_val_:recip(dist(2,store,vector(34.0232,-81.0664)),1,1,0)&fl=*,score
// Note, I just updated this; it used to be point instead of vector and that
was wrong.

Next, have a look at the docs in exampledocs and specifically the store field, 
which contains the location.  Then go check out the schema for that field.

HTH,
Grant

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-03-16 Thread Demian Katz
This is my first post on this list -- apologies if this has been discussed 
before; I didn't come upon anything exactly equivalent in searching the 
archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that 
searches for hyphenated terms are failing in strange ways.  I strongly suspect 
it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm 
not exactly sure what.

The problem is that I have a record with the title "Love customs in
eighteenth-century Spain".  Depending on how I search for this, I get successes
or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration 
tool, just to eliminate any VuFind-related factors from the equation while 
debugging.

Queries that work:
title:(Love customs in eighteenth century Spain)        // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain")      // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain)        // hyphen, no phrases
title:("Love customs in eighteenth century Spain")      // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain)      // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain)      // hyphenated word as phrase, hyphen removed

Here is VuFind's text field type definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>

I did notice that the text field type in VuFind's schema has
catenateWords and catenateNumbers turned on in both the index and query
analyzer chains.  It is my understanding that these options should be disabled
for the query chain and only enabled for the index chain.  However, this may be
a red herring -- I have already tried changing this setting, and it didn't
change the success/failure pattern described above.  I have also played with
the preserveOriginal setting without apparent effect.
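
For reference, the commonly recommended query-side variant simply turns the
catenate options off (a sketch of just that filter line, keeping the rest of the
chain as above; as noted, changing this alone did not alter the pattern here):

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="0" catenateNumbers="0"
          catenateAll="0" splitOnCaseChange="1"/>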

From playing with the Field Analysis tool, I notice that there is a gap in the 
term position sequence after analysis...  but I'm not sure if this is 
significant.

Has anybody else run into this sort of problem?  Any ideas on a fix?

thanks,
Demian



DIH request parameters

2010-03-16 Thread Lukas Kahwe Smith
Hi,

According to the wiki it's possible to pass parameters to the DIH:
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

I assume they are just being replaced via simple string replacement, which is
exactly what I need. Can they also be used in all places, even in attributes (for
example, to pass in the password)?

Furthermore, is there some way to define default values for these request
parameters in case no value is passed in?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





SQL and $deleteDocById

2010-03-16 Thread Lukas Kahwe Smith
Hi,

I am trying to use $deleteDocById to delete rows based on an SQL query in my
db-data-config.xml. The following tag is a top-level tag inside the document tag.

<entity name="company_del" query="SELECT e.id AS `$deleteDocById` FROM deletedentity AS e"/>

However, it seems like it's only fetching the rows; it's not actually issuing any
index deletes.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: PDF extraction leads to reversed words

2010-03-16 Thread Abdelhamid ABID
Hi again,
I just came back from trying version 1.5-dev from the Solr trunk.
After applying the patch you provided and adding icu4j-3_8_1 to the classpath,
the results are much better than before.
Words and text are no longer reversed and are displayed correctly, except for
some PDF files whose text parts Solr displays in a strange
manner, especially when Arabic and Latin occur in the same paragraph. I'll
check into this again.



On Tue, Mar 9, 2010 at 4:13 PM, Robert Muir rcm...@gmail.com wrote:

 On Tue, Mar 9, 2010 at 10:10 AM, Abdelhamid  ABID aeh.a...@gmail.com
 wrote:
  neither does the 3.8 version change anything!
 

 the patch (https://issues.apache.org/jira/browse/SOLR-1813) can only
 work on Solr trunk. It will not work with Solr 1.4.


 Solr 1.4 uses pdfbox-0.7.3.jar, which does not support Arabic.
 Solr trunk uses pdfbox-0.8.0-incubating.jar, which does support
 Arabic, if you also put ICU in the classpath.

 --
 Robert Muir
 rcm...@gmail.com




-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Switching data dir on the fly

2010-03-16 Thread schmax

I generate a Solr index on a Hadoop cluster and I want to copy it from HDFS to
a server running Solr.

I want to copy the index to a different disk than the one the Solr
instance is using, then tell the Solr server to switch from the current data
dir to the location of the copied Hadoop-generated index (without
any search service interruption).

Is this possible? Does anyone have a better solution?

Thanks
-- 
View this message in context: 
http://old.nabble.com/Switching-data-dir-on-the-fly-tp27920425p27920425.html
Sent from the Solr - User mailing list archive at Nabble.com.



Stemming suggestions

2010-03-16 Thread blargy

Most of our documents will be in English, but not all, and we are certainly in
the process of acquiring more international content. Does anyone have any
experience using the different stemmers on languages of unknown
origin? Which ones perform best? Which give the most relevant results? What
are the main advantages of each one? I've heard that the KStemmer is a
less aggressive stemmer and is supposed to perform quite well - will it
work for multiple languages?

Any suggestions would be appreciated. Thanks
 
-- 
View this message in context: 
http://old.nabble.com/Stemming-suggestions-tp27920788p27920788.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: LucidWorks Solr

2010-03-16 Thread Kevin Osborn
I used it mostly for KStemmer, but I also liked the fact that it included about
a dozen or so stable patches since Solr 1.4 was released. We just use the
included WAR in our project, however. We don't use the installer or anything
like that.






From: blargy zman...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 11:52:17 AM
Subject: LucidWorks Solr


Has anyone used this?:
http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr

Other than the KStemmer and installer what are the other enhancements that
this download offers? Is it worth using over the default Solr installation?

Thanks

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LucidWorks Solr

2010-03-16 Thread AJ Chen
I'm trying it out right now. I hope it will work well out of the box for
indexing/searching a set of documents with frequent updates.
-aj

On Tue, Mar 16, 2010 at 11:52 AM, blargy zman...@hotmail.com wrote:


 Has anyone used this?:
 http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr

 Other than the KStemmer and installer what are the other enhancements
 that
 this download offers? Is it worth using over the default Solr installation?

 Thanks

 --
 View this message in context:
 http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
650-283-4091
*Building social media monitoring pipeline, and connecting social customers
to CRM*


Re: Stemming suggestions

2010-03-16 Thread Erick Erickson
If you search the mail archive, you'll find many discussions of
multilingual indexing/searching that'll provide you a plethora
of information.

But the synopsis, as I remember it, is that using a single stemmer for
multiple languages is generally a bad idea.

Best
Erick

On Tue, Mar 16, 2010 at 12:19 PM, blargy zman...@hotmail.com wrote:


 Most of our documents will be in English but not all and we are certain in
 the process of acquiring more international content. Does anyone have any
 experience using all of the different stemmers for languages of unknown
 origin? Which ones perform the best? Give the most relevant results? What
 are the main advantages of each one? I've heard that the KStemmer is a
 less-aggressive stemmer and it is supposed to perform quite well will it
 work for multi-languages?

 Any suggestions would be appreciated. Thanks

 --
 View this message in context:
 http://old.nabble.com/Stemming-suggestions-tp27920788p27920788.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
I am working on an application that currently hits a database containing
millions of very large documents. I use Oracle Text Search at the moment, and
things work fine. However, there is a request for faceting capability, and Solr
seems like a technology I should look at. Suffice to say I am new to Solr, but
at the moment I see two approaches, each with drawbacks:


1)  Have Solr index document metadata (id, subject, date). Then use Oracle
Text to do a content search based on criteria. Finally, query the Solr index
for all documents whose ids match the set of ids returned by Oracle Text.
That strikes me as an unmanageable Boolean query (e.g.
id:4 OR id:33432323 OR ...).

2)  Remove Oracle Text from the equation and use Solr to query document
content based on search criteria. The indexing process, though, will almost
certainly encounter an OutOfMemoryError given the number and size of documents.



I am using the embedded server and Solr Java APIs to do the indexing and 
querying.



I would welcome your thoughts on the best way to approach this situation. 
Please let me know if I should provide additional information.



Thanks.


Re: LucidWorks Solr

2010-03-16 Thread blargy

Kevin,

When you say you just included the WAR, you mean /packs/solr.war, correct?
I see that the KStemmer is nicely packed in there, but I don't see LucidGaze
anywhere. Have you had any experience using it?

So I'm guessing you would suggest using the LucidWorks solr.war over the
apache-solr war just because of the various bug fixes/tests.

As a side question: is there a reason you chose the LucidKStemmer over any
other stemmers (KStemmer, Porter, etc.)? I'm unsure of which stemmer would
work best. Thanks again!


Kevin Osborn-2 wrote:
 
 I used it mostly for KStemmer, but I also liked the fact that it included
 about a dozen or so stable patches since Solr 1.4 was released. We just
 use the included WAR in our project however. We don't use the installer or
 anything like that.
 
 
 
 
 
 
 From: blargy zman...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, March 16, 2010 11:52:17 AM
 Subject: LucidWorks Solr
 
 
 Has anyone used this?:
 http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
 
 Other than the KStemmer and installer what are the other enhancements
 that
 this download offers? Is it worth using over the default Solr
 installation?
 
 Thanks
 
 -- 
 View this message in context:
 http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
   
 

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27923359.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Erick Erickson
Why do you think you'd hit OOM errors? How big is very large? I've
indexed, as a single document, a 26-volume encyclopedia of civil war
records.

Although as much as I like the technology, if I could get away without using
two technologies, I would. Are you completely sure you can't get what you
want with clever Oracle querying?

Best
Erick

On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
nchaudh...@potomacfusion.com wrote:

 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:


 1)  Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
 id:4 OR id:33432323 OR ...).

 2)  Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.



 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.



 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.



 Thanks.



XML data in solr field

2010-03-16 Thread Nair, Manas
Hello Experts,
 
I need help with this issue of mine. I am unsure if this scenario is possible.
I have a field in my Solr document named inputxml, the value of which is an
XML string as below. This XML structure is within the inputxml field value. I
need help with searching this XML structure, i.e. if I search for Venue, I
should get Radio City Music Hall as the result and not the complete tag like
<Venue value="Radio City Music Hall" />. Is this supported in Solr? If it is,
how can this be implemented?
 
<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result; instead I
need the tag value.
 
Thanks in advance,
Manas Nair


Re: LucidWorks Solr

2010-03-16 Thread Kevin Osborn
For my purposes, the Porter analyzer was overly aggressive with stemming, so
we then moved to KStem. It looks like KStem is no longer being maintained, and
Lucid claimed much better performance with theirs, so I gave that a try and it
seems to be working fine. I didn't do any benchmarks, though.

And I just took the WAR in LucidWorks\dist. I think in the install
instructions there was also a script to apply to the included source code; I
did that as well since I look at the source regularly.

I didn't look at LucidGaze or any of the other Lucid features.

-Kevin





From: blargy zman...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 12:31:09 PM
Subject: Re: LucidWorks Solr


Kevin,

When you say you just included the war you mean the /packs/solr.war correct?
I see that the KStemmer is nicely packed in there but I don't see LucidGaze
anywhere. Have you had any experience using this? 

So I'm guessing you would suggest using the LucidWorks solr.war over the
apache-solr-war just because of the various bug-fixes/tests. 

As a side question. Is there a reason you choose the LucidKStemmer over any
other stemmers (KStemmer, Porter, etc)? I'm unsure of which stemmer would
work best. Thanks again!


Kevin Osborn-2 wrote:
 
 I used it mostly for KStemmer, but I also liked the fact that it included
 about a dozen or so stable patches since Solr 1.4 was released. We just
 use the included WAR in our project however. We don't use the installer or
 anything like that.
 
 
 
 
 
 
 From: blargy zman...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, March 16, 2010 11:52:17 AM
 Subject: LucidWorks Solr
 
 
 Has anyone used this?:
 http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
 
 Other than the KStemmer and installer what are the other enhancements
 that
 this download offers? Is it worth using over the default Solr
 installation?
 
 Thanks
 
 -- 
 View this message in context:
 http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
  
 

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27923359.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Glen Newton
I've also indexed a concatenation of 50k journal articles (making a
single document of several hundred MB of text) and it did not give me
an OOM.

-glen


On 16 March 2010 15:57, Erick Erickson erickerick...@gmail.com wrote:
 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..

 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?

 Best
 Erick

 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:

 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:


 1)      Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).

 2)      Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.



 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.



 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.



 Thanks.





-- 

-


PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
I've been trying to bulk index about 11 million PDFs, and while profiling our 
Solr instance, I noticed that all of the threads that are processing indexing 
requests are constantly blocking each other during this call:

http-8080-Processor39 [BLOCKED] CPU time: 9:35
java.util.Collections$SynchronizedMap.get(Object)
org.pdfbox.pdmodel.font.PDFont.getAFM()
org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
org.pdfbox.util.PDFStreamEngine.showString(byte[])
org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
org.pdfbox.util.PDFTextStripper.processPages(List)
org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
org.pdfbox.util.PDFTextStripper.getText(PDDocument)
org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
Metadata)
org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
 SolrQueryResponse, ContentStream)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
 Object[])
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, 
Object[])
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
java.lang.Thread.run()

Has anyone run into this before? Any ideas on how to reduce the contention?

Thanks,
Gio.


Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Smiley, David W.
If you do stay with Oracle, please report back to the list how that went.  In 
order to get decent filtering and faceting performance, I believe you will need 
to use bitmapped indexes which Oracle and some other databases support.

You may want to check out my article on this subject: 
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted 
 results, but I am not sure of the flexibility, extensibility, or scalability 
 of that approach. And from what I have read, Oracle Text doesn't do faceting 
 out of the box.
 
 Each document is a few MB, and there will be millions of them. I suppose it 
 depends on how I index them. I am pretty sure my current approach of using 
 Hibernate to load all rows, constructing Solr POJO's from them, and then 
 passing the POJO's to the embedded server would lead to an OOM error. I should
 probably look into the other options.
 
 Thanks.
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr
 
 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..
 
 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?
 
 Best
 Erick
 
 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:
 
 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:
 
 
 1)  Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).
 
 2)  Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.
 
 
 
 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.
 
 
 
 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.
 
 
 
 Thanks.
 






Re: XML data in solr field

2010-03-16 Thread Tommy Chheng
 Do you have the option of just importing each XML node as a
field/value when you add the document?


That'll let you do the search easily. If you need to store the raw XML, 
you can use an extra field.
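
For example, the record from the question could be flattened at index time so
each attribute becomes its own field, with the raw XML kept in an extra stored
field (the field names here are made up):

  <add>
    <doc>
      <field name="id">event-1</field>
      <field name="venue">Radio City Music Hall</field>
      <field name="link">http://bit.ly/Rndab</field>
      <field name="linktext">En savoir +</field>
      <field name="address">New-York, USA</field>
    </doc>
  </add>

A search on the venue field then returns "Radio City Music Hall" directly.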


Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/16/10 12:59 PM, Nair, Manas wrote:

Hello Experts,

I need help with this issue of mine. I am unsure if this scenario is possible.
I have a field in my Solr document named inputxml, the value of which is an XML
string as below. This XML structure is within the inputxml field value. I need
help with searching this XML structure, i.e. if I search for Venue, I should get
Radio City Music Hall as the result and not the complete tag like
<Venue value="Radio City Music Hall" />. Is this supported in Solr? If it is,
how can this be implemented?

<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result; instead I
need the tag value.

Thanks in advance,
Manas Nair



Solr RAM Requirements

2010-03-16 Thread KaktuChakarabati

Hey,
I am trying to understand what kind of calculation I should do in order to
come up with a reasonable RAM size for a given Solr machine.

Suppose the index size is 16GB,
and the max heap allocated to the JVM is about 12GB.

The machine I'm trying now has 24GB.
When the machine has been running for a while serving production, I can see in
top that the resident memory taken by the JVM is indeed at 12GB.
Now, on top of this, I assume that if I want the whole index to fit in the
disk cache I need about 12GB + 16GB = 28GB of RAM just for that. Is this kind
of calculation correct, or am I off here?

Are there any other recommendations anyone could make w.r.t. these numbers?

Thanks,
-Chak
-- 
View this message in context: 
http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Grant Ingersoll
Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.
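
A client-side sketch of that idea, using the Tika 0.6 and SolrJ 1.4 APIs (the
Solr URL, file name, and field names are all hypothetical):

  import java.io.FileInputStream;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ClientSideIndexer {
      public static void main(String[] args) throws Exception {
          // Extract the text locally with Tika (-1 = no write limit)...
          BodyContentHandler handler = new BodyContentHandler(-1);
          new AutoDetectParser().parse(new FileInputStream("doc.pdf"),
                                       handler, new Metadata());

          // ...then post plain fields to Solr instead of the raw PDF bytes.
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc1");
          doc.addField("content", handler.toString());
          server.add(doc);
          server.commit();
      }
  }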

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:
 
 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 java.util.Collections$SynchronizedMap.get(Object)
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 org.pdfbox.util.PDFTextStripper.processPages(List)
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)
 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
 org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
 org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
 org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
 org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
 org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
 org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
 org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
  Object[])
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
 TcpConnection, Object[])
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
 java.lang.Thread.run()
 
 Has anyone run into this before? Any ideas on how to reduce the contention?
 
 Thanks,
 Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



RE: Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
That is a great article, David. 

For the moment, I am trying an all-Solr approach, but I have run into a small
problem. The documents are stored as XML CLOBs using Oracle's OPAQUE object.
Is there any facility to unpack these into the actual text? Or must I execute
that in the SQL query?
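
If the CLOBs end up being pulled through DataImportHandler rather than
Hibernate, DIH ships a ClobTransformer that converts a CLOB column into a plain
string; a minimal sketch (table and column names are hypothetical):

  <entity name="doc" transformer="ClobTransformer"
          query="SELECT id, doc_xml FROM documents">
    <field column="doc_xml" clob="true"/>
  </entity>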

Thanks.


-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org] 
Sent: Tuesday, March 16, 2010 4:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

If you do stay with Oracle, please report back to the list how that went.  In 
order to get decent filtering and faceting performance, I believe you will need 
to use bitmapped indexes which Oracle and some other databases support.

You may want to check out my article on this subject: 
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted 
 results, but I am not sure of the flexibility, extensibility, or scalability 
 of that approach. And from what I have read, Oracle Text doesn't do faceting 
 out of the box.
 
 Each document is a few MB, and there will be millions of them. I suppose it 
 depends on how I index them. I am pretty sure my current approach of using 
 Hibernate to load all rows, constructing Solr POJO's from them, and then 
 passing the POJO's to the embedded server would lead to an OOM error. I should
 probably look into the other options.
 
 Thanks.
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr
 
 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..
 
 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?
 
 Best
 Erick
 
 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:
 
 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:
 
 
 1)  Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).
 
 2)  Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.
 
 
 
 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.
 
 
 
 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.
 
 
 
 Thanks.
 






RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot. 

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:
 
 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 java.util.Collections$SynchronizedMap.get(Object)
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 org.pdfbox.util.PDFTextStripper.processPages(List)
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)
 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
 org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
 org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
 org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
 org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
 org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
 org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
 org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
  Object[])
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
 TcpConnection, Object[])
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
 java.lang.Thread.run()
 
 Has anyone run into this before? Any ideas on how to reduce the contention?
 
 Thanks,
 Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may 
include a fix for the problem you're seeing.

See this discussion [2] on how to patch Tika to use the new PDFBox if you can't 
wait for the 0.7 release which should happen soon (hopefully next few weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html


On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade 
gfernandez-kinc...@capitaliq.com wrote:

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:

 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 java.util.Collections$SynchronizedMap.get(Object)
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 org.pdfbox.util.PDFTextStripper.processPages(List)
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)
 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
 org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
 org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
 org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
 org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
 org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
 org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
 org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
  Object[])
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
 TcpConnection, Object[])
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
 java.lang.Thread.run()

 Has anyone run into this before? Any ideas on how to reduce the contention?

 Thanks,
 Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search




++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/

Re: Trouble Implementing Extracting Request Handler

2010-03-16 Thread Lance Norskog
NoClassDefFoundError usually means that the class was found, but it
needs other classes and those were not found. That is, Solr finds the
ExtractingRequestHandler jar but cannot find the Tika jars.

In example/solr/conf/solrconfig.xml, there are several <lib dir="path"/>
elements. These give classpath directories and jar files
to include when loading classes (and resource files). Try adding the
paths for your Tika jars as <lib/> directives.
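
For example (the paths are hypothetical; point them at wherever your Solr Cell
and Tika jars actually live):

  <lib dir="../../contrib/extraction/lib" />
  <lib dir="../../dist" />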

On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com wrote:
 Sure. I've attached two docs that have the stack trace and the full list of
 .jar files.

 On 3/15/2010 8:34 PM, Lance Norskog wrote:

 Please post the complete stack trace. Also, it will help if you make a
 full listing of all .jar files in the example/ directory.

 On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgutsreich...@axtaweb.com
  wrote:


 Thanks Lance. That helped (we are using Solr-1.4). We've run into a
 follow-on error though. It is giving the following error:
 ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware

 Did we miss something else in the setup?

 Steve

 Is there something else we haven't copied

 On 3/15/2010 6:12 PM, Lance Norskog wrote:


 This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk.

 The ExtractingRequestHandler libraries are in contrib/extracting/lib

 You need to make a directory example/solr/lib and copy into it the
 apache-solr-cell jar from dist/ and all of the libraries from
 contrib/extracting/lib. The Wiki page has not been updated for the
 Solr 1.4 release. I just added a TODO to this effect.

 On 3/12/10, Steve Reichgutsreich...@axtaweb.com    wrote:



 Hi Grant,
  Thanks for the feedback. In reading the Wiki, it recommended that you
  copy everything from the example/solr/libs directory into a /libs directory
  in your instance. I went into my example/solr directory and only see two
  directories - bin and conf. There is no libs directory. Where else
  can I get the contents of what should be in libs?

 Steve

 On 3/12/2010 2:15 PM, Grant Ingersoll wrote:



 On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote:





 Now that I have configured my Solr instance for standard indexing, I
 wanted to start indexing PDF's, MS Doc's, etc. When I tried to test
 it
 with a simple PDF file, I got the following error:

    org.apache.solr.common.SolrException: lazy loading error
    Caused by: org.apache.solr.common.SolrException: Error loading
 class
    'org.apache.solr.handler.extraction.ExtractingRequestHandler'

 Based on the error, it appeared that the problem is caused by certain
 components not being installed or installed correctly. Since I am not
 a
 Java guy, I had my Java person try to install the
 ExtractingRequestHandler to no avail. He had said that he was having
 real
 trouble finding good documentation on how to install and enable this
 handler.

 Could anyone point me to good documentation on how to
 install/troubleshoot this?




 http://wiki.apache.org/solr/ExtractingRequestHandler

 Essentially, you need to make sure the ERH stuff is in Solr/lib before
 starting.

 -Grant



















-- 
Lance Norskog
goks...@gmail.com


Re: DIH request parameters

2010-03-16 Thread Lance Norskog
They are a namespace like other namespaces and are usable in
attributes, just like in the DB query string examples.

As for defaults, you can declare those in the requestHandler
declaration in solrconfig.xml. There are examples of this (search for
defaults) in the wiki page.
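
A sketch of both halves, with hypothetical names: the placeholder goes wherever
the value is needed in db-data-config.xml, and the default lives in the handler
declaration in solrconfig.xml:

  <!-- db-data-config.xml -->
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr"
              password="${dataimporter.request.jdbcpassword}"/>

  <!-- solrconfig.xml -->
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
      <str name="jdbcpassword">secret</str>
    </lst>
  </requestHandler>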

On Tue, Mar 16, 2010 at 7:05 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote:
 Hi,

 According to the wiki its possible to pass parameters to the DIH:
 http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

 I assume they are just being replaced via simple string replacements, which 
 is exactly what I need. Can they also be used in all places, even attributes (for 
 example to pass in the password)?

 Furthermore is there some way to define default values for these request 
 parameters in case no value is passed in?

 regards,
 Lukas Kahwe Smith
 m...@pooteeweet.org







-- 
Lance Norskog
goks...@gmail.com


RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
I'm pretty unclear on how to patch our Solr instance to use the Tika 0.7 
trunk. This is what I've tried so far (which was really just me guessing):



1. Got the latest version of the trunk code from 
http://svn.apache.org/repos/asf/lucene/tika/trunk

2. Built this using Maven (mvn install)

3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
(tika-0.3.jar).

4. Then I bounced my servlet server and tried indexing a document. The 
document was successfully indexed, and there were no errors logged as a result, 
but the PDF data does not appear to have been extracted (the field I used for 
map.content had an empty-string as a value).



What's the right approach to perform this patch?





-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, March 16, 2010 5:41 PM
To: solr-user@lucene.apache.org
Subject: RE: PDFBox/Tika Performance Issues



Thanks Chris!



I'll try the patch.



-Original Message-

From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]

Sent: Tuesday, March 16, 2010 5:37 PM

To: solr-user@lucene.apache.org

Subject: Re: PDFBox/Tika Performance Issues



Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may 
include a fix for the problem you're seeing.



See this discussion [2] on how to patch Tika to use the new PDFBox if you can't 
wait for the 0.7 release which should happen soon (hopefully next few weeks).



Cheers,

Chris



[1] http://issues.apache.org/jira/browse/TIKA-380

[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html





On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade 
gfernandez-kinc...@capitaliq.com wrote:



Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.



-Original Message-

From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll

Sent: Tuesday, March 16, 2010 5:15 PM

To: solr-user@lucene.apache.org

Subject: Re: PDFBox/Tika Performance Issues



Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?



FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.



-Grant



On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:



 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:



 http-8080-Processor39 [BLOCKED] CPU time: 9:35

 java.util.Collections$SynchronizedMap.get(Object)

 org.pdfbox.pdmodel.font.PDFont.getAFM()

 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)

 org.pdfbox.util.PDFStreamEngine.showString(byte[])

 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)

 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)

 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)

 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)

 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)

 org.pdfbox.util.PDFTextStripper.processPages(List)

 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)

 org.pdfbox.util.PDFTextStripper.getText(PDDocument)

 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)

 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)

 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)

 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)

 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)

 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)

 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)

 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)

 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)

 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)

 

Undefined field price on Dismax query

2010-03-16 Thread Alex Thurlow

Hi guys,
Based on some suggestions, I'm trying to use the dismax query 
type.  I'm getting a weird error though that I think is related to the 
default test data set.


From the query tool (/solr/admin/form.jsp), I put in this:
Statement: artist:test title:test +type:video
query type: dismax

The rest is left as defaults.  I get this error page:
HTTP ERROR: 400
undefined field price

RequestURI=/solr/select

I am running out of the example dir still, but I made my own custom 
schema and deleted the index before inserting my new data.  Am I missing 
something that needs to be cleared?  Query type=standard works fine here.


Thanks,
Alex



Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Lance Norskog
The DataImportHandler has tools for this. It will fetch rows from
Oracle and allow you to unpack columns as XML with XPaths.

http://wiki.apache.org/solr/DataImportHandler
http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
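
A minimal sketch of that combination (untested - the table/column names
and the XPaths are invented; FieldReaderDataSource is what lets
XPathEntityProcessor parse the XML coming out of a column returned by
the parent DB entity):

  <dataConfig>
    <dataSource name="db" driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@//dbhost:1521/XE"
                user="solr" password="secret" />
    <dataSource name="field" type="FieldReaderDataSource" />
    <document>
      <entity name="doc" dataSource="db"
              query="SELECT ARCHIVE_ID, XML FROM DOC">
        <field column="ARCHIVE_ID" name="id" />
        <entity name="body" dataSource="field" dataField="doc.XML"
                processor="XPathEntityProcessor" forEach="/document">
          <field column="text" xpath="/document/body" />
        </entity>
      </entity>
    </document>
  </dataConfig>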

On Tue, Mar 16, 2010 at 2:25 PM, Neil Chaudhuri
nchaudh...@potomacfusion.com wrote:
 That is a great article, David.

 For the moment, I am trying an all-Solr approach, but I have run into a small 
 problem. The documents are stored as XML CLOBs using Oracle's OPAQUE object. 
 Is there any facility to unpack this into the actual text? Or must I execute 
 that in the SQL query?

 Thanks.


 -Original Message-
 From: Smiley, David W. [mailto:dsmi...@mitre.org]
 Sent: Tuesday, March 16, 2010 4:45 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr

 If you do stay with Oracle, please report back to the list how that went.  In 
 order to get decent filtering and faceting performance, I believe you will 
 need to use bitmapped indexes which Oracle and some other databases support.

 You may want to check out my article on this subject: 
 http://www.packtpub.com/article/text-search-your-database-or-solr

 ~ David Smiley
 Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


 On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted 
 results, but I am not sure of the flexibility, extensibility, or scalability 
 of that approach. And from what I have read, Oracle Text doesn't do faceting 
 out of the box.

 Each document is a few MB, and there will be millions of them. I suppose it 
 depends on how I index them. I am pretty sure my current approach of using 
 Hibernate to load all rows, constructing Solr POJOs from them, and then 
 passing the POJOs to the embedded server would lead to an OOM error. I 
 should probably look into the other options.

 Thanks.


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr

 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..

 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?

 Best
 Erick

 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:

 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:


 1)      Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
 id:4 OR id:33432323 OR ...).

 2)      Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.



 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.



 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.



 Thanks.

-- 
Lance Norskog
goks...@gmail.com


Indexing CLOB Column in Oracle

2010-03-16 Thread Neil Chaudhuri
Since my original thread was straying to a new topic, I thought it made sense 
to create a new thread of discussion.

I am using the DataImportHandler to index 3 fields in a table: an id, a date, 
and the text of a document. This is an Oracle database, and the document is an 
XML document stored as Oracle's xmltype data type, which is an instance of 
oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob.

So in my db-data-config, I have the following:

<document>
    <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID FROM DOC d">
        <field column="EFFECTIVE_DT" name="effectiveDate" />
        <field column="ARCHIVE_ID" name="id" />
        <entity name="text" query="SELECT d.XML FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'"
                transformer="ClobTransformer">
            <field column="XML" name="text" clob="true" sourceColName="XML" />
        </entity>
    </entity>
</document>

Meanwhile, I have this in schema.xml:

<field name="text" type="text_ws" indexed="true" stored="true"
       multiValued="true" omitNorms="false" termVectors="true" />

However, when I take a look at my indexes with Luke, I find that the items 
labeled text simply say oracle.sql.OPAQUE and a bunch of numbers - in other 
words, the result of OPAQUE.toString().

Can you give me some insight into where I am going wrong?

Thanks.



Re: Trouble Implementing Extracting Request Handler

2010-03-16 Thread Steve Reichgut

Lance,

I tried that but no luck. Just in case the relative paths were causing a 
problem, I also tried using absolute paths, but neither seemed to help. 
First, I tried adding <lib dir="/path/to/example/solr/lib" /> as the 
full directory so it would hopefully include everything. When that 
didn't work, I tried adding paths directly to the two Tika jar files in 
the lib directory like this:

<lib dir="/path/to/example/solr/lib/tika-core-0.4.jar" /> and
<lib dir="/path/to/example/solr/lib/tika-parsers-0.4.jar" />

Am I including them incorrectly somehow?

Steve

On 3/16/2010 3:38 PM, Lance Norskog wrote:

NoClassDefFoundError usually means that the class was found, but it
needs other classes and those were not found. That is, Solr finds the
ExtractingRequestHandler jar but cannot find the Tika jars.

In example/solr/conf/solrconfig.xml, there are several <lib
dir="path"/> elements. These give classpath directories and jar files
to include when loading classes (and resource files). Try adding the
paths for your Tika jars as <lib/> directives.

On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com wrote:
   

Sure. I've attached two docs that have the stack trace and the full list of
.jar files.

On 3/15/2010 8:34 PM, Lance Norskog wrote:
 

Please post the complete stack trace. Also, it will help if you make a
full listing of all .jar files in the example/ directory.

On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgut sreich...@axtaweb.com
  wrote:

   

Thanks Lance. That helped ( we are using Solr-1.4). We've run into a
follow-on error though. It is giving the following error:
ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware

Did we miss something else in the setup?

Steve

Is there something else we haven't copied

On 3/15/2010 6:12 PM, Lance Norskog wrote:

 

This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk.

The ExtractingRequestHandler libraries are in contrib/extracting/lib

You need to make a directory example/solr/lib and copy into it the
apache-solr-cell jar from dist/ and all of the libraries from
contrib/extracting/lib. The Wiki page has not been updated for the
Solr 1.4 release. I just added a TODO to this effect.

On 3/12/10, Steve Reichgut sreich...@axtaweb.com wrote:


   

Hi Grant,
Thanks for the feedback. In reading the Wiki, it recommended that you
copy everything from example/solr/libs directory into a /libs directory
in your instance. I went into my example/solr directory and only see
two
directories - bin and conf. There is no libs directory. Where
else
can I get the contents of what should be in libs?

Steve

On 3/12/2010 2:15 PM, Grant Ingersoll wrote:


 

On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote:




   

Now that I have configured my Solr instance for standard indexing, I
wanted to start indexing PDF's, MS Doc's, etc. When I tried to test
it
with a simple PDF file, I got the following error:

org.apache.solr.common.SolrException: lazy loading error
Caused by: org.apache.solr.common.SolrException: Error loading
class
'org.apache.solr.handler.extraction.ExtractingRequestHandler'

Based on the error, it appeared that the problem is caused by certain
components not being installed or installed correctly. Since I am not
a
Java guy, I had my Java person try to install the
ExtractingRequestHandler to no avail. He had said that he was having
real
trouble finding good documentation on how to install and enable this
handler.

Could anyone point me to good documentation on how to
install/troubleshoot this?



 

http://wiki.apache.org/solr/ExtractingRequestHandler

Essentially, you need to make sure the ERH stuff is in Solr/lib before
starting.

-Grant


Re: Indexing CLOB Column in Oracle

2010-03-16 Thread Shawn Heisey
Disclaimer:  My Oracle experience is minuscule at best.  I am also a 
beginner at Solr, so grab yourself the proverbial grain of salt.


I googled a bit on CLOB.  One page I found mentioned setting up a view 
to return the data type you want.  Can you use the functions described 
on these pages in either the Solr query or a view?


http://www.oradev.com/dbms_lob.jsp
http://www.dba-oracle.com/t_dbms_lob.htm
http://www.praetoriate.com/dbms_packages/ddp_dbms_lob.htm

I also was trying to find a way to convert from xmltype directly to a 
string in a query, but that quickly got way over my level of 
understanding.  I saw hints that it is possible, though.
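
For example (untested, and assuming the column really is an XMLType
underneath the OPAQUE wrapper), a view along these lines might hand
Solr a plain CLOB to work with:

  CREATE OR REPLACE VIEW doc_text_v AS
  SELECT d.ARCHIVE_ID,
         d.EFFECTIVE_DT,
         d.XML.getClobVal() AS XML_TEXT  -- XMLType.getClobVal() returns the document as a CLOB
  FROM DOC d;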


Shawn

On 3/16/2010 4:59 PM, Neil Chaudhuri wrote:

Since my original thread was straying to a new topic, I thought it made sense 
to create a new thread of discussion.

I am using the DataImportHandler to index 3 fields in a table: an id, a date, 
and the text of a document. This is an Oracle database, and the document is an 
XML document stored as Oracle's xmltype data type, which is an instance of 
oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob.
   




Re: Solr RAM Requirements

2010-03-16 Thread Peter Sturge
On Tue, Mar 16, 2010 at 9:08 PM, KaktuChakarabati jimmoe...@gmail.com wrote:


 Hey,
 I am trying to understand what kind of calculation I should do in order to
 come up with reasonable RAM size for a given solr machine.

 Suppose the index size is at 16GB.
 The Max heap allocated to JVM is about 12GB.

 The machine I'm trying now has 24GB.
 When the machine is running for a while serving production, I can see in
 top
 that the resident memory taken by the jvm is indeed at 12gb.
 Now, on top of this i should assume that if i want the whole index to fit
 in
 disk cache i need about 12gb+16gb = 28GB of RAM just for that. Is this kind
 of calculation correct or am i off here?


Hmmm..not quite. The idea of the ram usage isn't to simply hold the index in
memory - if you want this use a RAMDirectory.
The memory being used will be a combination of various caches (Lucene and
Solr), index buffers et al., and of course the server itself. The specifics
depend very
much on what your server is doing at any given time - e.g. lots of
concurrent searches, lots of indexing, both etc., and how things are setup
in your solrconfig.xml.

A really excellent resource that's worth looking at regarding all this can
be found here:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr



 Any other recommendations Anyone could make w.r.t these numbers ?

 Thanks,
 -Chak
 --
 View this message in context:
 http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Undefined field price on Dismax query

2010-03-16 Thread Erick Erickson
I suspect your problem is that you still have price defined in
solrconfig.xml for the dismax handler. Look for the section
<requestHandler name="dismax" ...>.

You'll see price defined as one of the default fields for fl and bf.
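
In the stock example solrconfig.xml that section looks roughly like
this (quoting from memory, so the exact boosts may differ) - remove or
replace the price references to match your own schema:

  <requestHandler name="dismax" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
      <str name="bf">popularity^0.5 recip(price,1,1000,1000)^0.3</str>
      <str name="fl">id,name,price,score</str>
    </lst>
  </requestHandler>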

HTH
Erick

On Tue, Mar 16, 2010 at 6:55 PM, Alex Thurlow a...@blastro.com wrote:

 Hi guys,
Based on some suggestions, I'm trying to use the dismax query type.  I'm
 getting a weird error though that I think is related to the default test
 data set.

 From the query tool (/solr/admin/form.jsp), I put in this:
 Statement: artist:test title:test +type:video
 query type: dismax

 The rest is left as defaults.  I get this error page:
 HTTP ERROR: 400
 undefined field price

 RequestURI=/solr/select

 I am running out of the example dir still, but I made my own custom
 schema and deleted the index before inserting my new data.  Am I missing
 something that needs to be cleared?  Query type=standard works fine here.

 Thanks,
 Alex




Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Erick Erickson
Besides the other notes here, I agree you'll hit OOM if you try to
read all the rows into memory at once, but I'm absolutely sure you
can read them N at a time instead. Not that I could tell you how, mind
you.
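
Something along these lines might be a starting point (just a sketch -
the names are invented, and the session.clear() is the important part
so Hibernate doesn't accumulate every entity in its first-level cache):

  int batchSize = 500;
  for (int first = 0; ; first += batchSize) {
      List<Doc> docs = session.createQuery("from Doc")
                              .setFirstResult(first)
                              .setMaxResults(batchSize)
                              .list();
      if (docs.isEmpty()) break;
      for (Doc d : docs) {
          // toSolrPojo is a hypothetical mapper from the Hibernate
          // entity to the annotated Solr bean
          solrServer.addBean(toSolrPojo(d));
      }
      session.clear();  // keep Hibernate's memory footprint flat
  }
  solrServer.commit();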

You're on your way...
Erick

On Tue, Mar 16, 2010 at 4:13 PM, Neil Chaudhuri 
nchaudh...@potomacfusion.com wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted
 results, but I am not sure of the flexibility, extensibility, or scalability
 of that approach. And from what I have read, Oracle Text doesn't do faceting
 out of the box.

 Each document is a few MB, and there will be millions of them. I suppose it
 depends on how I index them. I am pretty sure my current approach of using
 Hibernate to load all rows, constructing Solr POJOs from them, and then
 passing the POJOs to the embedded server would lead to an OOM error. I
 should probably look into the other options.

 Thanks.


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr

 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..

 Although as much as I like the technology, if I could get away without
 using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?

 Best
 Erick

 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:

  I am working on an application that currently hits a database containing
  millions of very large documents. I use Oracle Text Search at the moment,
  and things work fine. However, there is a request for faceting
 capability,
  and Solr seems like a technology I should look at. Suffice to say I am
 new
  to Solr, but at the moment I see two approaches-each with drawbacks:
 
 
  1)  Have Solr index document metadata (id, subject, date). Then Use
  Oracle Text to do a content search based on criteria. Finally, query the
  Solr index for all documents whose id's match the set of id's returned by
  Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).
 
  2)  Remove Oracle Text from the equation and use Solr to query
 document
  content based on search criteria. The indexing process though will almost
  certainly encounter an OutOfMemoryError given the number and size of
  documents.
 
 
 
  I am using the embedded server and Solr Java APIs to do the indexing and
  querying.
 
 
 
  I would welcome your thoughts on the best way to approach this situation.
  Please let me know if I should provide additional information.
 
 
 
  Thanks.
 



Re: Undefined field price on Dismax query

2010-03-16 Thread Alex Thurlow
Aha.  That appears to be the issue.  I hadn't realized that the query 
handler had all of those definitions there.


-Alex


On 3/16/2010 6:56 PM, Erick Erickson wrote:

I suspect your problem is that you still have price defined in

solrconfig.xml for the dismax handler. Look for the section
<requestHandler name="dismax" ...>.

You'll see price defined as one of the default fields for fl and bf.

HTH
Erick

On Tue, Mar 16, 2010 at 6:55 PM, Alex Thurlow a...@blastro.com wrote:

   

Hi guys,
Based on some suggestions, I'm trying to use the dismax query type.  I'm
 getting a weird error though that I think is related to the default test
data set.

 From the query tool (/solr/admin/form.jsp), I put in this:
Statement: artist:test title:test +type:video
query type: dismax

The rest is left as defaults.  I get this error page:
HTTP ERROR: 400
undefined field price

RequestURI=/solr/select

I am running out of the example dir still, but I made my own custom
schema and deleted the index before inserting my new data.  Am I missing
something that needs to be cleared?  Query type=standard works fine here.

Thanks,
Alex


 
   


Solr query parser doesn't invoke analyzer for simple term query?

2010-03-16 Thread Teruhiko Kurosaka
It seems that Solr's query parser doesn't pass a single term query
to the Analyzer for the field. For example, if I give it
2001年 (year 2001 in Japanese), the searcher returns 0 hits 
but if I quote them with double-quotes, it returns hits. 
In this experiment, I configured schema.xml so that
the field in question will use the morphological Analyzer 
my company makes that is capable of splitting 2001年  
into two tokens 2001 and 年.  I am guessing that this
Analyzer is called ONLY IF the term is a phrase.
Is my observation correct?

If so, is there any configuration parameter that I can tweak 
to force any query for the text fields be processed by 
the Analyzer?

One might ask why users won't put a space between 2001 and 年.
Well, if they are clearly two separate words, people do that.
But 年 works more like a suffix in this case, and in many
Japanese speakers' minds, 2001年 seems like one token, so
many people won't.  (Remember, Japanese doesn't use spaces
in normal writing.)  Forcing the Analyzer to run would also
be useful for compound-word handling, often desirable
for languages like German.


Teruhiko Kuro Kurosaka
RLP + Lucene & Solr = powerful search for global contents



problem during benchmarking solr query

2010-03-16 Thread KshamaPai

Hi,
Am using autobench to benchmark solr with the query
http://localhost:8983/solr/select/?q=body:hotel AND
_val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

But if i specify the same in the autobench command as
autobench --file bar1.tsv --high_rate 100 --low_rate 20 --rate_step 20
--host1 localhost --single_host --port1 8983 --num_conn 10 --num_call 10
--uri1 /solr/select/?q=body:hotel AND  
_val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

it is taking body:hotel as the uri but not the _val_ part, which I think is because
of the space after hotel. Even if I try escaping this in autobench using
'\' it'll give a parse error in Solr.

Can anyone suggest how I can handle this, so that the entire query is
treated as the uri and Solr responds with an appropriate reply?
thank you.
 

-- 
View this message in context: 
http://old.nabble.com/problem-during-benchmarking-solr-query-tp27926801p27926801.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr RAM Requirements

2010-03-16 Thread Peter Sturge
There are certainly a number of widely varying opinions on the use of RAM
directory.
Basically, though, if you need the index to be persistent at some point
(i.e. saved across reboots, crashes etc.),
you'll need to write to a disk, so RAM directory becomes somewhat
superfluous in this case.

Generally, good hardware and fast disks are a better bet, since you'll
probably want to have them anyway :-)

From my own experiences with varying types/sizes of indexes, and the general
wisdom gleaned from the experts, the amount of memory required for a given
environment is very much
a 'how long is a piece of string' type of scenario. It depends on so many
factors that it's impractical to come up with an easy 'standardized' formula.

What I've found useful as rough guidance (in addition to the very useful
URL I mentioned earlier), is if your server is doing lots of indexing and
not much searching, you want your os fs cache to have access to a healthy
amount of memory.
If you're doing lots of searching/reading (and particularly faceting),
you'll want a good amount of ram for Solr/Lucene caching (which caches need
what depends on the type of data you're searching).
If you have a server that is doing a lot of both indexing and searching, you
should consider breaking them out using replication and possibly using load
balancers (if you have lots of concurrent querying going on).

It stands to reason that the bigger the index gets, the more memory will
generally be required for working on various aspects of it. When you get
into very large indexes, it becomes more efficient to distribute the
indexing across servers (and replicating those servers), so that no single
machine has huge cache lists to traverse. Again, the 'Scaling Lucene and
Solr' page goes into these scenarios and is well worth studying.



On Wed, Mar 17, 2010 at 12:29 AM, KaktuChakarabati jimmoe...@gmail.com wrote:


 Hey Peter,
 Thanks for your reply.
 My question was mainly about the fact that there seem to be two different
 aspects to Solr RAM usage: in-process and out-of-process.
 By that I mean: yes, I know the many different parameters/caches to do with
 Solr in-process memory usage and the related culprits, however I also
 understand
 that for actual index access (posting lists, positional index etc.), Solr
 mostly delegates the access/caching of this to the OS/disk cache.
 So I guess my question is more about that: namely, what would be a good way
 to calculate an overall RAM requirement profile for a server running Solr?
 Also, I was under the impression benefits from RAMDirectory would be
 minimal
 when disk caches are effective, no?
 And does RAMDirectory work with replication? If so, doesn't it slow it down
 (on each replication, loading up the entire index into RAM at once)?



 Peter Sturge wrote:
 
  On Tue, Mar 16, 2010 at 9:08 PM, KaktuChakarabati
  jimmoe...@gmail.com wrote:
 
 
  Hey,
  I am trying to understand what kind of calculation I should do in order
  to
  come up with reasonable RAM size for a given solr machine.
 
  Suppose the index size is at 16GB.
  The Max heap allocated to JVM is about 12GB.
 
  The machine I'm trying now has 24GB.
  When the machine is running for a while serving production, I can see in
  top
  that the resident memory taken by the jvm is indeed at 12gb.
  Now, on top of this i should assume that if i want the whole index to
 fit
  in
  disk cache i need about 12gb+16gb = 28GB of RAM just for that. Is this
  kind
  of calculation correct or am i off here?
 
 
  Hmmm..not quite. The idea of the ram usage isn't to simply hold the index
  in
  memory - if you want this use a RAMDirectory.
  The memory being used will be a combination of various caches (Lucene and
  Solr), index buffers et al., and of course the server itself. The
  specifics
  depend very
  much on what your server is doing at any given time - e.g. lots of
  concurrent searches, lots of indexing, both etc., and how things are
 setup
  in your solrconfig.xml.
 
  A really excellent resource that's worth looking at regarding all this
 can
  be found here:
 
 
 http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
 
 
 
  Any other recommendations Anyone could make w.r.t these numbers ?
 
  Thanks,
  -Chak
  --
  View this message in context:
  http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27926536.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Stopwords

2010-03-16 Thread blargy

I was reading Scaling Lucene and Solr
(http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
and I came across the section StopWords. 

In there it mentioned that it's not recommended to remove stop words at index
time. Why is this the case? Don't all the extraneous stopwords bloat the
index and lead to less relevant results? Can someone please explain this to
me? Thanks
-- 
View this message in context: 
http://old.nabble.com/Stopwords-tp27927028p27927028.html
Sent from the Solr - User mailing list archive at Nabble.com.



APR setup

2010-03-16 Thread blargy

[java] INFO: The APR based Apache Tomcat Native library which allows optimal
performance in production environments was not found on the
java.library.path:
.:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java

What the heck is this and why is it recommended for production settings?
Anyone?

-- 
View this message in context: 
http://old.nabble.com/APR-setup-tp27927553p27927553.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Trouble Implementing Extracting Request Handler

2010-03-16 Thread Lance Norskog
org/apache/solr/util/plugin/SolrCoreAware in the stack trace refers to
an interface in the main Solr jar.

I think this means that putting all of the libs in
apache-tomcat-6.0.20/lib is a mistake: the classloader finds
ExtractingRequestHandler in
apache-tomcat-6.0.20/lib/apache-solr-cell-1.4.1-dev.jar, but it then
wants the above interface. The main Solr jar is not available somehow.
Since the solr-cell jar is in multiple places, we don't know exactly
how Tomcat finds it.

I suggest that you go back to a clean, empty Tomcat, and the original
Solr distribution. Copy the solr war file to the right directory in
Tomcat. Get Solr talking to your solr/ directory
(-Dsolr.solr.home=path). Now, check if the lib directives in the
solrconfig.xml are right.
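
Roughly, for example (paths are placeholders for your own layout):

  # start from a clean Tomcat and the original Solr distribution
  cp dist/apache-solr-*.war $CATALINA_HOME/webapps/solr.war
  export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/example/solr"
  $CATALINA_HOME/bin/startup.sh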



On Tue, Mar 16, 2010 at 4:19 PM, Steve Reichgut sreich...@axtaweb.com wrote:
 Lance,

 I tried that but no luck. Just in case the relative paths were causing a
 problem, I also tried using absolute paths, but neither seemed to help.
 First, I tried adding <lib dir="/path/to/example/solr/lib" /> as the full
 directory so it would hopefully include everything. When that didn't work, I
 tried adding paths directly to the two Tika jar files in the lib directory
 like this:
 <lib dir="/path/to/example/solr/lib/tika-core-0.4.jar" /> and
 <lib dir="/path/to/example/solr/lib/tika-parsers-0.4.jar" />

 Am I including them incorrectly somehow?

 Steve

 On 3/16/2010 3:38 PM, Lance Norskog wrote:

 NoClassDefFoundError usually means that the class was found, but it
 needs other classes and those were not found. That is, Solr finds the
 ExtractingRequestHandler jar but cannot find the Tika jars.

 In example/solr/conf/solrconfig.xml, there are several <lib
 dir="path"/> elements. These give classpath directories and jar files
 to include when loading classes (and resource files). Try adding the
 paths for your Tika jars as <lib/> directives.

 On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com
  wrote:


 Sure. I've attached two docs that have the stack trace and the full list
 of
 .jar files.

 On 3/15/2010 8:34 PM, Lance Norskog wrote:


 Please post the complete stack trace. Also, it will help if you make a
 full listing of all .jar files in the example/ directory.

 On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgut sreich...@axtaweb.com
  wrote:



 Thanks Lance. That helped ( we are using Solr-1.4). We've run into a
 follow-on error though. It is giving the following error:
 ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware

 Did we miss something else in the setup?

 Steve

 Is there something else we haven't copied

 On 3/15/2010 6:12 PM, Lance Norskog wrote:



 This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk.

 The ExtractingRequestHandler libraries are in contrib/extracting/lib

 You need to make a directory example/solr/lib and copy into it the
 apache-solr-cell jar from dist/ and all of the libraries from
 contrib/extracting/lib. The Wiki page has not been updated for the
 Solr 1.4 release. I just added a TODO to this effect.

 On 3/12/10, Steve Reichgut sreich...@axtaweb.com wrote:




 Hi Grant,
 Thanks for the feedback. In reading the Wiki, it recommended that you
 copy everything from example/solr/libs directory into a /libs
 directory
 in your instance. I went into my example/solr directory and only see
 two
 directories - bin and conf. There is no libs directory. Where
 else
 can I get the contents of what should be in libs?

 Steve

 On 3/12/2010 2:15 PM, Grant Ingersoll wrote:




 On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote:






 Now that I have configured my Solr instance for standard indexing,
 I
 wanted to start indexing PDF's, MS Doc's, etc. When I tried to test
 it
 with a simple PDF file, I got the following error:

    org.apache.solr.common.SolrException: lazy loading error
    Caused by: org.apache.solr.common.SolrException: Error loading
 class
    'org.apache.solr.handler.extraction.ExtractingRequestHandler'

 Based on the error, it appeared that the problem is caused by
 certain
 components not being installed or installed correctly. Since I am
 not
 a
 Java guy, I had my Java person try to install the
 ExtractingRequestHandler to no avail. He had said that he was
 having
 real
 trouble finding good documentation on how to install and enable
 this
 handler.

 Could anyone point me to good documentation on how to
 install/troubleshoot this?





 http://wiki.apache.org/solr/ExtractingRequestHandler

 Essentially, you need to make sure the ERH stuff is in Solr/lib
 before
 starting.

 -Grant


-- 
Lance Norskog
goks...@gmail.com


spanish solr tutorial

2010-03-16 Thread Juan Pedro Danculovic
Hi all, we translated the Solr tutorial to Spanish due to a client's
request. For all you Spanish speakers/readers out there, you can have a look
at it:

http://www.linebee.com/?p=155

We hope this can expand the usage of the project and lower the language
barrier for non-English speakers.

Thanks

Juan Danculovic
CTO - www.linebee.com


Re: APR setup

2010-03-16 Thread Lance Norskog
That would be a Tomcat question :)

On Tue, Mar 16, 2010 at 8:36 PM, blargy zman...@hotmail.com wrote:

 [java] INFO: The APR based Apache Tomcat Native library which allows optimal
 performance in production environments was not found on the
 java.library.path:
 .:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java

 What the heck is this and why is it recommended for production settings?
 Anyone?

 --
 View this message in context: 
 http://old.nabble.com/APR-setup-tp27927553p27927553.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: problem during benchmarking solr query

2010-03-16 Thread Lance Norskog
Use a + sign or %20 for the space. The URL standard uses a plus to mean a space.
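
So the --uri1 argument would become something like (one line, quoted so
the shell doesn't mangle the parentheses and caret):

  --uri1 '/solr/select/?q=body:hotel+AND+_val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100'

or the same thing with %20 in place of each +.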

On Tue, Mar 16, 2010 at 6:06 PM, KshamaPai kshamapai2...@gmail.com wrote:

 Hi,
 Am using autobench to benchmark solr with the query
 http://localhost:8983/solr/select/?q=body:hotel AND
 _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

 But if i specify the same in the autobench command as
 autobench --file bar1.tsv --high_rate 100 --low_rate 20 --rate_step 20
 --host1 localhost --single_host --port1 8983 --num_conn 10 --num_call 10
 --uri1 /solr/select/?q=body:hotel AND
 _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

 it is taking body:hotel as the uri but not the _val_ part, which I think is because
 of the space after hotel. Even if I try escaping this in autobench using
 '\' it'll give a parse error in Solr.

 Can anyone suggest how I can handle this, so that the entire query is
 treated as the uri and Solr responds with an appropriate reply?
 thank you.


 --
 View this message in context: 
 http://old.nabble.com/problem-during-benchmarking-solr-query-tp27926801p27926801.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
Hi Giovanni,

Comments below:

 I'm pretty unclear on how to patch our Solr instance to use the Tika 0.7
 trunk. This is what I've tried so far (which was really just me guessing):
 
 
 
 1. Got the latest version of the trunk code from
 http://svn.apache.org/repos/asf/lucene/tika/trunk
 
 2. Built this using Maven (mvn install)
 

On track so far.

 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
 folder for my Solr Core, and renamed it to the name of the existing Tika Jar
 (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

Into your Solr core /lib folder. Also you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by running
mvn dependency:copy-dependencies in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers, and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files, and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.
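
Spelling the steps out as commands (a sketch - /path/to/solr-core/lib
stands in for wherever your core's lib directory actually is):

  svn co http://svn.apache.org/repos/asf/lucene/tika/trunk tika-trunk
  cd tika-trunk && mvn install
  cd tika-parsers && mvn dependency:copy-dependencies
  cp ../tika-core/target/tika-core-0.7-SNAPSHOT.jar /path/to/solr-core/lib/
  cp target/tika-parsers-0.7-SNAPSHOT.jar /path/to/solr-core/lib/
  cp target/dependency/*.jar /path/to/solr-core/lib/  # includes pdfbox-1.0.0.jar
  # then remove tika-0.3.jar and the old parser dependency jars from that directory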

 
 4. Then I bounced my servlet server and tried indexing a document. The
 document was successfully indexed, and there were no errors logged as a
 result, but the PDF data does not appear to have been extracted (the field I
 used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above and
let's go from there.

Cheers,
Chris

 -Original Message-
 From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
 Sent: Tuesday, March 16, 2010 5:41 PM
 To: solr-user@lucene.apache.org
 Subject: RE: PDFBox/Tika Performance Issues
 
 
 
 Thanks Chris!
 
 
 
 I'll try the patch.
 
 
 
 -Original Message-
 
 From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
 
 Sent: Tuesday, March 16, 2010 5:37 PM
 
 To: solr-user@lucene.apache.org
 
 Subject: Re: PDFBox/Tika Performance Issues
 
 
 
 Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
 include a fix for the problem you're seeing.
 
 
 
 See this discussion [2] on how to patch Tika to use the new PDFBox if you
 can't wait for the 0.7 release which should happen soon (hopefully next few
 weeks).
 
 
 
 Cheers,
 
 Chris
 
 
 
 [1] http://issues.apache.org/jira/browse/TIKA-380
 
 [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
 
 
 
 
 
 On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade
 gfernandez-kinc...@capitaliq.com wrote:
 
 
 
 Originally 16 (the number of CPUs on the machine), but even with 5 threads
 it's not looking so hot.
 
 
 
 -Original Message-
 
 From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
 
 Sent: Tuesday, March 16, 2010 5:15 PM
 
 To: solr-user@lucene.apache.org
 
 Subject: Re: PDFBox/Tika Performance Issues
 
 
 
 Hmm, that is an ugly thing in PDFBox.  We should probably take this over to
 the PDFBox project.  How many threads are you indexing with?
 
 
 
 FWIW, for that many documents, I might consider using Tika on the client side
 to save on a lot of network traffic.
 
 
 
 -Grant
 
 
 
 On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:
 
 
 
 I've been trying to bulk index about 11 million PDFs, and while profiling our
 Solr instance, I noticed that all of the threads that are processing indexing
 requests are constantly blocking each other during this call:
 
 
 
 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 
 java.util.Collections$SynchronizedMap.get(Object)
 
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources,
 COSStream)
 
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 
 org.pdfbox.util.PDFTextStripper.processPages(List)
 
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler,
 Metadata)
 
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler,
 Metadata)
 
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler,
 Metadata)
 
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler,
 Metadata)
 
 

Re: field length normalization

2010-03-16 Thread Lance Norskog
You need to change your similarity object to be more sensitive at the
short end. Here is a patch that shows how to do this:

http://issues.apache.org/jira/browse/LUCENE-2187

It involves Lucene coding.
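
For Solr 1.4 / Lucene 2.9 the hook is Similarity.lengthNorm(). A
minimal sketch (the exact curve is up to you - this one is just steeper
than the default 1/sqrt(n), so 3- and 4-term titles no longer collapse
to the same byte-encoded norm):

  public class ShortFieldSimilarity
          extends org.apache.lucene.search.DefaultSimilarity {
      @Override
      public float lengthNorm(String fieldName, int numTerms) {
          // 1/n spreads short fields further apart than 1/sqrt(n):
          // n=3 -> 0.333, n=4 -> 0.25, which encode to distinct bytes
          return 1.0f / numTerms;
      }
  }

You would then register it in schema.xml with something like
<similarity class="com.example.ShortFieldSimilarity"/> and reindex.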

On Fri, Mar 12, 2010 at 3:19 AM, muneeb muneeba...@hotmail.com wrote:

  Ah I see.
 Thanks very much Jay for your explanation, it really helped a lot.

 I guess I have to deal with this in some other way, since I am working with
 short titles and I really want short titles to appear at top. Can you
 suggest anything to bring titles with length 3 to appear before titles with
 length 4 (given they have similar scores)?

 Thanks,


 Jay Hill wrote:

 The fieldNorm is computed like this:
   fieldNorm = lengthNorm * documentBoost * documentFieldBoosts

 and the lengthNorm is:
   lengthNorm = 1/(numTermsInField)**.5
 [note that the value is encoded as a single byte, so there is some
 precision loss]

 So the values are not pre-set for the lengthNorm, but for some counts the
 lengthNorm value winds up being the same because of the precision loss. Here
 is a list of lengthNorm values for 1 to 10 term fields:

 # of terms    lengthNorm
    1          1.0
    2         .625
    3         .5
    4         .5
    5         .4375
    6         .375
    7         .375
    8         .3125
    9         .3125
   10         .3125

 That's why, in your example, the lengthNorm for 3 and 4 is the same.

 -Jay
 http://www.lucidimagination.com





 On Thu, Mar 11, 2010 at 9:50 AM, muneeb muneeba...@hotmail.com wrote:



 :
 : Did you reindex after setting omitNorms to false? I'm not sure whether
 or
 : not it is needed, but it makes sense.

 Yes i deleted the old index and reindexed it.
 Just to add another fact: the titles' length is less than 10. I am
 not
 sure if solr has pre-set values for length normalizations, because for
 titles with 3 as well as 4 terms the fieldNorm is coming up as 0.5 (in
 the
 debugQuery section).


 --
 View this message in context:
 http://old.nabble.com/field-length-normalization-tp27862618p27867025.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 View this message in context: 
 http://old.nabble.com/field-length-normalization-tp27862618p27874123.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Issue in search

2010-03-16 Thread Suram

In Solr, how can I perform AND, OR, and NOT searches while querying the data?
-- 
View this message in context: 
http://old.nabble.com/Issue-in-search-tp27927828p27927828.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr RAM Requirements

2010-03-16 Thread Dennis Gearon
Just turn your entire disk into RAM

http://www.hyperossystems.co.uk/

800X faster. Who cares if it swaps to 'disk' then :-)


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Tue, 3/16/10, Peter Sturge peter.stu...@googlemail.com wrote:

 From: Peter Sturge peter.stu...@googlemail.com
 Subject: Re: Solr RAM Requirements
 To: solr-user@lucene.apache.org
 Date: Tuesday, March 16, 2010, 6:25 PM
 There are certainly a number of
 widely varying opinions on the use of RAM
 directory.
 Basically, though, if you need the index to be persistent
 at some point
 (i.e. saved across reboots, crashes etc.),
 you'll need to write to a disk, so RAM directory becomes
 somewhat
 superfluous in this case.
 
 Generally, good hardware and fast disks are a better bet,
 since you'll
 probably want to have them anyway :-)
 
 From my own experiences with varying types/sizes of
 indexes, and the general
 wisdom gleaned from the experts, the amount of memory
 required for a given
 environment is very much
 a 'how long is a piece of string' type of scenario. It
 depends on so many
 factors that it's impractical to come up with an easy
 'standardized' formula.
 
 What I've found useful as a rough guidance (in additon to
 the very useful
 URL I mentioned earlier), is if your server is doing lots
 of indexing and
 not much searching, you want your os fs cache to have
 access to a healthy
 amount of memory.
 If you're doing lots of searching/reading (and particularly
 faceting),
 you'll want a good amount of ram for Solr/Lucene caching
 (which caches need
 what depends on the type of data you're searching).
 If you have a server that is doing a lot of both indexing
 and searching, you
 should consider breaking them out using replication and
 possibly using load
 balancers (if you have lots of concurrent querying going
 on).
 
 It stands to reason that the bigger the index gets, the
 more memory will
 generally be required for working on various aspects of it.
 When you get
 into very large indexes, it becomes more efficient to
 distribute the
 indexing across servers (and replicating those servers), so
 that no single
 machine has huge cache lists to traverse. Again, the
 'Scaling Lucene and
 Solr' page goes into these scenarios and is well worth
 studying.
 
 
 
 On Wed, Mar 17, 2010 at 12:29 AM, KaktuChakarabati jimmoe...@gmail.com wrote:
 
 
  Hey Peter,
  Thanks for your reply.
  My question was mainly about the fact there seems to
 be two different
  aspects to the solr RAM usage: in-process and
 out-process.
  By that I mean, yes i know the many different
 parameters/caches to do with
  solr in-process memory usage and related culprits,
 however I also
  understand
  that as for actual index access (posting list,
 positional index etc), solr
  mostly delegates the access/caching of this to the
 OS/disk cache.
  So I guess my question is more about that: namely,
 what would be a good way
  to calculate an overall ram requirement profile for a
 server running solr?
  Also, I was under the impression benefits from
 RAMDirectory would be
  minimal
  when disk caches are effective no?
  And does RAMDirectory work with replication? if so,
 doesnt it slow it down?
  ( on each replication, load up entire index to RAM at
 once? )
 
 
 
  Peter Sturge wrote:
  
   On Tue, Mar 16, 2010 at 9:08 PM,
 KaktuChakarabati
  jimmoe...@gmail.com wrote:
  
  
   Hey,
   I am trying to understand what kind of
 calculation I should do in order
   to
   come up with reasonable RAM size for a given
 solr machine.
  
   Suppose the index size is at 16GB.
   The Max heap allocated to JVM is about 12GB.
  
   The machine I'm trying now has 24GB.
   When the machine is running for a while
 serving production, I can see in
   top
   that the resident memory taken by the jvm is
 indeed at 12gb.
   Now, on top of this i should assume that if i
 want the whole index to
  fit
   in
   disk cache i need about 12gb+16gb = 28GB of
 RAM just for that. Is this
   kind
   of calculation correct or am i off here?
  
  
   Hmmm..not quite. The idea of the ram usage isn't
 to simply hold the index
   in
   memory - if you want this use a RAMDirectory.
   The memory being used will be a combination of
 various caches (Lucene and
   Solr), index buffers et al., and of course the
 server itself. The
   specifics
   depend very
   much on what your server is doing at any given
 time - e.g. lots of
   concurrent searches, lots of indexing, both etc.,
 and how things are
  setup
   in your solrconfig.xml.
  
   A really excellent resource that's worth looking
 at regarding all this
  can
   be found here:
  
  
  http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
  
  
  
   Any other recommendations Anyone could make
 w.r.t these numbers ?
  
   Thanks,
   -Chak
   --
   View this message in context: