Re: DIH transformer script size limitations with Jetty?

2010-08-12 Thread Shalin Shekhar Mangar
On Thu, Aug 12, 2010 at 5:42 AM, harrysmith harrysmith...@gmail.com wrote:


 To follow up on my own question, it appears this is only an issue when
 using
 the DataImport console debugging tools. It looks like when submitting the
 debugging request, the data-config.xml is sent via a GET request, which
 would fail.  However, using the exact same data-config.xml via a
 full-import
 operation (ie not a dry run debug), it looks like the request is sent POST
 and the import works fine.


You are right. In debug mode, the data-config is sent as a GET request. Can
you open a Jira issue?

-- 
Regards,
Shalin Shekhar Mangar.


Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
Hi,

When indexing large amounts of data I hit a problem whereby Solr becomes
unresponsive and doesn't recover (even when left overnight!). I think I've hit
some GC problems / GC tuning is required, and I wanted to know if anyone has
ever hit this problem. I can replicate this error (albeit taking longer to do
so) using the stock Solr/Lucene analysers only, so I thought other people might
have hit this issue before over large data sets.

Background on my problem follows -- but I guess my main question is -- can Solr
become so overwhelmed by update posts that it becomes completely unresponsive??

Right now I think the problem is that the Java GC is hanging, but I've been
working on this all week and it took a while to figure out it might be
GC-based / wasn't a direct result of my custom analysers, so I'd appreciate
any advice anyone has about indexing large document collections.

I also have a second question for those in the know -- do we have a chance
of indexing/searching over our large dataset with what little hardware we
already have available??

thanks in advance :)

bec

a bit of background: ---

I've got a large collection of articles we want to index/search over
-- about 180k
in total. Each article has say 500-1000 sentences and each sentence has about
15 fields, many of which are multi-valued and we store most fields as well for
display/highlighting purposes. So I'd guess over 100 million index documents.

In our small test collection of 700 articles this results in a single index of
about 13GB.

Our pipeline processes PDF files through to Solr native XML, which we call
index.xml files, i.e. in <add><doc>... format ready to post straight to Solr's
update handler.

We create the index.xml files as we pull in information from
a few sources and creation of these files from their original PDF form is
farmed out across a grid and is quite time-consuming so we distribute this
process rather than creating index.xml files on the fly...

We do a lot of linguistic processing, and enabling search over the resulting
terms requires analysers that split terms / join terms together,
i.e. custom analysers that perform string operations and are quite
time-consuming / have a large overhead compared to most analysers (they take
approx. 20-30% more time and use twice as many short-lived objects as the
text field type).

Right now I'm working on my new iMac:
quad-core 2.8 GHz intel Core i7
16 GB 1067 MHz DDR3 RAM
2TB hard-drive (about half free)
Version 10.6.4 OSX

Production environment:
2 linux boxes each with:
8-core Intel(R) Xeon(R) CPU @ 2.00GHz
16GB RAM

I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
right now).

I set up Solr to use autocommit as we'll have several document collections /
post to Solr from different data sets:

<!-- autocommit pending docs if certain criteria are met.  Future
     versions may expand the available criteria -->
<autoCommit>
  <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
  <maxTime>900000</maxTime> <!-- every 15 minutes -->
</autoCommit>

I also have:
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>
-

*** First question:
Has anyone else found that Solr hangs/becomes unresponsive after too
many documents are indexed at once i.e. Solr can't keep up with the post rate?

I've got LCF crawling my local test set (a file system connection is all
that's required) and posting documents to Solr using 6GB of RAM. As I said
above, these documents are in native Solr XML format (<add><doc>) with one
file per article, so each add contains all the sentence-level documents for
the article.

With LCF I post about 2.5/3k articles (files) per hour -- so about
2.5k*500 /3600 =
350 docs per second post-rate -- is this normal/expected??

Eventually, after about 3000 files (an hour or so) Solr starts to hang/becomes
unresponsive and with Jconsole/GC logging I can see that the Old-Gen space is
about 90% full and the following is the end of the solr log file-- where you
can see GC has been called:
--
3012.290: [GC Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 53349392
Max   Chunk Size: 3200168
Number of Blocks: 66
Av.  Block  Size: 808324
Tree  Height: 13
Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 0
Max   Chunk Size: 0
Number of Blocks: 0
Tree  Height: 0
3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
0.0769802 secs]3012.367: [CMS
--

I can replicate this with Solr using text field types in place of
those that use my
custom analysers -- whereby Solr takes longer to become unresponsive (about
3 hours / 13k docs) but there is the same kind of GC message at the end
 of the log file / Jconsole shows that the Old-Gen space was 

Re: Analysing SOLR logfiles

2010-08-12 Thread Jay Flattery
Thanks - Splunk looks like overkill.
We're extremely small scale - we're hoping for something open source :-)


- Original Message 
From: Jan Høydahl / Cominvent jan@cominvent.com
To: solr-user@lucene.apache.org
Sent: Wed, August 11, 2010 11:14:37 PM
Subject: Re: Analysing SOLR logfiles

Have a look at www.splunk.com

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 19.34, Jay Flattery wrote:

 Hi there,
 
 
 Just wondering what tools people use to analyse SOLR log files.
 
 We're looking to do things like extracting common queries, calculating average
 QTime and hits, returning particularly slow/expensive queries, etc.
 
 Would prefer not to code something (completely) from scratch.
 
 Thanks!
 
 
 
 






Re: Improve Query Time For Large Index

2010-08-12 Thread Peter Karich
Hi Robert!

  Since the example given was http being slow, its worth mentioning that if
 queries are one word urls [for example http://lucene.apache.org] these
 will actually form slow phrase queries by default.
   

do you mean that http://lucene.apache.org will be split up into http
lucene apache org and solr will perform a phrase query?

Regards,
Peter.


Re: Improve Query Time For Large Index

2010-08-12 Thread Peter Karich
Hi Tom,

I tried again with:
  <queryResultCache class="solr.LRUCache" size="1" initialSize="1"
                    autowarmCount="1"/>

and even now the hitratio is still 0. What could be wrong with my setup?

('free -m' shows that the cache has over 2 GB free.)

Regards,
Peter.

 Hi Peter,

 Can you give a few more examples of slow queries?  
 Are they phrase queries? Boolean queries? prefix or wildcard queries?
 If one word queries are your slow queries, then CommonGrams won't help.  
 CommonGrams will only help with phrase queries.

 How are you using termvectors?  That may be slowing things down.  I don't 
 have experience with termvectors, so someone else on the list might speak to 
 that.

 When you say the query time for common terms stays slow, do you mean if you 
 re-issue the exact query, the second query is not faster?  That seems very 
 strange.  You might restart Solr, and send a first query (the first query 
 always takes a relatively long time.)  Then pick one of your slow queries and 
 send it 2 times.  The second time you send the query it should be much faster 
 due to the Solr caches and you should be able to see the cache hit in the 
 Solr admin panel.  If you send the exact query a second time (without enough 
 intervening queries to evict data from the cache, ) the Solr queryResultCache 
 should get hit and you should see a response time in the .01-5 millisecond 
 range.

 What settings are you using for your Solr caches?

 How much memory is on the machine?  If your bottleneck is disk i/o for 
 frequent terms, then you want to make sure you have enough memory for the OS 
 disk cache.  

 I assume that http is not in your stopwords.  CommonGrams will only help with 
 phrase queries
 CommonGrams was committed and is in Solr 1.4.  If you decide to use 
 CommonGrams you definitely need to re-index and you also need to use both the 
 index time filter and the query time filter.  Your index will be larger.

 <fieldType name="foo" ...>
   <analyzer type="index">
     <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
   </analyzer>

   <analyzer type="query">
     <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
   </analyzer>
 </fieldType>



 Tom
 -Original Message-
 From: Peter Karich [mailto:peat...@yahoo.de] 
 Sent: Tuesday, August 10, 2010 3:32 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Improve Query Time For Large Index

 Hi Tom,

 my index is around 3GB large and I am using 2GB RAM for the JVM although
  some more is available.
 If I am looking into the RAM usage while a slow query runs (via
 jvisualvm) I see that only 750MB of the JVM RAM is used.

   
 Can you give us some examples of the slow queries?
 
 for example the empty query solr/select?q=
 takes very long or solr/select?q=http
 where 'http' is the most common term

   
 Are you using stop words?  
 
 yes, a lot. I stored them into stopwords.txt

   
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
 
 this looks interesting. I read through
 https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
 I only need to enable it via:

  <filter class="solr.CommonGramsFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>

 right? Do I need to reindex?

 Regards,
 Peter.

   
 Hi Peter,

 A few more details about your setup would help list members to answer your 
 questions.
 How large is your index?  
 How much memory is on the machine and how much is allocated to the JVM?
 Besides the Solr caches, Solr and Lucene depend on the operating system's 
 disk caching for caching of postings lists.  So you need to leave some 
 memory for the OS.  On the other hand if you are optimizing and refreshing 
 every 10-15 minutes, that will invalidate all the caches, since an optimized 
 index is essentially a set of new files.

 Can you give us some examples of the slow queries?  Are you using stop 
 words?  

 If your slow queries are phrase queries, then you might try either adding 
 the most frequent terms in your index to the stopwords list  or try 
 CommonGrams and add them to the common words list.  (Details on CommonGrams 
 here: 
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)

 Tom Burton-West

 -Original Message-
 From: Peter Karich [mailto:peat...@yahoo.de] 
 Sent: Tuesday, August 10, 2010 9:54 AM
 To: solr-user@lucene.apache.org
 Subject: Improve Query Time For Large Index

 Hi,

 I have 5 Million small documents/tweets (= ~3GB) and the slave index
 replicates itself from master every 10-15 minutes, so the index is
 optimized before querying. We are using solr 1.4.1 (patched with
 SOLR-1624) via SolrJ.

 Now the search speed is slow 2s for common terms which hits more than 2
 mio docs and acceptable for others: 0.5s. For those numbers I don't use
 highlighting or facets. I am using the following schema [1] and from
 luke handler I know that numTerms =~20 mio. The query for common terms
 stays slow if I 

Re: Improve Query Time For Large Index

2010-08-12 Thread Peter Karich
Hi Tom!

 Hi Peter,

 Can you give a few more examples of slow queries?  
 Are they phrase queries? Boolean queries? prefix or wildcard queries?
   

I am experimenting with one word queries only at the moment.

 If one word queries are your slow queries, then CommonGrams won't help.  
 CommonGrams will only help with phrase queries.
   

hmmh, ok.

 How are you using termvectors? 
yes.

 That may be slowing things down.  I don't have experience with termvectors, 
 so someone else on the list might speak to that.
   

ok. But for highlighting I'll need them to speed things up (a lot).


 When you say the query time for common terms stays slow, do you mean if you 
 re-issue the exact query, the second query is not faster?  That seems very 
 strange. 

Yes. Indeed. The queryResultCache has no hits at all. Strange.

  You might restart Solr, and send a first query (the first query always takes 
 a relatively long time.)  Then pick one of your slow queries and send it 2 
 times.  The second time you send the query it should be much faster due to 
 the Solr caches and you should be able to see the cache hit in the Solr admin 
 panel.  If you send the exact query a second time (without enough intervening 
 queries to evict data from the cache, ) the Solr queryResultCache should get 
 hit and you should see a response time in the .01-5 millisecond range.
   

That's not the case. The second query is only a few milliseconds
faster (but stays at ~2s). But I'm not sure what I am doing wrong. The
other 3 caches have a good hit ratio but the queryResultCache has 0. For
the queryResultCache I am using:
<queryResultCache class="solr.LRUCache" size="400" initialSize="400"
                  autowarmCount="400"/>

But even if I double that it didn't make the hit ratio > 0.
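
For reference, this is the kind of check being discussed -- a minimal SolrJ
sketch that sends the exact same query twice and compares QTime (the URL and
the query term are placeholders, not my real setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CacheCheck {
  public static void main(String[] args) throws Exception {
    // placeholder URL -- point this at the real Solr instance
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("http");  // one of the slow common-term queries

    // the first run is expected to be slow and should fill the caches
    QueryResponse first = solr.query(q);
    // an identical second run should normally be served from the
    // queryResultCache and come back in a few milliseconds
    QueryResponse second = solr.query(q);

    System.out.println("1st QTime=" + first.getQTime()
        + " ms, 2nd QTime=" + second.getQTime() + " ms");
  }
}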

 How much memory is on the machine?  If your bottleneck is disk i/o for 
 frequent terms, then you want to make sure you have enough memory for the OS 
 disk cache.  
   

Yes, there should be enough memory for the OS-disc-cache.

 I assume that http is not in your stopwords.

exactly.


 CommonGrams will only help with phrase queries. CommonGrams was committed and 
 is in Solr 1.4.  If you decide to use CommonGrams you definitely need to 
 re-index and you also need to use both the index time filter and the query 
 time filter.  Your index will be larger.

 <fieldType name="foo" ...>
   <analyzer type="index">
     <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
   </analyzer>

   <analyzer type="query">
     <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
   </analyzer>
 </fieldType>
   

Thanks, I will try that, if I can solve the current issue :-)
And thanks for all your answers, I will try to experiment with my setup
in more detail now ...

Kind regards,
Peter.



 Subject: Re: Improve Query Time For Large Index

 Hi Tom,

 my index is around 3GB large and I am using 2GB RAM for the JVM although
 a some more is available.
 If I am looking into the RAM usage while a slow query runs (via
 jvisualvm) I see that only 750MB of the JVM RAM is used.

   
 Can you give us some examples of the slow queries?
 
 for example the empty query solr/select?q=
 takes very long or solr/select?q=http
 where 'http' is the most common term

   
 Are you using stop words?  
 
 yes, a lot. I stored them into stopwords.txt

   
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
 
 this looks interesting. I read through
 https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
 I only need to enable it via:

 filter class=solr.CommonGramsFilterFactory ignoreCase=true 
 words=stopwords.txt/

 right? Do I need to reindex?

 Regards,
 Peter.

   
 Hi Peter,

 A few more details about your setup would help list members to answer your 
 questions.
 How large is your index?  
 How much memory is on the machine and how much is allocated to the JVM?
 Besides the Solr caches, Solr and Lucene depend on the operating system's 
 disk caching for caching of postings lists.  So you need to leave some 
 memory for the OS.  On the other hand if you are optimizing and refreshing 
 every 10-15 minutes, that will invalidate all the caches, since an optimized 
 index is essentially a set of new files.

 Can you give us some examples of the slow queries?  Are you using stop 
 words?  

 If your slow queries are phrase queries, then you might try either adding 
 the most frequent terms in your index to the stopwords list  or try 
 CommonGrams and add them to the common words list.  (Details on CommonGrams 
 here: 
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2)

 Tom Burton-West

 -Original Message-
 From: Peter Karich [mailto:peat...@yahoo.de] 
 Sent: Tuesday, August 10, 2010 9:54 AM
 To: solr-user@lucene.apache.org
 Subject: Improve Query Time For Large Index

 Hi,

 I have 5 Million small documents/tweets (= ~3GB) and the slave index
 replicates itself from master every 10-15 minutes, so the index is
 

Re: Analysing SOLR logfiles

2010-08-12 Thread Rebecca Watson
we've just started using awstats - as suggested by the solr 1.4 book.

it's open source!:
http://awstats.sourceforge.net/

On 12 August 2010 18:18, Jay Flattery jayc...@rocketmail.com wrote:
 Thanks - splunk looks overkill.
 We're extremely small scale - were hoping for something open source :-)


 - Original Message 
 From: Jan Høydahl / Cominvent jan@cominvent.com
 To: solr-user@lucene.apache.org
 Sent: Wed, August 11, 2010 11:14:37 PM
 Subject: Re: Analysing SOLR logfiles

 Have a look at www.splunk.com

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Training in Europe - www.solrtraining.com

 On 11. aug. 2010, at 19.34, Jay Flattery wrote:

 Hi there,


 Just wondering what tools people use to analyse SOLR log files.

 We're looking to do things like extracting common queries, calculating
averaging


 Qtime and hits, returning particularly slow/expensive queries, etc.

 Would prefer not to code something (completely) from scratch.

 Thanks!











Re: Improve Query Time For Large Index

2010-08-12 Thread Robert Muir
exactly!

On Thu, Aug 12, 2010 at 5:26 AM, Peter Karich peat...@yahoo.de wrote:

 Hi Robert!

   Since the example given was http being slow, its worth mentioning that
 if
  queries are one word urls [for example http://lucene.apache.org] these
  will actually form slow phrase queries by default.
 

 do you mean that http://lucene.apache.org will be split up into http
 lucene apache org and solr will perform a phrase query?

 Regards,
 Peter.




-- 
Robert Muir
rcm...@gmail.com


Re: Multiple Facet Dates

2010-08-12 Thread Raphaël Droz

On 05/08/2010 09:59, Raphaël Droz wrote:

Hi,
I saw this post : 
http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html
I didn't see work in progress or plans about this feature on the list 
or in the bug tracker.


Has someone already created a patch, proof of concept, ... that I wouldn't 
have been able to find?
From my naïve point of view the usefulness / added code complexity ratio 
appears high.


My use-case is to provide, in one request:
- the result count for each one of several years (tag-based exclusion)
- the result count for each month of a given year
- the result count for each day of a given month and year

I'm pretty sure someone here has already encountered the above, hasn't anyone?

After having understood "This parameter can be specified on a per field 
basis.", I created 3 more copy-fields; it's then obvious:

// the real constraint requested
fq={!tag=datefq}date
f.date.facet.date.start=2008-12-08T06:00:00Z
f.date.facet.date.end=2008-12-09T06:00:00Z
f.date.facet.date.gap=+1DAY

// three more field for the total
facet.date={!ex%3Ddatefq}date_for_year
facet.date={!ex%3Ddatefq}date_for_year_month
facet.date={!ex%3Ddatefq}date_for_year_month_day

// the count for all years without the constraint
f.date_for_year.facet.date.start=1970-01-01T06:00:00Z
f.date_for_year.facet.date.end=2011-01-01T06:00:00Z
f.date_for_year.facet.date.gap=+1YEAR

// the count for all months of the requested year (2008) without the 
constraint

f.date_for_year_month.facet.date.start=2008-01-01T06:00:00Z
f.date_for_year_month.facet.date.end=2008-12-31T06:00:00Z
f.date_for_year_month.facet.date.gap=+1MONTH

// idem for the days...
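
For completeness, a minimal SolrJ sketch building the same kind of request
(the filter range and the Solr URL are placeholders, not part of the original
request):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DateFacetExample {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);

    // the real constraint, tagged so the facet fields can exclude it
    q.addFilterQuery("{!tag=datefq}date:[2008-12-08T06:00:00Z TO 2008-12-09T06:00:00Z]");

    // one facet.date per copy-field, each ignoring the tagged filter
    q.add("facet.date", "{!ex=datefq}date_for_year");
    q.add("facet.date", "{!ex=datefq}date_for_year_month");

    // per-field ranges: yearly counts on date_for_year ...
    q.set("f.date_for_year.facet.date.start", "1970-01-01T06:00:00Z");
    q.set("f.date_for_year.facet.date.end", "2011-01-01T06:00:00Z");
    q.set("f.date_for_year.facet.date.gap", "+1YEAR");

    // ... and monthly counts on date_for_year_month
    q.set("f.date_for_year_month.facet.date.start", "2008-01-01T06:00:00Z");
    q.set("f.date_for_year_month.facet.date.end", "2008-12-31T06:00:00Z");
    q.set("f.date_for_year_month.facet.date.gap", "+1MONTH");

    System.out.println(solr.query(q));
  }
}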

Thanks for your work !

Raph


Solr branches

2010-08-12 Thread Tomasz Wegrzanowski
Hi,

I'm having OutOfMemoryError problems with Solr. From random browsing
I'm getting the impression that a lot of memory fixes happened
recently in Solr and Lucene.

Could you give me a quick summary how (un)stable are different
lucene / solr branches and how much improvement I can expect?


Re: Analysing SOLR logfiles

2010-08-12 Thread Peter Karich

I wonder, too, that there isn't a special tool which analyzes Solr
logfiles (e.g. parses QTime and the parameters q, fq, ...).

Because there are some other open source log analyzers out there:
http://yaala.org/ http://www.mrunix.net/webalizer/

Another free tool is newrelic.com (you will submit your query data to
this site, same as for google analytics). Setup is easy.

For traffic on our site which triggers the solr search we use piwik and
common queries can be extracted easily. Setup was done in 5 minutes.

Regards,
Peter.

 we've just started using awstats - as suggested by the solr 1.4 book.

 its open source!:
 http://awstats.sourceforge.net/

 On 12 August 2010 18:18, Jay Flattery jayc...@rocketmail.com wrote:
   
 Thanks - splunk looks overkill.
 We're extremely small scale - were hoping for something open source :-)


 - Original Message 
 From: Jan Høydahl / Cominvent jan@cominvent.com
 To: solr-user@lucene.apache.org
 Sent: Wed, August 11, 2010 11:14:37 PM
 Subject: Re: Analysing SOLR logfiles

 Have a look at www.splunk.com

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Training in Europe - www.solrtraining.com

 On 11. aug. 2010, at 19.34, Jay Flattery wrote:

 
 Hi there,


 Just wondering what tools people use to analyse SOLR log files.

 We're looking to do things like extracting common queries, calculating
 averaging


 Qtime and hits, returning particularly slow/expensive queries, etc.

 Would prefer not to code something (completely) from scratch.

 Thanks!


   



indexing???

2010-08-12 Thread satya swaroop
Hi all,
   The indexing part of Solr is going well, but I got an error on indexing
a single PDF file. When I searched for the error in the mailing list I found
that the error was due to the copyright on that file. Can't we index a file
which has copyright or other digital rights???

regards,
  satya


Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Lannig Carina
Hi,

I'm trying to index a txt-File (~150MB) using Solr Cell/Tika.
The curl command aborts due to a java.lang.OutOfMemoryError.
*
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at java.lang.StringBuilder.toString(StringBuilder.java:430)
        at org.apache.solr.handler.extraction.SolrContentHandler.newDocument(SolrContentHandler.java:124)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:119)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:125)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:619)
(followed by the tail of the Tomcat 6.0.26 HTML error page: "... that prevented
it from fulfilling this request.")
*

AFAIK Tika keeps the whole file in RAM and posts it as one single string to 
Solr.
I'm using the JVM arg -Xmx1024M and the Solr default config with
*
  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    ...
  </mainIndex>

  <requestDispatcher handleSelect="true">
    <!-- Make sure your system has some authentication before enabling remote
         streaming! -->
    <requestParsers enableRemoteStreaming="true"
                    multipartUploadLimitInKB="2048000" />
    ...
*
Is there a way to force Solr/Tika to flush memory while indexing a file?
Increasing RAM in proportion to the size of the largest file to index seems not 
very nice.
Did I miss some configuration option or do I have to modify Java code? I just 
found http://osdir.com/ml/tika-dev.lucene.apache.org/2009-02/msg00020.html and 
I'm wondering if there is a solution yet.

Carina
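
One possible workaround (not a Solr Cell/Tika option -- just a client-side
sketch): split the large text file into chunks and index each chunk as its own
document, so no single document needs the whole file in memory. The field
names, URL and chunk size below are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedTextIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    BufferedReader in = new BufferedReader(new FileReader("big-file.txt"));

    StringBuilder chunk = new StringBuilder();
    String line;
    int part = 0;
    while ((line = in.readLine()) != null) {
      chunk.append(line).append('\n');
      if (chunk.length() > 1000000) {          // roughly 1 MB of text per document
        addChunk(solr, part++, chunk.toString());
        chunk.setLength(0);
      }
    }
    if (chunk.length() > 0) {
      addChunk(solr, part, chunk.toString());
    }
    in.close();
    solr.commit();
  }

  private static void addChunk(SolrServer solr, int part, String text) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "big-file.txt#" + part);  // hypothetical id scheme
    doc.addField("text", text);                  // hypothetical text field
    solr.add(doc);
  }
}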

Re: Solr branches

2010-08-12 Thread Koji Sekiguchi

(10/08/12 21:06), Tomasz Wegrzanowski wrote:

Hi,

I'm having oome problems with solr. From random browsing
I'm getting an impression that a lot of memory fixes happened
recently in solr and lucene.

Could you give me a quick summary how (un)stable are different
lucene / solr branches and how much improvement I can expect?

   

Lucene/Solr have CHANGES.txt. You can refer to it to see
how much Lucene/Solr has improved since the previous release.

Koji

--
http://www.rondhuit.com/en/



Re: Schema Definition Question

2010-08-12 Thread kenf_nc

One way I've handled this, and it works only for some types of data, is to put
the searchable part of the sub-doc in a search field (indexed=true) and put an
XML or JSON representation of the sub-doc in a stored-only field. Then if the
main doc is hit via search I can grab the XML or JSON, convert it to an object
graph and do whatever I want. A minimal sketch of this is below.

If you need to search on a variety of elements in the sub-doc this becomes a
less useful approach. But in some use-cases it worked for me.
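
A minimal SolrJ sketch of what I mean (field names like subdoc_text /
subdoc_json are illustrative only; subdoc_text would be indexed, subdoc_json
stored-only):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SubDocExample {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "parent-1");
    // searchable part of the sub-document (indexed="true")
    doc.addField("subdoc_text", "ACME Widget 42 blue aluminium");
    // full sub-document as JSON in a stored-only field (indexed="false" stored="true")
    doc.addField("subdoc_json",
        "{\"name\":\"ACME Widget 42\",\"color\":\"blue\",\"material\":\"aluminium\"}");

    solr.add(doc);
    solr.commit();
  }
}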
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Schema-Definition-Question-tp1049966p1110159.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
I am a little confused - how did 180k documents become 100m index documents?
We have over 20 indices (for different content sets), one with 5m
documents (about a couple of pages each) and another with 100k+ docs.
We can index the 5m collection in a couple of days (limitation is in
the source) which is 100k documents an hour without breaking a sweat.



On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 Hi,

 When indexing large amounts of data I hit a problem whereby Solr
 becomes unresponsive
 and doesn't recover (even when left overnight!). I think i've hit some
 GC problems/tuning
 is required of GC and I wanted to know if anyone has ever hit this problem.
 I can replicate this error (albeit taking longer to do so) using
 Solr/Lucene analysers
 only so I thought other people might have hit this issue before over
 large data sets

 Background on my problem follows -- but I guess my main question is -- can
 Solr
 become so overwhelmed by update posts that it becomes completely
 unresponsive??

 Right now I think the problem is that the java GC is hanging but I've
 been working
 on this all week and it took a while to figure out it might be
 GC-based / wasn't a
 direct result of my custom analysers so i'd appreciate any advice anyone has
 about indexing large document collections.

 I also have a second questions for those in the know -- do we have a chance
 of indexing/searching over our large dataset with what little hardware
 we already
 have available??

 thanks in advance :)

 bec

 a bit of background: ---

 I've got a large collection of articles we want to index/search over
 -- about 180k
 in total. Each article has say 500-1000 sentences and each sentence has
 about
 15 fields, many of which are multi-valued and we store most fields as well
 for
 display/highlighting purposes. So I'd guess over 100 million index
 documents.

 In our small test collection of 700 articles this results in a single index
 of
 about 13GB.

 Our pipeline processes PDF files through to Solr native xml which we call
 index.xml files i.e. in adddoc... format ready to post straight to
 Solr's
 update handler.

 We create the index.xml files as we pull in information from
 a few sources and creation of these files from their original PDF form is
 farmed out across a grid and is quite time-consuming so we distribute this
 process rather than creating index.xml files on the fly...

 We do a lot of linguistic processing and to enable search functionality
 of our resulting terms requires analysers that split terms/ join terms
 together
 i.e. custom analysers that perform string operations and are quite
 time-consuming/
 have large overhead compared to most analysers (they take approx
 20-30% more time
 and use twice as many short-lived objects than the text field type).

 Right now i'm working on my new Imac:
 quad-core 2.8 GHz intel Core i7
 16 GB 1067 MHz DDR3 RAM
 2TB hard-drive (about half free)
 Version 10.6.4 OSX

 Production environment:
 2 linux boxes each with:
 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
 16GB RAM

 I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
 right now).

 I setup Solr to use autocommit as we'll have several document collections /
 post
 to Solr from different data sets:

  !-- autocommit pending docs if certain criteria are met.  Future
 versions may expand the available
  criteria --
 autoCommit
   maxDocs50/maxDocs !-- every 1000 articles --
   maxTime90/maxTime !-- every 15 minutes --
 /autoCommit

 I also have
   useCompoundFilefalse/useCompoundFile
 ramBufferSizeMB1024/ramBufferSizeMB
 mergeFactor10/mergeFactor
 -

 *** First question:
 Has anyone else found that Solr hangs/becomes unresponsive after too
 many documents are indexed at once i.e. Solr can't keep up with the post
 rate?

 I've got LCF crawling my local test set (file system connection
 required only) and
 posting documents to Solr using 6GB of RAM. As I said above, these documents
 are in native Solr XML format (adddoc) with one file per article so
 each
 add contains all the sentence-level documents for the article.

 With LCF I post about 2.5/3k articles (files) per hour -- so about
 2.5k*500 /3600 =
 350 docs per second post-rate -- is this normal/expected??

 Eventually, after about 3000 files (an hour or so) Solr starts to
 hang/becomes
 unresponsive and with Jconsole/GC logging I can see that the Old-Gen space
 is
 about 90% full and the following is the end of the solr log file-- where you
 can see GC has been called:
 --
 3012.290: [GC Before GC:
 Statistics for BinaryTreeDictionary:
 
 Total Free Space: 53349392
 Max   Chunk Size: 3200168
 Number of Blocks: 66
 Av.  Block  Size: 808324
 Tree  Height: 13
 Before GC:
 Statistics for BinaryTreeDictionary:
 
 Total Free 

Re: Solr branches

2010-08-12 Thread Tomasz Wegrzanowski
On 12 August 2010 13:46, Koji Sekiguchi k...@r.email.ne.jp wrote:
 (10/08/12 21:06), Tomasz Wegrzanowski wrote:

 Hi,

 I'm having oome problems with solr. From random browsing
 I'm getting an impression that a lot of memory fixes happened
 recently in solr and lucene.

 Could you give me a quick summary how (un)stable are different
 lucene / solr branches and how much improvement I can expect?

 Lucene/Solr have CHANGES.txt. You can refer to it to see
 how much Lucene/Solr get improved from previous release.

This is technically true, but I'm not sufficiently familiar with
solr/lucene development process to infer much about performance
and stability of different branches from it.


Re: Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Gora Mohanty
On Thu, 12 Aug 2010 14:32:19 +0200
Lannig Carina lan...@ssi-schaefer-noell.com wrote:

 Hi,
 
 I'm trying to index a txt-File (~150MB) using Solr Cell/Tika.
 The curl command aborts due to a java.lang.OutOfMemoryError.
[...]
 AFAIK Tika keeps the whole file in RAM and posts it as one single
 string to Solr. I'm using JVM-args: Xmx1024M and solr default
 config with
[...]

Do not know about Tika, but what is the size of your Solr index,
and the number of documents in it? Solr seems to need RAM, and
while we did not do real benchmarks then, even with a few tens of
thousands of documents, performance seemed to improve by allocating
2GB RAM. Besides, unless you are on a very tight budget, throwing a
few GB more RAM at the problem seems to be an easy, and not
very expensive, way out.

Regards,
Gora


Re: Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
sorry -- I used the term "documents" too loosely!

180k scientific articles with between 500-1000 sentences each,
and we index sentence-level index documents,
so I'm guessing about 100 million Lucene index documents in total.

an update on my progress:

i used GC settings of:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
-XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=70

which allowed the indexing process to run to 11.5k articles and
for about 2 hours before I got the same kind of hanging/unresponsive Solr with
this as the tail of the Solr logs:

Before GC:
Statistics for BinaryTreeDictionary:

Total Free Space: 2416734
Max   Chunk Size: 2412032
Number of Blocks: 3
Av.  Block  Size: 805578
Tree  Height: 3
5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480: [CMS

I also saw (in jconsole) that the number of threads rose from the
steady 32 used for the
2 hours to 72 before Solr finally became unresponsive...

i've got the following GC info params switched on (as many as i could find!):
-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
-XX:PrintFLSStatistics=1

with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
million fairly small
docs per hour!! this produced an index of about 40GB to give you an
idea of index
size...

because I've already got the documents in Solr native XML format,
i.e. one file per article, each with <add><doc>...</doc>,
i.e. posting each set of sentence docs per article in every LCF file post...
this means that LCF can throw documents at Solr very fast and I think I'm
breaking it GC-wise.

I'm going to try adding in System.gc() calls to see if this runs OK
(albeit slower)...
otherwise I'm pretty much at a loss as to what could be causing this GC issue /
Solr hanging, if it's not a GC issue...

thanks :)

bec

On 12 August 2010 21:42, dc tech dctech1...@gmail.com wrote:
 I am a little confused - how did 180k documents become 100m index documents?
 We use have over 20 indices (for different content sets), one with 5m
 documents (about a couple of pages each) and another with 100k+ docs.
 We can index the 5m collection in a couple of days (limitation is in
 the source) which is 100k documents an hour without breaking a sweat.



 On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 Hi,

 When indexing large amounts of data I hit a problem whereby Solr
 becomes unresponsive
 and doesn't recover (even when left overnight!). I think i've hit some
 GC problems/tuning
 is required of GC and I wanted to know if anyone has ever hit this problem.
 I can replicate this error (albeit taking longer to do so) using
 Solr/Lucene analysers
 only so I thought other people might have hit this issue before over
 large data sets

 Background on my problem follows -- but I guess my main question is -- can
 Solr
 become so overwhelmed by update posts that it becomes completely
 unresponsive??

 Right now I think the problem is that the java GC is hanging but I've
 been working
 on this all week and it took a while to figure out it might be
 GC-based / wasn't a
 direct result of my custom analysers so i'd appreciate any advice anyone has
 about indexing large document collections.

 I also have a second questions for those in the know -- do we have a chance
 of indexing/searching over our large dataset with what little hardware
 we already
 have available??

 thanks in advance :)

 bec

 a bit of background: ---

 I've got a large collection of articles we want to index/search over
 -- about 180k
 in total. Each article has say 500-1000 sentences and each sentence has
 about
 15 fields, many of which are multi-valued and we store most fields as well
 for
 display/highlighting purposes. So I'd guess over 100 million index
 documents.

 In our small test collection of 700 articles this results in a single index
 of
 about 13GB.

 Our pipeline processes PDF files through to Solr native xml which we call
 index.xml files i.e. in adddoc... format ready to post straight to
 Solr's
 update handler.

 We create the index.xml files as we pull in information from
 a few sources and creation of these files from their original PDF form is
 farmed out across a grid and is quite time-consuming so we distribute this
 process rather than creating index.xml files on the fly...

 We do a lot of linguistic processing and to enable search functionality
 of our resulting terms requires analysers that split terms/ join terms
 together
 i.e. custom analysers that perform string operations and are quite
 time-consuming/
 have large overhead compared to most analysers (they take approx
 20-30% more time
 and use twice as many short-lived objects than the text field type).

 Right now i'm working on my new Imac:
 quad-core 2.8 GHz intel Core i7
 16 GB 1067 

Deleting with the DIH sometimes doesn't delete

2010-08-12 Thread Qwerky

I'm doing deletes with the DIH but getting mixed results. Sometimes the
documents get deleted, other times I can still find them in the index. What
would prevent a doc from getting deleted?

For example, I delete 594039 and get this in the logs;

2010-08-12 14:41:55,625 [Thread-210] INFO  [DataImporter] Starting Delta
Import
2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Read
productimportupdate.properties
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Starting delta
collection.
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Running
ModifiedRowKey() for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
ModifiedRowKey for Entity: item rows obtained : 0
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
DeletedRowKey for Entity: item rows obtained : 1
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
parentDeltaQuery for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Deleting stale
documents 
2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Deleting document:
594039
2010-08-12 14:41:55,703 [Thread-210] INFO  [SolrDeletionPolicy] newest
commit = 1281030128383
2010-08-12 14:41:55,718 [Thread-210] DEBUG [SolrIndexWriter] Opened Writer
DirectUpdateHandler2
2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Delta Import
completed successfully
2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Import completed
successfully
2010-08-12 14:41:55,718 [Thread-210] INFO  [DirectUpdateHandler2] start
commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 14:42:08,562 [Thread-210] DEBUG [SolrIndexWriter] Closing Writer
DirectUpdateHandler2
2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy]
SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_8,version=1281030128383,generation=8,filenames=[_39.frq,
_2i.fdx, _39.tis, _39.prx, _39.fnm, _2i.fdt, _39.tii, _39.nrm, segments_8]

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
_3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy] newest
commit = 1281030128384

..this works fine; I can no longer find 594039 in the index. But a little
later I delete a couple more (33252 and 105224) and get the following (I
added two docs at the same time);

2010-08-12 15:27:42,828 [Thread-217] INFO  [DataImporter] Starting Delta
Import
2010-08-12 15:27:42,828 [Thread-217] INFO  [SolrWriter] Read
productimportupdate.properties
2010-08-12 15:27:42,828 [Thread-217] INFO  [DocBuilder] Starting delta
collection.
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Running
ModifiedRowKey() for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
ModifiedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
DeletedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
parentDeltaQuery for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Deleting stale
documents 
2010-08-12 15:27:42,843 [Thread-217] INFO  [SolrWriter] Deleting document:
33252
2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy]
SolrDeletionPolicy.onInit: commits:num=1

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
_3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy] newest
commit = 1281030128384
2010-08-12 15:27:42,906 [Thread-217] DEBUG [SolrIndexWriter] Opened Writer
DirectUpdateHandler2
2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrWriter] Deleting document:
105224
2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Delta Import
completed successfully
2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Import completed
successfully
2010-08-12 15:27:42,906 [Thread-217] INFO  [DirectUpdateHandler2] start
commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 15:27:55,578 [Thread-217] DEBUG [SolrIndexWriter] Closing Writer
DirectUpdateHandler2
2010-08-12 15:27:56,875 [Thread-217] INFO  [SolrDeletionPolicy]
SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
_3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]

commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_a,version=1281030128385,generation=10,filenames=[_3c.tis,
_3c.fdt, _3c.fnm, _3c.nrm, _3c.tii, segments_a, _3c.fdx, _3c.prx, _3c.frq]
2010-08-12 15:27:56,875 [Thread-217] INFO  [SolrDeletionPolicy] newest
commit = 1281030128385
-- 
View this message in context: 

index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
I wrote a simple Java program to import a PDF file. I can get a result when I 
do a *:* search from the admin page. I get nothing if I search for a word. I 
wonder if I did something wrong or missed setting something. 

Here is part of result I get when do *:* search:
*
<doc>
  <arr name="attr_Author">
    <str>Hristovski D</str>
  </arr>
  <arr name="attr_Content-Type">
    <str>application/pdf</str>
  </arr>
  <arr name="attr_Keywords">
    <str>microarray analysis, literature-based discovery, semantic predications,
      natural language processing</str>
  </arr>
  <arr name="attr_Last-Modified">
    <str>Thu Aug 12 10:58:37 EDT 2010</str>
  </arr>
  <arr name="attr_content">
    <str>Combining Semantic Relations and DNA Microarray Data for Novel
      Hypotheses Generation Combining Semantic Relations and DNA Microarray Data for
      Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
      Kastrin,2...
*
Please help me out if anyone has experience with pdf files. I really appreciate 
it!

Thanks so much,
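
For reference, one quick check is to query the extracted-content field
directly, since the default search field may not cover the dynamic attr_*
fields. A minimal SolrJ sketch (the search term is just an example taken from
the output above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class QueryPdfContent {
  public static void main(String[] args) throws Exception {
    SolrServer solr =
        new CommonsHttpSolrServer("http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf");

    // query the field the extracted text was mapped to, rather than the
    // default search field
    SolrQuery q = new SolrQuery("attr_content:microarray");
    System.out.println("hits: " + solr.query(q).getResults().getNumFound());
  }
}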



Re: index pdf files

2010-08-12 Thread Marco Martinez
To help you we need the description of your fields in your schema.xml and
the query that you use when you search for a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov

 I wrote a simple java program to import a pdf file. I can get a result when
 I do search *:* from admin page. I get nothing if I search a word. I wonder
 if I did something wrong or miss set something.

 Here is part of result I get when do *:* search:
 *
 - doc
 - arr name=attr_Author
  strHristovski D/str
  /arr
 - arr name=attr_Content-Type
  strapplication/pdf/str
  /arr
 - arr name=attr_Keywords
  strmicroarray analysis, literature-based discovery, semantic
 predications, natural language processing/str
  /arr
 - arr name=attr_Last-Modified
  strThu Aug 12 10:58:37 EDT 2010/str
  /arr
 - arr name=attr_content
  strCombining Semantic Relations and DNA Microarray Data for Novel
 Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
 for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
 Kastrin,2...
 *
 Please help me out if anyone has experience with pdf files. I really
 appreciate it!

 Thanks so much,




RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much. I didn't know how to make any changes in schema.xml for PDF 
files. I used the Solr default schema.xml. Please tell me what I need to do in 
schema.xml.

The simple Java program I use follows. I also attached the PDF file. I really 
appreciate your help!
*
import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class importPDF {
  public static void main(String[] args) {
    try {
      String fileName = "pub2009001.pdf";
      String solrId = "pub2009001.pdf";
      indexFilesSolrCell(fileName, solrId);
    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  public static void indexFilesSolrCell(String fileName, String solrId)
      throws IOException, SolrServerException {
    String urlString = "http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf";
    SolrServer solr = new CommonsHttpSolrServer(urlString);

    // send the file to the ExtractingRequestHandler (Solr Cell)
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File(fileName));

    up.setParam("literal.id", solrId);
    up.setParam("uprefix", "attr_");
    up.setParam("fmap.content", "attr_content");

    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solr.request(up);
  }
}


-Original Message-
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com] 
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and
the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov

 I wrote a simple java program to import a pdf file. I can get a result when
 I do search *:* from admin page. I get nothing if I search a word. I wonder
 if I did something wrong or miss set something.

 Here is part of result I get when do *:* search:
 *
 - doc
 - arr name=attr_Author
  strHristovski D/str
  /arr
 - arr name=attr_Content-Type
  strapplication/pdf/str
  /arr
 - arr name=attr_Keywords
  strmicroarray analysis, literature-based discovery, semantic
 predications, natural language processing/str
  /arr
 - arr name=attr_Last-Modified
  strThu Aug 12 10:58:37 EDT 2010/str
  /arr
 - arr name=attr_content
  strCombining Semantic Relations and DNA Microarray Data for Novel
 Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
 for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
 Kastrin,2...
 *
 Please help me out if anyone has experience with pdf files. I really
 appreciate it!

 Thanks so much,




how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread solr-user

please excuse this newbie question, but:  

I want to upgrade solr to a version but not to the latest version in the
trunk (because there are so many changes that I would have to test against,
and modify my custom classes for, and behavior changes, and deal with the
lucene index change, etc)

My thought was to try to look at versions that are post 903398 2010-01-26
20:21:09Z but pre the change in the lucene index.  Eventually picking up the
version that had the features I wanted but with as few other changes as
feasible.  I know I could probably apply a bunch of patches but some of the
patches seem to rely on other patches which rely on other patches which rely
on ...  It just seems easier to pick the version that has just the
features/patches I want.

I have no trouble seeing/using the trunk at
http://svn.apache.org/repos/asf/lucene/dev/trunk/ but it only seems to have
builds 984777 thru 984832

So where would I find significantly older builds (ie like the one I am
currently using - 903398)?

I tried using svn on repository
http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/ but get a
Repository moved permanently to
'/viewc/lucene/solr/branches/branch-1.5-dev/' message.

Any help would be great

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1113863.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
1) I assume you are doing batching interspersed with commits
2) Why do you need sentence-level Lucene docs?
3) Are your custom handlers/parsers a part of the Solr JVM? I would not be
surprised if you have a memory/connection leak there (or something is not
releasing a resource explicitly).

In general, we have NEVER had a problem in loading Solr.

On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 sorry -- i used the term documents too loosely!

 180k scientific articles with between 500-1000 sentences each
 and we index sentence-level index documents
 so i'm guessing about 100 million lucene index documents in total.

 an update on my progress:

 i used GC settings of:
 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
   -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
 -XX:CMSInitiatingOccupancyFraction=70

 which allowed the indexing process to run to 11.5k articles and
 for about 2hours before I got the same kind of hanging/unresponsive Solr
 with
 this as the tail of the solr logs:

 Before GC:
 Statistics for BinaryTreeDictionary:
 
 Total Free Space: 2416734
 Max   Chunk Size: 2412032
 Number of Blocks: 3
 Av.  Block  Size: 805578
 Tree  Height: 3
 5980.480: [ParNew: 1887488K-1887488K(1887488K), 0.193 secs]5980.480:
 [CMS

 I also saw (in jconsole) that the number of threads rose from the
 steady 32 used for the
 2 hours to 72 before Solr finally became unresponsive...

 i've got the following GC info params switched on (as many as i could
 find!):
 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
   -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
   -XX:PrintFLSStatistics=1

 with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
 million fairly small
 docs per hour!! this produced an index of about 40GB to give you an
 idea of index
 size...

 because i've already got the documents in solr native xml format
 i.e. one file per article each with adddoc.../doc
 i.e. posting each set of sentence docs per article in every LCF file post...
 this means that LCF can throw documents at Solr very fast and i think
 i'm
 breaking it GC-wise.

 i'm going to try adding in System.gc() calls to see if this runs ok
 (albeit slower)...
 otherwise i'm pretty much at a loss as to what could be causing this GC
 issue/
 solr hanging if it's not a GC issue...

 thanks :)

 bec

 On 12 August 2010 21:42, dc tech dctech1...@gmail.com wrote:
 I am a little confused - how did 180k documents become 100m index
 documents?
 We use have over 20 indices (for different content sets), one with 5m
 documents (about a couple of pages each) and another with 100k+ docs.
 We can index the 5m collection in a couple of days (limitation is in
 the source) which is 100k documents an hour without breaking a sweat.



 On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 Hi,

 When indexing large amounts of data I hit a problem whereby Solr
 becomes unresponsive
 and doesn't recover (even when left overnight!). I think i've hit some
 GC problems/tuning
 is required of GC and I wanted to know if anyone has ever hit this
 problem.
 I can replicate this error (albeit taking longer to do so) using
 Solr/Lucene analysers
 only so I thought other people might have hit this issue before over
 large data sets

 Background on my problem follows -- but I guess my main question is --
 can
 Solr
 become so overwhelmed by update posts that it becomes completely
 unresponsive??

 Right now I think the problem is that the java GC is hanging but I've
 been working
 on this all week and it took a while to figure out it might be
 GC-based / wasn't a
 direct result of my custom analysers so i'd appreciate any advice anyone
 has
 about indexing large document collections.

 I also have a second questions for those in the know -- do we have a
 chance
 of indexing/searching over our large dataset with what little hardware
 we already
 have available??

 thanks in advance :)

 bec

 a bit of background: ---

 I've got a large collection of articles we want to index/search over
 -- about 180k
 in total. Each article has say 500-1000 sentences and each sentence has
 about
 15 fields, many of which are multi-valued and we store most fields as
 well
 for
 display/highlighting purposes. So I'd guess over 100 million index
 documents.

 In our small test collection of 700 articles this results in a single
 index
 of
 about 13GB.

 Our pipeline processes PDF files through to Solr native xml which we call
 index.xml files i.e. in adddoc... format ready to post straight to
 Solr's
 update handler.

 We create the index.xml files as we pull in information from
 a few sources and creation of these files from their original PDF form is
 farmed out across a grid and is quite time-consuming so we distribute
 this
 process rather than creating index.xml files on the fly...

 We do a lot of linguistic processing 

Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread Yonik Seeley
Another option is the 3x branch - that should still be able to read
indexes from Solr 1.4/Lucene 2.9
I personally don't expect a 1.5 release to ever materialize.
There will eventually be a Lucene/Solr 3.1 release off of the 3x
branch, and a Lucene/Solr 4.0 release off of trunk.

-Yonik
http://www.lucidimagination.com

On Thu, Aug 12, 2010 at 11:59 AM, solr-user solr-u...@hotmail.com wrote:

 please excuse this newbie question, but:

 I want to upgrade solr to a version but not to the latest version in the
 trunk (because there are so many changes that I would have to test against,
 and modify my custom classes for, and behavior changes, and deal with the
 lucene index change, etc)

 My thought was to try to look at versions that are post 903398 2010-01-26
 20:21:09Z but pre the change in the lucene index.  Eventually picking up the
 version that had the features I wanted but with as few other changes as
 feasible.  I know I could probably apply a bunch of patches but some of the
 patches seem to rely on other patches which rely on other patches which rely
 on ...  It just seems easier to pick the version that has just the
 features/patches I want.

 I have no trouble seeing/using the trunk at
 http://svn.apache.org/repos/asf/lucene/dev/trunk/ but it only seems to have
 builds 984777 thru 984832

 So where would I find significantly older builds (ie like the one I am
 currently using - 903398)?

 I tried using svn on repository
 http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/ but get a
 Repository moved permanently to
 '/viewc/lucene/solr/branches/branch-1.5-dev/' message.

 Any help would be great

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1113863.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Doc Lucene Doc !?

2010-08-12 Thread stockii

no help ? =( 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1114172.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread solr-user

Thanks Yonik but
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/CHANGES.txt
says that the lucene index has changed


Upgrading from Solr 1.4
--

* The Lucene index format has changed and as a result, once you upgrade, 
  previous versions of Solr will no longer be able to read your indices.
  In a master/slave configuration, all searchers/slaves should be upgraded
  before the master.  If the master were to be updated first, the older
  searchers would not be able to read the new index format.

not to mention that regression testing is a pain 

Is there any way to get a set of builds with versions prior to 3.x??
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1114353.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Hanging during GC?

2010-08-12 Thread Rebecca Watson
hi,

 1) I assume you are doing batching interspersed with commits

as each file I crawl is article-level, each add contains all the sentences
for the article, so they are naturally batched into about 500 documents per
post in LCF.

I use auto-commit in Solr:
<autoCommit>
  <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
  <maxTime>900000</maxTime> <!-- every 15 minutes -->
</autoCommit>

 2) Why do you need sentence level Lucene docs?

that's an application specific need due to linguistic info needed on a
per-sentence basis.

 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be
 surprised if you a memory/connection leak their (or it is not
 releasing some resource explicitly)

I thought this could be the case too -- but if I replace the use of my custom
analysers and specify my fields are of type "text" instead (from the standard
schema.xml i.e. using solr-based analysers) then I get this kind of hanging
too -- at least it did when I didn't have any explicit GC settings... it does
take longer to replicate as my analysers/field types are more complex than
the "text" field type.

i will try it again with the different GC settings tomorrow and post
the results.

 In general, we have NEVER had a problem in loading Solr.

i'm not sure if we would either if we posted as we created the index.xml
format... but because we post 500+ documents at a time (one article file per
LCF post) and LCF can post these files quickly i'm not sure if I need to try
and slow down the post rate!?

thanks for your replies,

bec :)

 On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 sorry -- i used the term documents too loosely!

 180k scientific articles with between 500-1000 sentences each
 and we index sentence-level index documents
 so i'm guessing about 100 million lucene index documents in total.

 an update on my progress:

 i used GC settings of:
 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
 -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
 -XX:CMSInitiatingOccupancyFraction=70
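 For reference, a full launch command combining these flags with the GC
 logging flags listed further down might look like this (the heap size, the
 gc.log path and the Jetty start.jar are assumptions, not taken from the
 actual setup):

 java -Xms8g -Xmx8g \
      -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled \
      -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8 \
      -XX:CMSInitiatingOccupancyFraction=70 \
      -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
      -jar start.jar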

 which allowed the indexing process to run to 11.5k articles and for about
 2 hours before I got the same kind of hanging/unresponsive Solr with this
 as the tail of the solr logs:

 Before GC:
 Statistics for BinaryTreeDictionary:
 
 Total Free Space: 2416734
 Max   Chunk Size: 2412032
 Number of Blocks: 3
 Av.  Block  Size: 805578
 Tree      Height: 3
 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480:
 [CMS

 I also saw (in jconsole) that the number of threads rose from the steady 32
 used for the 2 hours to 72 before Solr finally became unresponsive...

 i've got the following GC info params switched on (as many as i could find!):
 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
 -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
 -XX:PrintFLSStatistics=1

 with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875 million
 fairly small docs per hour!! this produced an index of about 40GB to give
 you an idea of index size...

 because i've already got the documents in solr native xml format
 i.e. one file per article each with <add><doc>...</doc></add>
 i.e. posting each set of sentence docs per article in every LCF file post...
 this means that LCF can throw documents at Solr very fast and i think i'm
 breaking it GC-wise.

 i'm going to try adding in System.gc() calls to see if this runs ok (albeit
 slower)... otherwise i'm pretty much at a loss as to what could be causing
 this GC issue / solr hanging if it's not a GC issue...

 thanks :)

 bec

 On 12 August 2010 21:42, dc tech dctech1...@gmail.com wrote:
 I am a little confused - how did 180k documents become 100m index
 documents?
 We have over 20 indices (for different content sets), one with 5m
 documents (about a couple of pages each) and another with 100k+ docs.
 We can index the 5m collection in a couple of days (limitation is in
 the source) which is 100k documents an hour without breaking a sweat.



 On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 Hi,

 When indexing large amounts of data I hit a problem whereby Solr becomes
 unresponsive and doesn't recover (even when left overnight!). I think i've
 hit some GC problems/tuning is required of GC and I wanted to know if
 anyone has ever hit this problem. I can replicate this error (albeit taking
 longer to do so) using Solr/Lucene analysers only so I thought other people
 might have hit this issue before over large data sets

 Background on my problem follows -- but I guess my main question is -- can
 Solr become so overwhelmed by update posts that it becomes completely
 unresponsive??

 Right now I think the problem is that the java GC is hanging but I've been
 working on this all week and it took a while to figure out it might be
 GC-based / wasn't a direct result of my custom analysers so i'd appreciate
 any advice anyone has
 

Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread Yonik Seeley
On Thu, Aug 12, 2010 at 12:24 PM, solr-user solr-u...@hotmail.com wrote:
 Thanks Yonik but
 http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/CHANGES.txt
 says that the lucene index has changed

Right - but it will be able to read your older index.
Do you need Solr 1.4 to be able to read the new index once you upgrade?

-Yonik
http://www.lucidimagination.com


edismax pf2 and ps

2010-08-12 Thread Ron Mayer
Short summary:

   Is there any way I can specify that I want a lot
   of phrase slop for the pf parameter, but none
   at all for the pf2 parameter?

I find the 'pf' parameter with a pretty large 'ps' to do a very
nice job for providing a modest boost to many documents that are
quite well related to many queries in my system.

In contrast, I find the 'pf2' parameter with zero 'ps' does
extremely well at providing a high boost to documents that
are often exactly what someone's searching for.

Is there any way I can get both effects?

Edismax's pf2 parameter is really nice for boosting exact phrases
in queries like 'black jacket red cap white shoes'.   But as soon
as even a little phrase slop (ps) is added, it seems like it starts
boosting documents with red jackets and white caps just as much as
those with black jackets and red caps.

My gut feeling is that if I could have pf with a large phrase
slop and the pf2 with zero phrase slop, it'd give me better overall
results than any single phrase slop setting that gets applied to both.

Is there any good way for me to test that?

  Thanks,
  Ron



Re: how to update solr to older 1.5 builds instead of to trunk

2010-08-12 Thread solr-user

no, once upgraded I wouldn't need to have an older solr read the indexes. I
misunderstood the note.

thx
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-update-solr-to-older-1-5-builds-instead-of-to-trunk-tp1113863p1115694.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Does anyone know if I need to define fields in schema.xml for indexing pdf 
files? If I need to, please tell me how I can do it. 

I defined fields in schema.xml and created a data-configuration file by using 
xpath for xml files. Would you please tell me if I need to do the same for pdf 
files and how I can do it?

Thanks so much for your help as always!

-Original Message-
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com] 
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and
the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov

 I wrote a simple java program to import a pdf file. I can get a result when
 I do search *:* from admin page. I get nothing if I search a word. I wonder
 if I did something wrong or miss set something.

 Here is part of result I get when do *:* search:
 *
 <doc>
   <arr name="attr_Author">
     <str>Hristovski D</str>
   </arr>
   <arr name="attr_Content-Type">
     <str>application/pdf</str>
   </arr>
   <arr name="attr_Keywords">
     <str>microarray analysis, literature-based discovery, semantic
 predications, natural language processing</str>
   </arr>
   <arr name="attr_Last-Modified">
     <str>Thu Aug 12 10:58:37 EDT 2010</str>
   </arr>
   <arr name="attr_content">
     <str>Combining Semantic Relations and DNA Microarray Data for Novel
 Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
 for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
 Kastrin,2...
 *
 Please help me out if anyone has experience with pdf files. I really
 appreciate it!

 Thanks so much,




RE: Improve Query Time For Large Index

2010-08-12 Thread Burton-West, Tom
Hi Peter,

If hits aren't showing up, and you aren't getting any queryResultCache hits 
even with the exact query being repeated, something is very wrong.  I'd suggest 
first getting the query result cache working, and then moving on to look at 
other possible bottlenecks.  

What are your settings for queryResultWindowSize and queryResultMaxDocsCached?

Following up on Robert's point, you might also try to run a few queries in the 
admin interface with the debug flag on to see if the query parser is creating 
phrase queries (assuming you have queries like http://foo.bar.baz).  The 
debug/explain will indicate whether the parsed query is a PhraseQuery.

Tom



-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Thursday, August 12, 2010 5:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Improve Query Time For Large Index

Hi Tom,

I tried again with:
  <queryResultCache class="solr.LRUCache" size="1" initialSize="1"
autowarmCount="1"/>

and even now the hitratio is still 0. What could be wrong with my setup?

('free -m' shows that the cache has over 2 GB free.)

Regards,
Peter.

 Hi Peter,

 Can you give a few more examples of slow queries?  
 Are they phrase queries? Boolean queries? prefix or wildcard queries?
 If one word queries are your slow queries, than CommonGrams won't help.  
 CommonGrams will only help with phrase queries.

 How are you using termvectors?  That may be slowing things down.  I don't 
 have experience with termvectors, so someone else on the list might speak to 
 that.

 When you say the query time for common terms stays slow, do you mean if you 
 re-issue the exact query, the second query is not faster?  That seems very 
 strange.  You might restart Solr, and send a first query (the first query 
 always takes a relatively long time.)  Then pick one of your slow queries and 
 send it 2 times.  The second time you send the query it should be much faster 
 due to the Solr caches and you should be able to see the cache hit in the 
 Solr admin panel.  If you send the exact query a second time (without enough 
 intervening queries to evict data from the cache, ) the Solr queryResultCache 
 should get hit and you should see a response time in the .01-5 millisecond 
 range.

 What settings are you using for your Solr caches?

 How much memory is on the machine?  If your bottleneck is disk i/o for 
 frequent terms, then you want to make sure you have enough memory for the OS 
 disk cache.  

 I assume that http is not in your stopwords.  CommonGrams will only help with 
 phrase queries
 CommonGrams was committed and is in Solr 1.4.  If you decide to use 
 CommonGrams you definitely need to re-index and you also need to use both the 
 index time filter and the query time filter.  Your index will be larger.

 <fieldType name="foo" ...>
   <analyzer type="index">
     <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
   </analyzer>
   <analyzer type="query">
     <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
   </analyzer>
 </fieldType>



 Tom
 -Original Message-
 From: Peter Karich [mailto:peat...@yahoo.de] 
 Sent: Tuesday, August 10, 2010 3:32 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Improve Query Time For Large Index

 Hi Tom,

 my index is around 3GB and I am using 2GB RAM for the JVM although
 some more is available.
 If I am looking into the RAM usage while a slow query runs (via
 jvisualvm) I see that only 750MB of the JVM RAM is used.

   
 Can you give us some examples of the slow queries?
 
 for example the empty query solr/select?q=
 takes very long or solr/select?q=http
 where 'http' is the most common term

   
 Are you using stop words?  
 
 yes, a lot. I stored them into stopwords.txt

   
 http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
 
 this looks interesting. I read through
 https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4.
 I only need to enable it via:

 <filter class="solr.CommonGramsFilterFactory" ignoreCase="true"
 words="stopwords.txt"/>

 right? Do I need to reindex?

 Regards,
 Peter.

   
 Hi Peter,

 A few more details about your setup would help list members to answer your 
 questions.
 How large is your index?  
 How much memory is on the machine and how much is allocated to the JVM?
 Besides the Solr caches, Solr and Lucene depend on the operating system's 
 disk caching for caching of postings lists.  So you need to leave some 
 memory for the OS.  On the other hand if you are optimizing and refreshing 
 every 10-15 minutes, that will invalidate all the caches, since an optimized 
 index is essentially a set of new files.

 Can you give us some examples of the slow queries?  Are you using stop 
 words?  

 If your slow queries are phrase queries, then you might try either adding 
 the most frequent terms in your index to the stopwords list  or try 
 CommonGrams and add them to the common words list.  (Details on 

Re: index pdf files

2010-08-12 Thread Stefan Moises
Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2


Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:

Does anyone know if I need define fields in schema.xml for indexing pdf files? 
If I need, please tell me how I can do it.

I defined fields in schema.xml and created data-configuration file by using 
xpath for xml files. Would you please tell me if I need do it for pdf files and 
how I can do?

Thanks so much for your help as always!

-Original Message-
From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
Sent: Thursday, August 12, 2010 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and
the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]xiao...@mail.nlm.nih.gov

   

I wrote a simple java program to import a pdf file. I can get a result when
I do search *:* from admin page. I get nothing if I search a word. I wonder
if I did something wrong or miss set something.

Here is part of result I get when do *:* search:
*
<doc>
  <arr name="attr_Author">
    <str>Hristovski D</str>
  </arr>
  <arr name="attr_Content-Type">
    <str>application/pdf</str>
  </arr>
  <arr name="attr_Keywords">
    <str>microarray analysis, literature-based discovery, semantic
predications, natural language processing</str>
  </arr>
  <arr name="attr_Last-Modified">
    <str>Thu Aug 12 10:58:37 EDT 2010</str>
  </arr>
  <arr name="attr_content">
    <str>Combining Semantic Relations and DNA Microarray Data for Novel
Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
Kastrin,2...
*
Please help me out if anyone has experience with pdf files. I really
appreciate it!

Thanks so much,


 
   


--
***
Stefan Moises
Senior Softwareentwickler

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***



Re: Solr Doc Lucene Doc !?

2010-08-12 Thread kenf_nc

Are you just trying to learn the tiny details of how Solr and DIH work? Is
this just an intellectual curiosity? Or are you having some specific problem
that you are trying to solve? If you have a problem, could you describe the
symptoms of the problem? I am using Solr, DIH, and several other related
technologies and have never needed to know the difference between a
SolrDocument and a LuceneDocument or how the UpdateHandler chains. So I'm
curious about what your ultimate goal is with these questions.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1117472.html
Sent from the Solr - User mailing list archive at Nabble.com.


Results from More then One Cors?

2010-08-12 Thread Jörg Agatz
Hello Users...

I tried to get results from more than one core..
But i don't know how..

Maybe you have an idea..

I need it in PHP

King


RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help! I defined a dynamic field in schema.xml as 
follows:
<dynamicField name="metadata_*" type="string" indexed="true" stored="true" 
multiValued="false"/>

But I wonder what I should put for <uniqueKey></uniqueKey>.

I really appreciate your help!

-Original Message-
From: Stefan Moises [mailto:moi...@shoptimax.de] 
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:
 Does anyone know if I need define fields in schema.xml for indexing pdf 
 files? If I need, please tell me how I can do it.

 I defined fields in schema.xml and created data-configuration file by using 
 xpath for xml files. Would you please tell me if I need do it for pdf files 
 and how I can do?

 Thanks so much for your help as always!

 -Original Message-
 From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
 Sent: Thursday, August 12, 2010 11:45 AM
 To: solr-user@lucene.apache.org
 Subject: Re: index pdf files

 To help you we need the description of your fields in your schema.xml and
 the query that you do when you search only a single word.

 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42


 2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]xiao...@mail.nlm.nih.gov


 I wrote a simple java program to import a pdf file. I can get a result when
 I do search *:* from admin page. I get nothing if I search a word. I wonder
 if I did something wrong or miss set something.

 Here is part of result I get when do *:* search:
 *
 <doc>
   <arr name="attr_Author">
     <str>Hristovski D</str>
   </arr>
   <arr name="attr_Content-Type">
     <str>application/pdf</str>
   </arr>
   <arr name="attr_Keywords">
     <str>microarray analysis, literature-based discovery, semantic
 predications, natural language processing</str>
   </arr>
   <arr name="attr_Last-Modified">
     <str>Thu Aug 12 10:58:37 EDT 2010</str>
   </arr>
   <arr name="attr_content">
     <str>Combining Semantic Relations and DNA Microarray Data for Novel
 Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
 for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
 Kastrin,2...
 *
 Please help me out if anyone has experience with pdf files. I really
 appreciate it!

 Thanks so much,


  


-- 
***
Stefan Moises
Senior Softwareentwickler

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***



Re: Solr Doc Lucene Doc !?

2010-08-12 Thread stockii

i'm writing a little thesis about this, and i need to know how solr is using
lucene - in which way, for example when using dih and searching. so, for my
better understanding ..  ;-)


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1118089.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much. I got it working now. I really appreciate your help!
Xiaohui 

-Original Message-
From: Stefan Moises [mailto:moi...@shoptimax.de] 
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:
 Does anyone know if I need define fields in schema.xml for indexing pdf 
 files? If I need, please tell me how I can do it.

 I defined fields in schema.xml and created data-configuration file by using 
 xpath for xml files. Would you please tell me if I need do it for pdf files 
 and how I can do?

 Thanks so much for your help as always!

 -Original Message-
 From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
 Sent: Thursday, August 12, 2010 11:45 AM
 To: solr-user@lucene.apache.org
 Subject: Re: index pdf files

 To help you we need the description of your fields in your schema.xml and
 the query that you do when you search only a single word.

 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42


 2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]xiao...@mail.nlm.nih.gov


 I wrote a simple java program to import a pdf file. I can get a result when
 I do search *:* from admin page. I get nothing if I search a word. I wonder
 if I did something wrong or miss set something.

 Here is part of result I get when do *:* search:
 *
 <doc>
   <arr name="attr_Author">
     <str>Hristovski D</str>
   </arr>
   <arr name="attr_Content-Type">
     <str>application/pdf</str>
   </arr>
   <arr name="attr_Keywords">
     <str>microarray analysis, literature-based discovery, semantic
 predications, natural language processing</str>
   </arr>
   <arr name="attr_Last-Modified">
     <str>Thu Aug 12 10:58:37 EDT 2010</str>
   </arr>
   <arr name="attr_content">
     <str>Combining Semantic Relations and DNA Microarray Data for Novel
 Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
 for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
 Kastrin,2...
 *
 Please help me out if anyone has experience with pdf files. I really
 appreciate it!

 Thanks so much,


  


-- 
***
Stefan Moises
Senior Softwareentwickler

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax:  0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***



possible bug in sorting by Function?

2010-08-12 Thread solr-user

I was looking at the ability to sort by Function that was added to solr.

For the most part it seems to work.  However solr doesn't seem to like to
sort by certain functions. 

For example, this sum works:

http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(1,Latitude,Longitude,sum(Latitude,Longitude)) asc

but this hsin doesn't work:

http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(3959,rad(47.544594),rad(-122.38723),rad(Latitude),rad(Longitude))

and gives me a "Must declare sort field or function" error, pointing to a
line in QueryParsing.java.

Note that I did apply the SOLR-1297-2.patch supplied by Koji Sekiguchi but
it didn't seem to help.

I am using solr 903398 2010-01-26 20:21:09Z.

Any suggestions appreciated.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1118235.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: possible bug in sorting by Function?

2010-08-12 Thread solr-user

small typo in last email:  second sum should have been hsin, but I notice
that the problem also occurs when I leave it as sum

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1118260.html
Sent from the Solr - User mailing list archive at Nabble.com.


Field getting tokenized prior to charFilter on select query

2010-08-12 Thread Andrew Chalupa

I'm attempting to make use of PatternReplaceCharFilterFactory, but am running 
into issues on both 1.4.1 ( I ported it) and on nightly (4.0-2010-07-27).  It 
seems that on a real query the charFilter isn't executed prior to the 
tokenizer. 

I modified the example configuration included in the distribution with the 
following fieldType in schema.xml and mapped a new field to it. 
    <!-- Field definition for name text field -->
    <fieldtype name="nameText" class="solr.TextField">
      <analyzer>
        <!-- Replace (char & char) or (char and char) with (char&char) -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(.*?)(\b(\w) (&amp;|and) (\w))(.*?)"
            replacement="$1$3&amp;$5$6"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
    </fieldtype>

    <field name="name" type="nameText" indexed="true" stored="true"
        required="false" omitNorms="true" />
    
I validated that the regex works properly outside of Solr using just Java.  The 
regex attempts to normalize single word characters around an '&' into something 
consistent for searching.  For example, it will turn "A & B Company" into "A&B 
Company".  The user can then search on "A&B", "A and B", or "A & B" and the 
proper result will be located.

However, when I import a document with "A & B Company" I can't ever locate it 
with an "A & B" query.  It can be located with an "A&B" query.  When I run 
analysis.jsp it works properly and it will match using any of the combinations.

So from this I concluded that it was being indexed properly, but for some 
reason the query wasn't applying the regex properly.  I hooked up a debugger 
and could see a difference in how the analyzer was applying the charFilter and 
how the query was applying the charFilter.  When the analyzer invoked 
PatternReplaceCharFilterFactory.create(CharStream) the entire field was 
provided in a single call.  When the query invoked 
PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times with 3 
separate tokens (A, &, B).  Because of this the regex won't ever locate the 
full string in the field.

I'm using the following encoded URL to perform the query.  
This works
http://localhost:8983/solr/select?q=name:%28a%26b%29

But this doesn't
http://localhost:8983/solr/select?q=name:%28a+%26+b%29

Why is the query parser tokenizing the name field prior to the charFilter 
getting a chance to perform processing? 

XSL import/include relative to app server home directory...

2010-08-12 Thread Brian Johnson
Hello,

I'm customizing my XML response with the XSLTResponseWriter using
wt=xslt&tr=transform.xsl. Because I have a few use-cases to support, I
wanted to break up the common bits and import/include them from multiple top
level xslt files, but it appears that the base directory of the transform is
the directory the application was launched in.

Inside my transform.xsl I have this, for example

<xsl:import href="common/image-links.xsl"/>


which results in stack traces such as (copied only the relevant bits).

Caused by: java.io.IOException: Unable to initialize Templates 'transform.xsl'

Caused by: javax.xml.transform.TransformerException: Had IO Exception
with stylesheet file: common/image-links.xsl
Caused by: java.io.FileNotFoundException:
C:\dev\jboss-5.1.0.GA\bin\common\image-links.xsl

This appears to be caused by a lack of provided systemId on the StreamSource
of the xslt document I've requested. I've copied the relevant lines that I
believe are the root cause of the problem here for reference.

TransformFactory.getTemplates():line 105-6

final InputStream xsltStream = loader.openResource("xslt/" + filename);
result = tFactory.newTemplates(new StreamSource(xsltStream));


The loader variable is an instance of solr's ResourceLoader which has no
ability to provide the systemId to set on StreamSource to make relative
references work in the xslt. It seems like we need something along the lines
of

String systemId = loader.getResourceURL().toString() + "xslt/";
result = tFactory.newTemplates(new StreamSource(xsltStream, systemId));


I looked for a bug/patch and didn't see anything. Please let me know, if I
missed the patch or if there is another way to solve this problem (aside
from not using xsl:include or xsl:import).

Thanks in advance,

Brian

For reference...
http://onjava.com/pub/a/onjava/excerpt/java_xslt_ch5/index.html?page=5
https://jira.springframework.org/secure/attachment/10163/AbstractXsltView.patch
(similar
bug that was in spring)


Require some advice

2010-08-12 Thread Pavan Gupta
Hi,
I am new to text search and mining and have been doing research on
different available products. My application requires reading an SMS message
(unstructured) and finding out entities such as person name, area, zip,
city and skills associated with the person. The SMS would be in the form of
free text. The parsed data would be stored in a database and used by Solr to
display results.
An SMS message could be in the following form:
John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard
We need to interpret in the following manner:
first name - John
last name - Mayer
city- Mumbai
zip - 411004
area-Juhu
skills - car driver, body guard


1. Is Solr capable enough to handle this application considering that SMS
message would be unstructured.
2. How is Solr/Lucene as compared to other tools such as UIMA, GATE, CER
(stanford university), Lingpipe?
3. Is Solr only text search or can be used for information extraction?
4. Is it recommended to use Solr with other products such as UIMA and GATE.

There are companies that are specialized in making meaning out of
unstructured SMS messages. Do we have something similar in open source
world? Can we extend Solr for the same purpose?

Your reply would be appreciated.
Thanking you.
Regards,
Pavan


RE: index pdf files

2010-08-12 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
I got the following error when I index some pdf files. I wonder if anyone has 
had this issue before and how to fix it. Thanks so much in advance!

***
html
head
meta http-equiv=Content-Type content=text/html; charset=ISO-8859-1/
titleError 500 /title
/head
bodyh2HTTP ERROR: 500/h2preorg.apache.tika.exception.TikaException: 
Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.pdf.pdfpar...@44ffb2
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
***

-Original Message-
From: Stefan Moises [mailto:moi...@shoptimax.de] 
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps: 
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]:
 Does anyone know if I need define fields in schema.xml for indexing pdf 
 files? If I need, please tell me how I can do it.

 I defined fields in schema.xml and created data-configuration file by using 
 xpath for xml files. Would you please tell me if I need do it for pdf files 
 and how I can do?

 Thanks so much for your help as always!

 -Original Message-
 From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
 Sent: Thursday, August 12, 2010 11:45 AM
 To: solr-user@lucene.apache.org
 Subject: Re: index pdf files

 To help you we need the description of your fields in your schema.xml and
 the query that you do when you search only a single word.

 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42


 2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]xiao...@mail.nlm.nih.gov


 I wrote a simple java program to import a pdf file. I can get a result when
 I do search *:* from admin page. I get nothing if I search a word. I wonder
 if I did something wrong or miss set something.

 Here is part of result I get when do *:* search:
 *
 <doc>
   <arr name="attr_Author">
     <str>Hristovski D</str>
   </arr>
   <arr name="attr_Content-Type">
     <str>application/pdf</str>
   </arr>
   <arr name="attr_Keywords">
     <str>microarray analysis, literature-based discovery, semantic
 predications, natural language processing</str>
   </arr>
   <arr name="attr_Last-Modified">
     <str>Thu Aug 12 10:58:37 EDT 2010</str>
   </arr>
   <arr name="attr_content">
   

Free Webinar: Findability: Designing the Search Experience

2010-08-12 Thread Erik Hatcher

Here's perhaps the coolest webinar we've done to date, IMO :)

I attended Tyler's presentation at Lucene EuroCon* and thoroughly  
enjoyed it.  Search UI/UX is a fascinating topic to me, and really  
important to do well for the applications most of us are building.


I'm pleased to pass along the blurb below.  See you there!

Erik

* http://lucene-eurocon.org/sessions-track2-day2.html#3



Lucid Imagination presents a free webinar
Wednesday, August 18, 2010 10:00 AM PST / 1:00 PM EST / 19:00 CET
Sign up at http://www.eventsvc.com/lucidimagination/081810?trk=ap

You don't need billions of dollars or users to build a user-friendly  
search application. In fact, studies of how and why people search have  
revealed a set of principles that can  result in happy users who find  
what they're seeking with as little friction as possible -- and help  
you build a better, more successful search application.


Join special guest Tyler Tate, user experience designer at UK-based  
TwigKit Search, for a high-level discussion of key user interface  
strategies for search that can be leveraged with Lucene and Solr. The  
presentation covers:

* Ten things to know about designing the search experience
* When to assume users know what they’re looking for – and when not to
* Navigation/discovery techniques, such as faceted navigation, tag  
clouds, histograms and more
* Practical considerations in leveraging suggestions into search  
interactions


Lucid Imagination presents a free webinar
Wednesday, August 18, 2010 10:00 AM PST / 1:00 PM EST / 19:00 CET
Sign up at http://www.eventsvc.com/lucidimagination/081810?trk=ap

About the presenter: Tyler Tate is co-founder of TwigKit, a UK-based  
company focused on building truly usable interfaces for search. Tyler  
has led user experience design for enterprise applications from CMS to  
CRM, and is the creator of the popular 1KB CSS Grid. Tyler also  
organizes a monthly Enterprise Search Meetup in London, and blogs at  
blog.twigkit.com.


-
Join the Revolution!
Don't miss Lucene Revolution
Lucene  Solr User Conference
Boston | October 7-8 2010
http://lucenerevolution.org
-

This webinar is sponsored by Lucid Imagination, the commercial entity  
exclusively dedicated to Apache Lucene/Solr open source search  
technology. Our solutions can help you develop and deploy search  
solutions with confidence: SLA-based support subscriptions,  
professional training, best practices consulting, along with value-add  
software and free documentation and certified distributions of  
Lucene and Solr.


Apache Lucene and Apache Solr are trademarks of the Apache  
Software Foundation.

Re: possible bug in sorting by Function?

2010-08-12 Thread solr-user

problem could be related to some oddity in sum()??  some more examples:

note: Latitude and Longitude are fields of type=double

works:
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(1,1.0))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(Latitude,Latitude)%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(rad(Latitude))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1.0))%20asc

fails:
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1),sum(Latitude,1))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1.0),sum(Latitude,1.0))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(rad(Latitude),rad(Latitude))%20asc

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1120017.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Require some advice

2010-08-12 Thread Michael Griffiths
Solr is a search engine, not an entity extraction tool. 

While there are some decent open source entity extraction tools, they are 
focused on processing sentences and paragraphs. The structural differences in 
text messages means you'd need to do a fair amount of work to get decent entity 
extraction.

That said, you may want to look into simple word/phrase matching if your domain 
is sufficiently small. Use RegEx to extract ZIP, use dictionaries to extract 
city/area, skills, and names. Much simpler and cheaper. 
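
As a rough illustration of the RegEx part of that suggestion (the 6-digit PIN 
pattern and the sample message are assumptions, reusing the example from the 
original question):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZipExtractor {
    // Matches a standalone 6-digit Indian PIN code such as 411004
    private static final Pattern ZIP = Pattern.compile("\\b\\d{6}\\b");

    public static void main(String[] args) {
        String sms = "John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard";
        Matcher m = ZIP.matcher(sms);
        if (m.find()) {
            System.out.println("zip = " + m.group());   // prints: zip = 411004
        }
    }
}

Dictionary lookups for city/area/skills could be applied to the remaining 
tokens the same way, before the structured record is handed to Solr for 
indexing.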

-Original Message-
From: Pavan Gupta [mailto:pavan@gmail.com] 
Sent: Thursday, August 12, 2010 2:58 PM
To: solr-user@lucene.apache.org
Subject: Require some advice

Hi,
I am new to text search and mining and have been doing research for different 
available products. My application requires reading a SMS message
(unstructured) and finding out entities such as person name, area, zip , city 
and skills associated with the person. SMS would be in form of free text. The 
parsed data would be stored in database and used by Solr to display results.
A SMS message could in the following form:
John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard
We need to interpret in the following manner:
first name - John
last name - Mayer
city- Mumbai
zip - 411004
area-Juhu
skills - car driver, body guard


1. Is Solr capable enough to handle this application considering that SMS 
message would be unstructured.
2. How is Solr/Lucene as compared to other tools such as UIMA, GATE, CER 
(stanford university), Lingpipe?
3. Is Solr only text search or can be used for information extraction?
4. Is it recommended to use Solr with other products such as UIMA and GATE.

There are companies that are specialized in making meaning out of unstructured 
SMS messages. Do we have something similar in open source world? Can we extend 
Solr for the same purpose?

You reply would be appreciated.
Thanking you.
Regards,
Pavan


SOLR-788 - disributed More Like This

2010-08-12 Thread Shawn Heisey
 I tried some time ago to use SOLR-788.  Ultimately I was able to get 
both patch versions to apply (separately), but neither worked.  The 
suggestion I received when I commented on the issue was to download the 
specific release mentioned in the patch and then update, but the patch 
was created before the merge with Lucene, so I have no idea how to go 
about that.


Without a much better understanding of Solr internals and a bunch more 
time to learn Java, there's no way that I can work on it myself.  Is 
there anyone who has the time and inclination to get distributed MLT 
working with branch_3x?  A further goal would be to have it actually 
committed before release.


Thanks,
Shawn



Re: possible bug in sorting by Function?

2010-08-12 Thread solr-user

issue resolve.  problem was that solr.war was silently not being overwritten
by new version.

will try to spend more time debugging before posting.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-bug-in-sorting-by-Function-tp1118235p1121349.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: General questions about distributed solr shards

2010-08-12 Thread Shawn Heisey

 On 8/11/2010 3:27 PM, JohnRodey wrote:

1) Is there any information on preferred maximum sizes for a single solr
index.  I've read some people say 10 million, some say 80 million, etc...
Is there any official recommendation or has anyone experimented with large
datasets into the tens of billions?

2) Is there any down side to running multiple solr shard instances on a
single machine rather than one shard instance with a larger index per
machine?  I would think that having 5 instances with 1/5 the index would
return results approx 5 times faster.

3) Say you have a solr configuration with multiple shards.  If you attempt
to query while one of the shards is down you will receive a HTTP 500 on the
client due to a connection refused on the server.  Is there a way to tell
the server to ignore this and return as many results as possible?  In other
words if you have 100 shards, it is possible that occasionally a process may
die, but I would still like to return results from the active shards.


1) It highly depends on what's in your index.  I'll let someone more 
qualified address this question in more detail.


2) Distributed search adds overhead.  It has to query the individual 
shards, send additional requests to gather the matching records, and 
then assemble the results.  If you create enough shards that you can fit 
all (or most) of each index in whatever RAM is left for the OS disk 
cache, you'll see a VERY significant boost in search speed by using 
shards.  If


3) There are a couple of patches that address this, but in the end, 
you'll be better served by setting up a replicated pair and using a load 
balancer.  I've got a distributed index with two machines per shard, the 
master and the slave.  The load balancer checks the ping status URL 
every 5 seconds to see whether each machine is up.  If one goes down, it 
is removed from the load balancer and everything keeps working.
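
For reference, querying such a setup is just the normal select URL plus a 
shards parameter listing host:port/core for each shard; the host names below 
are assumptions:

http://shard1:8983/solr/select?q=foo&shards=shard1:8983/solr,shard2:8983/solr,shard3:8983/solr

Each shards entry can point at the load-balancer address for that shard's 
master/slave pair rather than a single machine, so a dead box never shows up 
in the list.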


Each of my shards is about 12.5GB in size and the VMs that access the 
data have 9GB total RAM.  I wish I had more memory!




Re: clustering component

2010-08-12 Thread Matt Mitchell
Hey thanks Stanislaw! I'm going to try this against the current trunk
tonight and see what happens.

Matt

On Wed, Jul 28, 2010 at 8:41 AM, Stanislaw Osinski 
stanislaw.osin...@carrotsearch.com wrote:

  The patch should also work with trunk, but I haven't verified it yet.
 

 I've just added a patch against solr trunk to
 https://issues.apache.org/jira/browse/SOLR-1804.

 S.



Hierarchical faceting

2010-08-12 Thread Mats Bolstad
Hey all,

I am doing a search on hierarchical data, and I have a hard time
getting my head around the following problem.

I want a result as follows, in one single query only:

USA (3)
 California (2)
 Arizona (1)
Europe (4)
 Norway (3)
 Oslo (3)
 Sweden (1)

How it looks in the XML/JSON response is not really important, this is
more a presentation issue. I guess I could store the values USA,
USA/California, Europe/Norway/Oslo as strings for each document,
and do some JavaScript-ing to show the hierarchies appropriately. When
a specific item in the facet is selected, for example Norway, Solr
could be queried with a filter query on Europe/Norway*?
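
A minimal sketch of that approach, without any patch (the field name 
location_path, the multiValued string field and the depth-prefix convention 
are assumptions, not something Solr prescribes):

  <field name="location_path">0/Europe</field>
  <field name="location_path">1/Europe/Norway</field>
  <field name="location_path">2/Europe/Norway/Oslo</field>

With location_path declared as a multiValued string field, facet.prefix gives 
the counts for one level at a time (so the full tree takes one request per 
expanded level rather than one single query), and drilling into Norway is a 
plain filter query:

  ...&facet=true&facet.field=location_path&facet.prefix=0/
  ...&facet=true&facet.field=location_path&facet.prefix=1/Europe/
  ...&fq=location_path:"1/Europe/Norway"

The client strips the numeric depth prefix before displaying the labels.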

Does anyone have some experiences they could please share with me?

I have tried out SOLR-64, and it gives me the results I look for.
However, I do not have the opportunity to use a patch in the
production environment ...

--
Thanks,
Mats Bolstad


Re: Phrase search

2010-08-12 Thread Chris Hostetter

: I'm trying to match Apple 2 but not Apple2 using phrase search, this is 
why I have it quoted.

:  I was under the impression --when I use phrase search-- all the 
: analyzer magic would not apply, but it is!!!  Otherwise, how would I 
: search for a phrase?!

well .. yes ... even with phrase searches your query is analyzed.

the only difference is that with a quoted phrase search, the entire phrase 
is analyzed at one time -- when the input isn't quoted, the whitespace is 
evaluated by the QueryParser as markup just like quotes and +/-, 
etc... (unless it's escaped) and the individual words are analyzed 
independently.

: Using Google, when I search for Windows 7 (with quotes), unlike Solr, 
: I don't get hits on Window7.  I want to use catenateNumbers=1 which 
: I want it to take effect on other searches but no phrase searches.  Is 
: this possible ?

you need to elaborate more on what you do and don't want to match -- so 
far you've given one example of a query you want to execute, and a 
document you *don't* want to match that query, but not an example of what 
types of documents you *do* want to match that query -- you also haven't 
given examples of queries that you *do* want that example document to 
match.

i suspect that catenateNumbers=1 isn't actually your problem ... it 
sounds like you don't actually want WordDelimiterFilter doing the split 
at index time at all.

Forget the phrase queries for a second: the question to ask yourself is: 
when you index a document containing Windows7 do you want a search for 
the word Windows to match that document?

If the answer is no then you probably don't want WordDelimiterFilter at 
all.
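
For comparison, a minimal index-time analyzer sketch that keeps Windows7 as a 
single token (the fieldtype name and the rest of the chain are assumptions, 
not taken from the schema being discussed):

<fieldtype name="text_nosplit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- no WordDelimiterFilterFactory, so Windows7 is never split into Windows + 7 -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

whereas the stock example "text" type runs WordDelimiterFilterFactory with 
generateWordParts="1" generateNumberParts="1" catenateWords="1" 
catenateNumbers="1", which is what makes a search for Windows (or the phrase 
"Windows 7") match a document that only contains Windows7.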



-Hoss



Re: Solr query result cache size and expire property

2010-08-12 Thread Chris Hostetter

: please help - how can I calculate queryresultcache size (how much RAM should
: be dedicated for that). I have 1,5 index size, 4 mio docs.
: QueryResultWindowSize is 20.
: Could I use expire property on the documents in this cache?

There is no expire property, items are automatically removed from the 
cache if the cache gets full, and the entire cache is thrown out when a 
new searcher is loaded (that's the only time it would make sense to 
expire anything)

honestly: trial and error is typically the best bet for sizing your 
queryResultCache ... the size of your index is much less relevant than 
the types of queries you get.  If you typically only have 200 unique 
queries over and over again, and no one ever does any other queries, then 
any number above 200 is going to be essentially the same.

if you have 200 queries that get hit a *lot* and 100 other queries that 
get hit once or twice ever ... then something ~250 is probably a good idea 
... any more is probably just a waste of RAM, any less is probably a waste 
of CPU.
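
In solrconfig.xml that ~250 guideline would look something like this (the 
autowarmCount value is an assumption):

<queryResultCache class="solr.LRUCache"
                  size="250"
                  initialSize="250"
                  autowarmCount="50"/>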



-Hoss



Re: How to extend the BinaryResponseWriter imposed by Solrj

2010-08-12 Thread Chris Hostetter

: I'm trying to extend the writer used by solrj
: (org.apache.solr.response.BinaryResponseWriter), i have declared it in
...
: I see that it is initialized, but when i try to set the 'wt' param to
: 'myWriter'
: 
: solrQuery.setParam(wt,myWriter), nothing happen, it's still using the
: 'javabin' writer.

I'm not certain, but i don't think SolrJ respects a wt param set by the 
caller .. i think the ResponseParser dictates what wt param is sent to the 
server -- that's why javabin is the default and calling 
server.setParser(new XMLResponseParser()) causes XML to be sent by the 
server (even if you don't set wt=xml in your SolrParams)

If you've customized the BinaryResponseWriter then presumably you've had 
to write a custom ResponseParser as well, correct? (otherwise how would 
you take advantage of your customizations to the format) ... so take a 
look at the existing ResponseParsers to see how they force the wt param 
and do the same thing in your custom ResponseParser.
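
A rough sketch of that idea (the class name MyWriterResponseParser is an 
assumption, and extending XMLResponseParser only makes sense if the custom 
format is still parseable the same way as the parent parser's format):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;

public class MyWriterResponseParser extends XMLResponseParser {
    @Override
    public String getWriterType() {
        // this is the value SolrJ will send as the wt parameter
        return "myWriter";
    }
}

// client side:
// CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
// server.setParser(new MyWriterResponseParser());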

(Note: this is all mostly speculation on my part)

-Hoss



can searcher.getReader().getFieldNames() return only stored fields?

2010-08-12 Thread Gerald

Collection<String> myFL =
searcher.getReader().getFieldNames(IndexReader.FieldOption.ALL);

will return all fields in the schema (i.e. indexed, stored, and
indexed+stored).

Collection<String> myFL =
searcher.getReader().getFieldNames(IndexReader.FieldOption.INDEXED);

likely returns all fields that are indexed (I haven't tried).

however, both of these can/will return fields that are not stored.  is there
a parameter that I can use to only return fields that are stored?

there does not seem to be an IndexReader.FieldOption.STORED and I can't tell if
any of the others might work
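
One workaround sketch, since the stored flag isn't exposed through FieldOption 
at this API level (assumes Lucene 3.x and that scanning a sample of documents 
is acceptable -- it only finds stored fields that actually occur in the 
scanned docs):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;

...
Set<String> storedFields = new HashSet<String>();
IndexReader reader = searcher.getReader();
for (int i = 0; i < reader.maxDoc() && i < 1000; i++) {   // sample the first 1000 docs
    if (reader.isDeleted(i)) continue;
    for (Fieldable f : reader.document(i).getFields()) {  // document() returns stored fields only
        storedFields.add(f.name());
    }
}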

any info helpful. thx
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/can-searcher-getReader-getFieldNames-return-only-stored-fields-tp1124178p1124178.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Query

2010-08-12 Thread Manali Joshi
Thanks Alexey. That solved the issue. I am now able to get all images
information in the index.

On Thu, Aug 12, 2010 at 12:47 AM, Alexey Serba ase...@gmail.com wrote:

 Try to define image solr fields - db columns mapping explicitly in
 image entity, i.e.

 <entity name="image" query="select filename, filepath, type from
 images where story_id='${story.story_id}'">
   <field column="filename" name="filename" />
   <field column="filepath" name="filepath" />
   <field column="type" name="type" />
 </entity>

 See
 http://www.lucidimagination.com/search/document/c8f2ed065ee75651/dih_and_multivariable_fields_problems

 On Thu, Aug 12, 2010 at 2:30 AM, Manali Joshi joshi.man...@gmail.com
 wrote:
  I tried making the schema fields that get the image data to
  multiValued=true. But it still gets only the first image data. It
 doesn't
  have information about all the images.
 
 
 
 
  On Wed, Aug 11, 2010 at 1:15 PM, kenf_nc ken.fos...@realestate.com
 wrote:
 
 
  It may not be the data config. Do you have the fields in the schema.xml
  that
  the image data is going to set to be multiValued=true?
 
  Although, I would think the last image would be stored, not the first,
 but
  haven't really tested this.
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Data-Import-Handler-Query-tp1092010p1092917.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Re: Duplicate a core

2010-08-12 Thread Chris Hostetter

: Is it possible to duplicate a core?  I want to have one core contain only
: documents within a certain date range (ex: 3 days old), and one core with
: all documents that have ever been in the first core.  The small core is then
: replicated to other servers which do real-time processing on it, but the
: archive core exists for longer term searching.

It's not something i've ever dealt with, but if i were going to pursue it 
i would investigate whether this works...

1) have three+ solr instances: master, archive and one or more query 
   machines
2) index everything to core named recent on server master
3) configure the query machines to replicate recent from master
4) configure the archive machine to replicate recent from master
5) configure the archive machine to also have an all core
6) on some timed bases:
   - delete docs from recent on master that are *older* than X
   - delete docs from recent on archive that are *newer* than X
   - use the index merge command on archive to merge the recent 
 core into the all core


...i'm pretty sure that merge command will require that you shut down both 
cores on archive during the merge, but that's a good idea anyway.
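
For step 6, the merge itself could be the CoreAdmin mergeindexes command, 
something along these lines (the host, core names and index path are 
assumptions):

curl "http://archive:8983/solr/admin/cores?action=mergeindexes&core=all&indexDir=/var/solr/recent/data/index"

where core is the target core and indexDir points at the physical index 
directory of the recent core being folded in.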

if you need continuous searching of the all core to be available, then 
just setup that core on archive as a repeater and have some 
archive-query machines slaving off of it.


that should work.



-Hoss



SOLR Query

2010-08-12 Thread Moiz Bhukhiya
Hi there,


I've a problem querying SOLR for a specific field with a query string that
contains spaces. I added following lines in the schema.xml to add my own
defined fields. Fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec.

Since all these fields are beginning with ap_, I included the following
dynamicField:
<dynamicField name="*ap_*" type="text" indexed="true" stored="true"/>


I included this line to make a query for all fields instead of a specific
field:
<copyField source="ap_*" dest="text"/>

I added the following document in my index:

<add>
<doc>
<field name="id">1</field>
<field name="ap_name">Tom Cruise</field>
<field name="ap_address">San Fransisco</field>
</doc>
</add>

1. When I query q=Tom+Cruise, I should get the above document since it is
available in text, which is my default query field. [Works as expected]
2. When I query q=ap_address:Tom, I should not get the above document since
Tom is not available in ap_address. [Works as expected]
3. When I query q=ap_address:Tom+Cruise, I should not get the above
document BUT I GET IT. [Doesn't work as expected]

Could anyone please explain to me what mistake I am making?

Thanks alot, appreciate any help!
Moiz


Re: analysis tool vs. reality

2010-08-12 Thread Chris Hostetter

: Furthermore, I would like to add its not just the highlight matches
: functionality that is horribly broken here, but the output of the analysis
: itself is misleading.
: 
: lets say i take 'textTight' from the example, and add the following synonym:
: 
: this is broken = broke
: 
: the query time analysis is wrong, as it clearly shows synonymfilter
: collapsing this is broken to broke, but in reality with the qp for that
: field, you are gonna get 3 separate tokenstreams and this will never
: actually happen (because the qp will divide it up on whitespace first)
: 
: So really the output from 'Query Analyzer' is completely bogus.

analysis.jsp is only intended to explain *analysis* ... it accurately 
tells you what the analyzer type=query ... for the specified field (or 
fieldType) is going to produce given a hunk of text.

That is what it does, that is all that it does, that is all it has ever 
done, and all it has ever purported to do.

You say it's bogus because the qp will divide on whitesapce first -- but 
you're assuming you know what query parser will be used ... the field 
query parser (to name one) doesn't split on whitespace first.  That's my 
point: analysis.jsp doesn't make any assumptions about what query parser 
*might* be used, it just tells you what your analyzers do with strings.
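
For example, with the textTight synonym case quoted above, the field query
parser hands the whole string to the query analyzer without splitting it on
whitespace first -- a sketch:

q={!field f=textTight}this is broken

while the default lucene query parser would split that same input on
whitespace before any analyzer ever sees it.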

Saying the output of analysis.jsp is bogus because it doesn't take into
account QueryParsing is like saying the output of stats.jsp is bogus
because those are only the stats of the local solr instance on that
machine, and it doesn't do distributed stats -- yeah that would be nice to
have, but the stats.jsp never implies that's what it's giving you.

If there are ways we can make the purpose of analysis.jsp more obvious,
and less misleading for people who don't understand the distinction
between query parsing and analysis then I am all for it.  If you really
believe getting rid of the highlight check box is going to help, then
fine -- but I have yet to see any evidence that people who don't
understand the relationship between query parsing and analysis are
confused by the blue boxes.

What people seem to be confused by is when they see the same tokens
ultimately produced by both the index analyzer and the query analyzer
-- it doesn't matter if those tokens are in blue or not, if they see that
the tokens in the index analyzer output are a superset of the tokens in
the query analyzer output then they tend to assume that means searching
for the string in the query box will match documents containing the
string in the index text box.

Getting rid of the blue table cell is just going to make it harder to 
notice matching tokens in the output -- not reduce the confusion when 
those matching tokens exist in the output.

My question is: What can we do to make it more clear what the *purpose* of
analysis.jsp is?  Is there verbiage we can add to the page to make it more
obvious?

NOTE: I'm not just asking Robert, this is a question for the solr-user
community as a whole.  I *know* what analysis.jsp is for, I've never been
confused -- for people who have been confused in the past (or are still
confused) please help us understand what type of changes we could make to
the output of analysis.jsp to make its functionality more understandable.



-Hoss



Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 7:55 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 You say it's bogus because the qp will divide on whitesapce first -- but
 you're assuming you know what query parser will be used ... the field
 query parser (to name one) doesn't split on whitespace first.  That's my
 point: analysis.jsp doesn't make any assumptions about what query parser
 *might* be used, it just tells you what your analyzers do with strings.


you're right, we should just fix the bug that the queryparser tokenizes on
whitespace first. then analysis.jsp will be significantly less confusing.


-- 
Robert Muir
rcm...@gmail.com


Re: Solrj ContentStreamUpdateRequest Slow

2010-08-12 Thread Chris Hostetter

: It returns in around a second.  When I execute the attached code it takes just
: over three minutes.  The optimal for me would be able get closer to the
: performance I'm seeing with curl using Solrj.

I think your problem may be that StreamingUpdateSolrServer buffers up
commands and sends them in batches in a background thread.  If you want to
send individual updates in real time (and time them) you should just use
CommonsHttpSolrServer.
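
A minimal sketch of that with the SolrJ 1.4 API (the URL, file path, handler
path and literal parameter are assumptions, not taken from the original
code):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class TimedExtractUpdate {
  public static void main(String[] args) throws Exception {
    // plain per-request HTTP, no background batching thread
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    ContentStreamUpdateRequest req =
        new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("/path/to/document.pdf"));
    req.setParam("literal.id", "doc1");
    req.setParam("commit", "true");

    long start = System.currentTimeMillis();
    server.request(req);  // blocks until Solr has processed this one update
    System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
  }
}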


-Hoss



Re: analysis tool vs. reality

2010-08-12 Thread Chris Hostetter

:  You say it's bogus because the qp will divide on whitesapce first -- but
:  you're assuming you know what query parser will be used ... the field
:  query parser (to name one) doesn't split on whitespace first.  That's my
:  point: analysis.jsp doesn't make any assumptions about what query parser
:  *might* be used, it just tells you what your analyzers do with strings.
: 
: 
: you're right, we should just fix the bug that the queryparser tokenizes on
: whitespace first. then analysis.jsp will be significantly less confusing.

dude .. not trying to get into a holy war here

even if you change the Lucene QueryParser so that whitespace isn't a meta
character it doesn't affect the underlying issue: analysis.jsp is agnostic
about QueryParsers.  Some other QParser the user uses might have other
special behavior and if people don't understand the distinction between
QueryParsing and analysis they can still be confused -- hell even if the
only QParser anyone ever uses is the lucene QParser, and even if you get
the QueryParser changed so that whitespace isn't a metacharacter, we
are still going to be left with the fact that *other* characters (like '+'
and '-' and '' and '*' and ...) are metacharacters for that query parser,
and have special meaning.

analysis.jsp isn't going to know about those, or do anything special for
them -- so people can still be easily confused when analysis.jsp says
one thing about how the string +foo* -bar gets analyzed, but that
string as a query means something completely different.

Hence my point: leave arguments about QueryParser out of it -- how do we 
make the function of analysis.jsp more clear?


-Hoss



Re: Hierarchical faceting

2010-08-12 Thread Jayendra Patil
We were able to get the hierarchical faceting working with a workaround
approach.

e.g. if you have Europe//Norway//Oslo as an entry

1. Create a new multivalued field with string type

<field name="country_facet" type="string" indexed="true" stored="true"
multiValued="true"/>

2. Index the field for Europe//Norway//Oslo with values

0//Europe
1//Europe//Norway
2//Europe//Norway//Oslo

3. The facet can now be used in queries:

1st Level - Would return all entries @ 1st level e.g. 0//USA, 0//Europe

fq=

f.country_facet.facet.prefix=0//

facet.field=country_facet


2nd Level - Would return all entries @ second level in Europe
1//Europe//Norway, 1//Europe//Sweden

fq=country_facet:0//Europe

f.country_facet.facet.prefix=1//Europe

facet.field=country_facet



3rd Level - Would return the entries under 1//Europe//Norway, e.g. 2//Europe//Norway//Oslo

fq=country_facet:1//Europe//Norway

f.country_facet.facet.prefix=2//Europe//Norway

facet.field=country_facet

Increment the facet.prefix level by 1 so that you limit the facet results to
that prefix.
This also works for any depth.
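
Putting the second level together in one request, a sketch (the host, port
and the q parameter are assumptions) would look like:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=country_facet&f.country_facet.facet.prefix=1//Europe&fq=country_facet:0//Europe

The fq restricts the result set to documents under Europe, while the
facet.prefix restricts the returned facet values to the next level down
(1//Europe//Norway, 1//Europe//Sweden, ...).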

Regards,
Jayendra


On Thu, Aug 12, 2010 at 6:01 PM, Mats Bolstad mat...@stud.ntnu.no wrote:

 Hey all,

 I am doing a search on hierarchical data, and I have a hard time
 getting my head around the following problem.

 I want a result as follows, in one single query only:

 USA (3)
  California (2)
  Arizona (1)
 Europe (4)
  Norway (3)
  Oslo (3)
  Sweden (1)

 How it looks in the XML/JSON response is not really important, this is
 more a presentation issue. I guess I could store the values USA,
 USA/California, Europe/Norway/Oslo as strings for each document,
 and do some JavaScript-ing to show the hierarchies appropriately. When
 a specific item in the facet is selected, for example Norway, Solr
 could be queried with a filter query on Europe/Norway*?

 Does anyone have some experience they could please share with me?

 I have tried out SOLR-64, and it gives me the results I look for.
 However, I do not have the opportunity to use a patch in the
 production environment ...

 --
 Thanks,
 Mats Bolstad



Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-12 Thread Chris Hostetter
: 
:  That should still be true in the the official 4.0 release (i really should
:  have said When 4.0 can no longer read SOlr 1.4 indexes), ...
:  i havne't been following the detials closely, but i suspect that tool
:  hasn't been writen yet because there isn't much point until the full
:  details of the trunk index format are nailed down.

: This is news to me?
: 
: File formats are back-compatible between major versions. Version X.N should
: be able to read indexes generated by any version after and including version
: X-1.0, but may-or-may-not be able to read indexes generated by version
: X-2.N.

It was a big part of the proposal regarding the creation of the 3x
branch ... that index format compatibility between major versions would
no longer be supported by silently converting on first write -- instead
there would be a tool for explicit conversion...

http://search.lucidimagination.com/search/document/c10057266d3471c6/proposal_about_version_api_relaxation
http://search.lucidimagination.com/search/document/c494a78f1ec1bfb5/lucene_3_x_branch_created



-Hoss



Re: edismax pf2 and ps

2010-08-12 Thread Jayendra Patil
We pretty much had the same issue, ended up customizing the ExtendedDismax
code.

In your case it's just a change of a single line
addShingledPhraseQueries(query, normalClauses, phraseFields2, 2,
 tiebreaker, pslop);
to
addShingledPhraseQueries(query, normalClauses, phraseFields2, 2,
 tiebreaker, 0);
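
With that change in place, a request along these lines gives pf its large
slop while the pf2 shingles stay exact (a sketch; the field names, boosts
and ps value are assumptions):

q=black jacket red cap white shoes&defType=edismax&qf=text&pf=text^5&ps=50&pf2=text^10

The ps value then applies only to the pf phrase queries, while the pf2
phrases are built with a slop of 0.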

Regards,
Jayendra


On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer r...@0ape.com wrote:

 Short summary:

   Is there any way I can specify that I want a lot
   of phrase slop for the pf parameter, but none
   at all for the pf2 parameter?

 I find the 'pf' parameter with a pretty large 'ps' to do a very
 nice job for providing a modest boost to many documents that are
 quite well related to many queries in my system.

 In contrast, I find the 'pf2' parameter with zero 'ps' does
 extremely well at providing a high boost to documents that
 are often exactly what someone's searching for.

 Is there any way I can get both effects?

 Edismax's pf2 parameter is really nice for boosting exact phrases
 in queries like 'black jacket red cap white shoes'.   But as soon
 as even a little phrase slop (ps) is added, it seems like it starts
 boosting documents with red jackets and white caps just as much as
 those with black jackets and red caps.

 My gut feeling is that if I could have pf with a large phrase
 slop and the pf2 with zero phrase slop, it'd give me better overall
 results than any single phrase slop setting that gets applied to both.

 Is there any good way for me to test that?

  Thanks,
   Ron




Re: DIH and multivariable fields problems

2010-08-12 Thread Lance Norskog
Please add a JIRA issue for this.
https://issues.apache.org/jira/secure/BrowseProject.jspa

On Tue, Aug 10, 2010 at 6:59 PM, kenf_nc ken.fos...@realestate.com wrote:

 Glad I could help. I also would think it was a very common issue. Personally
 my schema is almost all dynamic fields. I have unique_id, content,
 last_update_date and maybe one other field specifically defined, the rest
 are all dynamic. This lets me accept an almost endless variety of document
 types into the same schema.  So if I planned on using DIH I had to come up
 with a way, and stitching together solutions to a couple related issues got
 me to my script transform. Mine is more convoluted than the one I gave here,
 but obviously you got the gist of the idea.


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1081738.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor

2010-08-12 Thread Lance Norskog
Please add a JIRA issue for this.

On Wed, Aug 11, 2010 at 6:24 AM, Sascha Szott sz...@zib.de wrote:
 Sorry, there was a mistake in the stack trace. The correct one is:

 SEVERE: Full Import failed
 org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
 value: /home/doe/foo is not a directory Processing Document # 3
        at
 org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
        at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
        at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
        at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
        at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
        at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
        at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
        at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
        at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

 -Sascha

 On 11.08.2010 15:18, Sascha Szott wrote:

 Hi folks,

 why does FileListEntityProcessor ignore onError=continue and abort
 indexing if a directory or a file does not exist?

 I'm using both XPathEntityProcessor and FileListEntityProcessor with
 onError set to continue. In case a directory or file is not present an
 Exception is thrown and indexing is stopped immediately.

 Below you can find a stack trace that is generated in case the directory
 /home/doe/foo does not exist:

 SEVERE: Full Import failed
 org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
 value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
 at

 org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)

 at

 org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)

 at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)

 at

 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)

 at

 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)

 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at

 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)

 at

 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)

 at

 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)


 How should I configure both processors so that missing directories and
 files are ignored and the indexing process does not stop immediately?

 Best,
 Sascha




-- 
Lance Norskog
goks...@gmail.com


Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 8:07 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 :  You say it's bogus because the qp will divide on whitesapce first --
 but
 :  you're assuming you know what query parser will be used ... the field
 :  query parser (to name one) doesn't split on whitespace first.  That's
 my
 :  point: analysis.jsp doesn't make any assumptions about what query
 parser
 :  *might* be used, it just tells you what your analyzers do with strings.
 : 
 :
 : you're right, we should just fix the bug that the queryparser tokenizes
 on
 : whitespace first. then analysis.jsp will be significantly less confusing.

 dude .. not trying to get into a holy war here

 actually I'm suggesting the practical solution: that we fix the primary
problem that makes it confusing.


 even if you change the Lucene QUeryParser so that whitespace isn't a meta
 character it doens't affect the underlying issue: analysis.jsp is agnostic
 about QueryParsers.


analysis.jsp isn't agnostic about queryparsers, it's ignorant of them, and
your default queryparser is actually a de-facto whitespace tokenizer, don't
try to sugarcoat it.

-- 
Robert Muir
rcm...@gmail.com


Re: Solr 1.4.1 and 3x: Grouping of query changes results

2010-08-12 Thread Chris Hostetter

:  Does not return document as expected:
:  id:1234 AND (-indexid:1 AND -indexid:2) AND -indexid:3
:  
:  Has anyone else experienced this? The exact placement of the parens isn't
:  key, just adding a level of nesting changes the query results.
...
: I could be wrong but I think this has to do with Solr's lack of support for
: purely negative queries, try the following and see if it behaves correctly:
: 
: id:1234 AND (*:* AND -indexid:1 AND -indexid:2) AND -indexid:3

1) correct.  In general a purely negative query can't work -- queries must 
select something, it doesn't matter if they are nested in another query or 
not.

the query string A AND (-B AND -C) AND -D says that a document must
match A and it must match a query which does not match anything and it
must not match D ... it's that middle clause that prevents anything from
matching.

Solr does support purely negative queries if they are the top level 
query (ie: q=-foo) but it doesn't rewrite nested sub queries (ie: q=foo 
(-bar -baz))

2) FWIW: setting aside the pure negative query aspect of this question,
changing the grouping of clauses can always affect the results of a query
-- this is because the grouping dictates the scoring (due to queryNorms
and coord factors) so A (B C (D E)) F can produce results in a very
different order than A B C D E F ... likewise A C -B will match
different documents than A (C -B)  (the latter will match a document
containing both A and B, the former will not)


-Hoss



Re: index pdf files

2010-08-12 Thread Chris Hostetter

: Subject: index pdf files
: References: aanlktim1wgref511p+unovqcu=b0usxnm8vxzn5bu...@mail.gmail.com
:  4c63ed43.4030...@r.email.ne.jp
:  aanlkti=28tulxqjtibrwcbxtok0avwbvbrjnxpdej...@mail.gmail.com
: In-Reply-To: aanlkti=28tulxqjtibrwcbxtok0avwbvbrjnxpdej...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Chris Hostetter

: Subject: Indexing large files using Solr Cell causes OutOfMemory error
: References: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com
: In-Reply-To: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: Filter Performance in Solr 1.3

2010-08-12 Thread Lance Norskog
There was a major Lucene change in filter handling from Solr 1.3 to
Solr 1.4. They are much much faster in 1.4. Really Lucene 2.4.1 to
Lucene 2.9.2. The filter is now consulted much earlier in the search
process, thus weeding out many more documents early.

It sounds like in Solr 1.3, you should only use filter queries for
queries with large document sets.
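
As a concrete sketch (the field and terms are assumptions), the difference
is between putting the restriction in q and in fq:

q=ipod AND type:video
q=ipod&fq=type:video

The second form lets Solr cache the type:video document set in the
filterCache and reuse it across queries, which is where the 1.4
improvements (and the caching benefit in general) show up.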

On Wed, Aug 11, 2010 at 12:21 PM, Bargar, Matthew B
matthew.bar...@verizonwireless.com wrote:
 The search with the filter takes longer than a search for the same term
 but no filter after repeated searches, after the cache should have come
 into play. To be more specific, this happens on filters that exclude
 very few results from the overall set.

 For instance, type:video returns few results and as one would expect,
 returns much quicker than a search without that filter.

 -type:video, on the other hand returns a lot of results and excludes
 very few, and actually takes longer than a search without any filter at
 all.

 Is this what one might expect when using a filter that excludes few
 results, or does it still seem like something strange might be
 happening?

 Thanks,
 Matt

 -Original Message-
 From: Geert-Jan Brits [mailto:gbr...@gmail.com]
 Sent: Wednesday, August 11, 2010 2:55 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Filter Performance in Solr 1.3

 fq's are the preferred way to filter when the same filter is
 often used (since the filter set can be cached separately).

 as to your direct question:
 My question is whether there is anything that can be done in 1.3 to
 help alleviate the problem, before upgrading to 1.4?

 I don't think so (perhaps some patches that I'm not aware of) .

 When are you seeing increased search time?

 is it the first time the filter is used? If that's the case: that's
 logical since the filter needs to be build.
 (fq)-filters only show their strength (as said above)  when you use them
 repeatedly.

 If on the other hand you're seeing slower response times with an
 fq-filter applied all the time than the same queries without the
 fq-filter, there must be something strange going on since this really
 shouldn't happen in normal situations.

 Geert-Jan





 2010/8/11 Bargar, Matthew B matthew.bar...@verizonwireless.com

 Hi there, I have a question about filter (fq) performance in Solr 1.3.
 After doing some testing it seems as though adding a filter increases
 search time. From what I've read here
 http://www.derivante.com/2009/06/23/solr-filtering-performance-increas
 e/

 and here
 http://www.lucidimagination.com/blog/2009/05/27/filtered-query-perform
 an
 ce-increases-for-solr-14/

 it seems as though upgrading to 1.4 would solve this problem. My
 question is whether there is anything that can be done in 1.3 to help
 alleviate the problem, before upgrading to 1.4? It becomes an issue
 because the majority of searches that are done on our site need some
 content type excluded or filtered for. Does it make sense to use the
 fq parameter in this way, or is there some better approach since
 filters are almost always used?

 Thank you!





-- 
Lance Norskog
goks...@gmail.com


Re: PDF file

2010-08-12 Thread Chris Hostetter

: Subject: PDF file
: References: 20100729152139.321c4...@ibis
:  aanlktinhby5iasd3q9iep7dr8tymajozvk8curih1...@mail.gmail.com
: In-Reply-To: aanlktinhby5iasd3q9iep7dr8tymajozvk8curih1...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: In multicore env, can I make it access core0 by default

2010-08-12 Thread Chris Hostetter

: In-Reply-To: aanlktimwvhxxdhpup5hl-2e1teh9pu6yetopgu=98...@mail.gmail.com
: References: aanlktimwvhxxdhpup5hl-2e1teh9pu6yetopgu=98...@mail.gmail.com
:  aanlktim46b_hcfpf2r6t=b8y_weq4bbhgi=8mappz...@mail.gmail.com
: Subject: In multicore env, can I make it access core0 by default

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: hl.usePhraseHighlighter

2010-08-12 Thread Chris Hostetter

: Subject: hl.usePhraseHighlighter
: References: 1281125904548-1031951.p...@n3.nabble.com
:  960560.55971...@web52904.mail.re2.yahoo.com
: In-Reply-To: 960560.55971...@web52904.mail.re2.yahoo.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking


-Hoss



Re: Indexing and ExtractingRequestHandler

2010-08-12 Thread Lance Norskog
This is probably true about Luke. The trunk has a new Lucene format
and does not read any previous format.  The trunk is a busy code base.
The 3.1 branch is slated to be the next Solr release, and is probably
a better base for your testing. Best of all is to use the Solr 1.4.1
binary release.

On Wed, Aug 11, 2010 at 8:08 PM, Harry Hochheiser hsh...@gmail.com wrote:
 Thanks.

 I've done Tika command line to parse the Excel file, and I see
 contents in it that don't appear to be indexed. I've tried the path of
 using Tika to parse the Excel and then using extracting request
 handler to index the resulting text, and that doesn't work either.

 As far as Luke goes, I've built it from scratch. Still bombs. Is it
 possible that it's not compatible with lucene  builds based on trunk?

 thanks,


 -harry

 On Wed, Aug 11, 2010 at 6:48 PM, Jan Høydahl / Cominvent
 jan@cominvent.com wrote:
 Hi,

  You can try the Tika command line to parse your Excel file, then you will see the
  exact textual output from it, which will be indexed into Solr, and thus
  inspect whether something is missing.

 Are you sure you use a version of Luke which supports your version of Lucene?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Training in Europe - www.solrtraining.com

 On 11. aug. 2010, at 23.33, Harry Hochheiser wrote:

 I'm trying to use Solr to index the contents of an Excel file, using
 the ExtractingRequestHandler (CSV handler won't work for me - I need
 to consider the whole spreadsheet as one document), and I'm running
 into some trouble.

 Is there any way to see what's going on during the indexing process?
 I'm concerned that I may be losing some terms, and I'd like to see if
 i can snoop on the terms that are added to the index as they go along.
 How might I do this?

 Barring that, how can I inspect the index post-fact?  I have tried to
 use luke to see what's in the index, but I get an error: Unknown
 format version -10. Is it possible to get luke to work?

 My solr build is straight out of SVN.

 thanks,

 harry






-- 
Lance Norskog
goks...@gmail.com


Re: Deleting with the DIH sometimes doesn't delete

2010-08-12 Thread Lance Norskog
Which version of Solr is this? How many documents are there in the
index? Etc. It is hard for us to help you without more details.


On Thu, Aug 12, 2010 at 8:32 AM, Qwerky neil.j.tay...@hmv.co.uk wrote:

 I'm doing deletes with the DIH but getting mixed results. Sometimes the
 documents get deleted, other times I can still find them in the index. What
 would prevent a doc from getting deleted?

 For example, I delete 594039 and get this in the logs;

 2010-08-12 14:41:55,625 [Thread-210] INFO  [DataImporter] Starting Delta
 Import
 2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Read
 productimportupdate.properties
 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Starting delta
 collection.
 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Running
 ModifiedRowKey() for Entity: item
 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
 ModifiedRowKey for Entity: item rows obtained : 0
 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
 DeletedRowKey for Entity: item rows obtained : 1
 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Completed
 parentDeltaQuery for Entity: item
 2010-08-12 14:41:55,625 [Thread-210] INFO  [DocBuilder] Deleting stale
 documents
 2010-08-12 14:41:55,625 [Thread-210] INFO  [SolrWriter] Deleting document:
 594039
 2010-08-12 14:41:55,703 [Thread-210] INFO  [SolrDeletionPolicy] newest
 commit = 1281030128383
 2010-08-12 14:41:55,718 [Thread-210] DEBUG [SolrIndexWriter] Opened Writer
 DirectUpdateHandler2
 2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Delta Import
 completed successfully
 2010-08-12 14:41:55,718 [Thread-210] INFO  [DocBuilder] Import completed
 successfully
 2010-08-12 14:41:55,718 [Thread-210] INFO  [DirectUpdateHandler2] start
 commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
 2010-08-12 14:42:08,562 [Thread-210] DEBUG [SolrIndexWriter] Closing Writer
 DirectUpdateHandler2
 2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy]
 SolrDeletionPolicy.onCommit: commits:num=2

 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_8,version=1281030128383,generation=8,filenames=[_39.frq,
 _2i.fdx, _39.tis, _39.prx, _39.fnm, _2i.fdt, _39.tii, _39.nrm, segments_8]

 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
 _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
 2010-08-12 14:42:10,437 [Thread-210] INFO  [SolrDeletionPolicy] newest
 commit = 1281030128384

 ..this works fine; I can no longer find 594039 in the index. But a little
 later I delete a couple more (33252 and 105224) and get the following (I
 added two docs at the same time);

 2010-08-12 15:27:42,828 [Thread-217] INFO  [DataImporter] Starting Delta
 Import
 2010-08-12 15:27:42,828 [Thread-217] INFO  [SolrWriter] Read
 productimportupdate.properties
 2010-08-12 15:27:42,828 [Thread-217] INFO  [DocBuilder] Starting delta
 collection.
 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Running
 ModifiedRowKey() for Entity: item
 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
 ModifiedRowKey for Entity: item rows obtained : 2
 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
 DeletedRowKey for Entity: item rows obtained : 2
 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Completed
 parentDeltaQuery for Entity: item
 2010-08-12 15:27:42,843 [Thread-217] INFO  [DocBuilder] Deleting stale
 documents
 2010-08-12 15:27:42,843 [Thread-217] INFO  [SolrWriter] Deleting document:
 33252
 2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy]
 SolrDeletionPolicy.onInit: commits:num=1

 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
 _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
 2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrDeletionPolicy] newest
 commit = 1281030128384
 2010-08-12 15:27:42,906 [Thread-217] DEBUG [SolrIndexWriter] Opened Writer
 DirectUpdateHandler2
 2010-08-12 15:27:42,906 [Thread-217] INFO  [SolrWriter] Deleting document:
 105224
 2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Delta Import
 completed successfully
 2010-08-12 15:27:42,906 [Thread-217] INFO  [DocBuilder] Import completed
 successfully
 2010-08-12 15:27:42,906 [Thread-217] INFO  [DirectUpdateHandler2] start
 commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
 2010-08-12 15:27:55,578 [Thread-217] DEBUG [SolrIndexWriter] Closing Writer
 DirectUpdateHandler2
 2010-08-12 15:27:56,875 [Thread-217] INFO  [SolrDeletionPolicy]
 SolrDeletionPolicy.onCommit: commits:num=2

 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx,
 _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]

 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_a,version=1281030128385,generation=10,filenames=[_3c.tis,
 _3c.fdt, 

Re: indexing???

2010-08-12 Thread Erick Erickson
Can you provide more details? What is the error you're receiving?
What do you think is going on?

It might be helpful if you reviewed:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Thu, Aug 12, 2010 at 8:21 AM, satya swaroop sswaro...@gmail.com wrote:

 Hi all,
    The indexing part of Solr is going well, but I got an error on indexing
  a single PDF file. When I searched for the error in the mailing list I
  found
  that the error was due to the copyright of that file. Can't we index a file
  which has copyright or any digital rights???

 regards,
   satya



Re: Results from More then One Cors?

2010-08-12 Thread Erick Erickson
There is no information to go on here. Please review
http://wiki.apache.org/solr/UsingMailingLists

and add some more details...

Best
Erick

On Thu, Aug 12, 2010 at 2:09 PM, Jörg Agatz joerg.ag...@googlemail.comwrote:

 Hello users,

 I tried to get results from more than one core,
 but I don't know how.

 Maybe you have an idea?

 I need it in PHP.

 King



Re: SOLR Query

2010-08-12 Thread Erick Erickson
You'll get a lot of insight into what's actually happening if you append
debugQuery=true to your queries, or check the debug checkbox
in the solr admin page.

But I suspect (and it's a guess since you haven't included your schema)
that your problem is that you're mixing explicit and default fields.
Something
like q=ap_address:Tom+Cruise, I think, gets parsed into something like
ap_address:tom + default_field:cruise

What happens if you try ap_address:(tom +cruise)?
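
For reference, two other fielded forms that keep both terms on ap_address,
in standard Lucene query syntax, are:

q=ap_address:"Tom Cruise"
q=ap_address:(Tom AND Cruise)

Either way the second term no longer falls back to the default search field.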

Best
Erick

On Thu, Aug 12, 2010 at 7:19 PM, Moiz Bhukhiya moiz.bhukh...@gmail.comwrote:

 Hi there,


 I've a problem querying SOLR for a specific field with a query string that
 contains spaces. I added following lines in the schema.xml to add my own
 defined fields. Fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec.

 Since all these fields are beginning with ap_, I included the the following
 dynamicField.
 dynamicField name=*ap_* type=text indexed=true stored=true/


 I included this line to make a query for all fields instead of a specfic
 field.
 copyField source=ap_* dest=text/

 I added the following document in my index:

 add
 doc
 field name=id1/field
 field name=ap_nameTom Cruise/field
 field name=ap_addressSan Fransisco/field
 /doc
 /add

 1. When I query q=Tom+Cruise, I should get the above document since it is
 available in text which ic my default query field. [Works as expected]
 2. When I query q=ap_address:Tom, I should not get the above document since
 Tom is not available in ap_address. [Works as expected]
 3. When I query q=ap_address:Tom+Cruise, I shouldnt not get the above
 document BUT I GET IT. {Doesnt work as expected]

 Could anyone please explain me what mistake I am making?

 Thanks alot, appreciate any help!
 Moiz



Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 8:29 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 It was a big part of the proposal regarding hte creation of hte 3x
 branch ... that index format compabtibility between major versions would
 no longer be supported by silently converted on first write -- instead
 there there would be a tool for explicit conversion...


 http://search.lucidimagination.com/search/document/c10057266d3471c6/proposal_about_version_api_relaxation

 http://search.lucidimagination.com/search/document/c494a78f1ec1bfb5/lucene_3_x_branch_created



Hoss, did you actually *read* these documents?


We will only provide a conversion tool that can convert indexes from
the last branch_3x up to this trunk (4.0) release, so they can be
read later, but may not contain terms with all current analyzers, so
people need mostly reindexing. Older indexes will not be able to be
read natively without conversion first (with maybe loss of analyzer
compatibility).



the fact 4.0 can read 3.x indexes *at all* without a converter tool is
only because Mike McCandless went the extra mile.


I don't see anything suggesting we should support any tools for 2.x indexes!

-- 
Robert Muir
rcm...@gmail.com


DataImportHandler and SAXParseExceptions with Jetty

2010-08-12 Thread harrysmith

Win XP, Solr 1.4.1 out of the box install, using Jetty. If I add a greater-than
or less-than sign (i.e. > or <) in any XML field and attempt to load or run from
the DataImportConsole I receive a SAXParseException. Example follows:

If I don't have a 'less than' it works just fine. I know this must work,
because the examples given on the wiki show deltaQueries using a greater
than/less than compare.


Relevant snippet from data-config.xml :

<entity name="item" query="select * from project_items where rownum < 500">

Stack trace received:
org.apache.solr.common.SolrException: FATAL: Could not create importer.
DataImporter config invalid
at
org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:121)
at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:222)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
Exception occurred while initializing context
at
org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:190)
at
org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:101)
at
org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
... 22 more
Caused by: org.xml.sax.SAXParseException: The value of attribute "query"
associated with an element type "null" must not contain the '<' character.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown
Source)
at
com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown
Source)
at
org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:178)
... 24 more
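
The parse failure is because a raw '<' is not legal inside an XML attribute
value; escaping it as &lt; should let the config load -- a sketch of the
same entity with the escape applied:

<entity name="item" query="select * from project_items where rownum &lt; 500">

The same goes for a literal '&' in the query, which would need to be written
as &amp;.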

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-and-SAXParseExceptions-with-Jetty-tp1125898p1125898.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR Query

2010-08-12 Thread Moiz Bhukhiya
I tried ap_address:(tom+cruise) and that worked. I am sure it's the same
problem as you suspected!

Thanks a lot Erick (& users!) for your time.
Moiz

On Thu, Aug 12, 2010 at 8:51 PM, Erick Erickson erickerick...@gmail.comwrote:

 You'll get a lot of insight into what's actually happening if you append
 debugQuery=true to your queries, or check the debug checkbox
 in the solr admin page.

 But I suspect (and it's a guess since you haven't included your schema)
 that your problem is that you're mixing explicit and default fields.
 Something
 like q=ap_address:Tom+Cruise, I think, gets parsed into something like
 ap_address:tom + default_field:cruise

 What happens if you try ap_address:(tom +cruise)?

 Best
 Erick

 On Thu, Aug 12, 2010 at 7:19 PM, Moiz Bhukhiya moiz.bhukh...@gmail.com
 wrote:

  Hi there,
 
 
  I've a problem querying SOLR for a specific field with a query string
 that
  contains spaces. I added following lines in the schema.xml to add my own
  defined fields. Fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec.
 
  Since all these fields are beginning with ap_, I included the the
 following
  dynamicField.
  dynamicField name=*ap_* type=text indexed=true stored=true/
 
 
  I included this line to make a query for all fields instead of a specfic
  field.
  copyField source=ap_* dest=text/
 
  I added the following document in my index:
 
  add
  doc
  field name=id1/field
  field name=ap_nameTom Cruise/field
  field name=ap_addressSan Fransisco/field
  /doc
  /add
 
  1. When I query q=Tom+Cruise, I should get the above document since it is
  available in text which ic my default query field. [Works as expected]
  2. When I query q=ap_address:Tom, I should not get the above document
 since
  Tom is not available in ap_address. [Works as expected]
  3. When I query q=ap_address:Tom+Cruise, I shouldnt not get the above
  document BUT I GET IT. {Doesnt work as expected]
 
  Could anyone please explain me what mistake I am making?
 
  Thanks alot, appreciate any help!
  Moiz