Re: How to index PDF file stored in SQL Server 2008

2011-04-11 Thread Roy Liu
Hi, all
Thank you very much for your kind help.

1. I have upgraded from Solr 1.4 to Solr 3.1.
2. Changed data-config-sql.xml:

<dataConfig>
  <dataSource type="JdbcDataSource"
              name="bsds"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
              user="username"
              password="pw"/>
  <dataSource name="docds" type="BinURLDataSource" />

  <document name="docs">
    <entity name="doc" dataSource="bsds"
            query="select id,attachment,filename from attachment where ext='pdf' and id30001030">
      <field column="id" name="id" />
      <entity dataSource="docds" processor="TikaEntityProcessor"
              url="${doc.attachment}" format="text">
        <field column="attachment" name="bs_attachment" />
      </entity>
      <field column="filename" name="title" />
    </entity>
  </document>
</dataConfig>

3. solrconfig.xml and schema.xml are NOT changed.

However, when I access
http://localhost:8080/solr/dataimport?command=full-import

It still has errors:
Full Import
failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query:[B@ae1393 Processing Document # 1

Could you give me some advice? This problem is really bothering me.
Thanks.

-- 
Best Regards,
Roy Liu


On Mon, Apr 11, 2011 at 5:16 AM, Lance Norskog goks...@gmail.com wrote:

 You have to upgrade completely to the Apache Solr 3.1 release. It is
 worth the effort. You cannot copy any jars between Solr releases.
 Also, you cannot copy over jars from newer Tika releases.

 On Fri, Apr 8, 2011 at 10:47 AM, Darx Oman darxo...@gmail.com wrote:
  Hi again
  what you are missing is field mapping
  <field column="id" name="id" />
  
 
 
  no need for TikaEntityProcessor  since you are not accessing pdf files
 



 --
 Lance Norskog
 goks...@gmail.com



Clustering with grouping

2011-04-11 Thread ramires
hi
We use a Solr trunk nightly (4.0). We grouped our results with no problem. When
we try to cluster these results with
clustering?q=rose&group=true&group.field=site we get a 500 error.

Problem accessing /solr/clustering. Reason:

null

java.lang.NullPointerException
at
org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:89)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:245)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1290)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


here is the solrconfig clustering part:

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">stc</str>
    <bool name="clustering.results">true</bool>
    <str name="carrot.title">title</str>
    <str name="carrot.url">url</str>
    <str name="carrot.snippet">content,url</str>
    <bool name="carrot.produceSummary">true</bool>
    <bool name="carrot.outputSubClusters">false</bool>
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="bf">recip(date,1,1000,1000)^0.3</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="rows">100</str>
    <str name="fl">score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Clustering-with-grouping-tp2805496p2805496.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing Best Practice

2011-04-11 Thread Darx Oman
Hi guys

I'm wondering how to best configure Solr to fulfill my requirements.

I'm indexing data from 2 data sources:
1- Database
2- PDF files (password encrypted)

Every file has related information stored in the database.  Both the file
content and the related database fields must be indexed as one document in
solr.  Among the DB data is *per-user* permissions for every document.

The file contents nearly never change; on the other hand, the DB data and
especially the permissions change very frequently, which requires me to
re-index everything for every modified document.

My problem is the process of decrypting the PDF files before re-indexing them,
which takes too much time for a large number of documents; a full re-index
could span days.

What I'm trying to accomplish is eliminating the need to re-index the PDF
content if it has not changed, even if the DB data changed.  I know this is not
possible in Solr, because Solr doesn't update documents in place.

So how to best accomplish this:

Can I use two indexes, one for PDF contents and the other for DB data, with a
common id field linking them, and have results treated as one document?


Re: How to index PDF file stored in SQL Server 2008

2011-04-11 Thread Roy Liu
Hi,

I have copied
\apache-solr-3.1.0\dist\apache-solr-dataimporthandler-extras-3.1.0.jar

into \apache-tomcat-6.0.32\webapps\solr\WEB-INF\lib\

Other Errors:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Unclosed
quotation mark after the character string 'B@3e574'.

-- 
Best Regards,
Roy Liu


On Mon, Apr 11, 2011 at 2:12 PM, Darx Oman darxo...@gmail.com wrote:

 Hi there

 Error is not clear...

 but did you copy apache-solr-dataimporthandler-extras-4.0-SNAPSHOT.jar
 to your solr\lib ?



Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Mike
Hi All,

I have the same issue. I have installed a Solr instance on Tomcat 6. When I try
to index a PDF I run into the exception below:

11 Apr, 2011 12:11:55 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NoClassDefFoundError:
org/apache/tika/exception/TikaException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at
org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ClassNotFoundException:
org.apache.tika.exception.TikaException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 22 more

I could not find any Tika jar file.
Could you please help me fix the above issue.

Thanks,
Mike

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805615.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 3.1 performance compared to 1.4.1

2011-04-11 Thread Marius van Zwijndregt
Hi Yonik !

Thanks for your reply.

I decided to switch to 3.1 and see if the performance would settle down
after building up a proper index. Looking at the average response time from
both installations I can see that 3.1 is now actually performing much better
than 1.4.1 (1.4.1 shows an average of 43ms, 3.1 shows 32ms).

My earlier test (with new keywords) now shows that 3.1 also outperforms
1.4.1 with keywords which have not yet been queried.

For the record, the tests are run on Ubuntu 10.04 (8GB RAM, quad core,
software RAID 1). I've given both installations a JVM with 1GB of RAM. I've
unpacked a new installation of 3.1 beside 1.4.1, and copied in the (in my
case) missing parts of the configuration (data importer, SQL XML config and
schema additions).

Cheers !

Marius

2011/4/10 Yonik Seeley yo...@lucidimagination.com

 On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt
 pionw...@gmail.com wrote:
  Hello !
 
  I'm new to the list, have been using SOLR for roughly 6 months and love
 it.
 
  Currently i'm setting up a 3.1 installation, next to a 1.4.1 installation
  (Ubuntu server, same JVM params). I have copied the configuration from
 1.4.1
  to the 3.1.
  Both version are running fine, but one thing ive noticed, is that the
 QTime
  on 3.1, is much slower for initial searches than on the (currently
  production) 1.4.1 installation.
 
  For example:
 
  Searching with 3.1; http://mysite:9983/solr/select?q=grasmaaier: QTime
  returns 371
  Searching with 1.4.1: http://mysite:8983/solr/select?q=grasmaaier: QTime
  returns 59
 
  Using debugQuery=true, i can see that the main time is spend in the query
  component itself (org.apache.solr.handler.component.QueryComponent).
 
  Can someone explain this, and how can i analyze this further ? Does it
 take
  time to build up a decent query, so could i switch to 3.1 without having
 to
  worry ?

 Thanks for the report... there's no reason that anything should really
 be much slower, so it would be great to get to the bottom of this!

 Is this using the same index as the 1.4.1 server, or did you rebuild it?

 Are there any other query parameters (that are perhaps added by
 default, like faceting or anything else that could take up time) or is
 this truly just a term query?

 What platform are you on?  I believe the Lucene Directory
 implementation now tries to be smarter (compared to lucene 2.9) about
 picking the best default (but it may not be working out for you for
 some reason).

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco



Re: How to index PDF file stored in SQL Server 2008

2011-04-11 Thread Roy Liu
I changed data-config-sql.xml to
<dataConfig>
  <dataSource type="JdbcDataSource"
              name="bsds"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
              user="username"
              password="pw"
              convertType="true"
              />

  <document name="docs">
    <entity name="doc" dataSource="bsds"
            query="select id,filename,attachment from attachment where ext='pdf' and id=3632">
      <field column="id" name="id" />
      <field column="filename" name="title" />
      <field column="attachment" name="bs_attachment" />
    </entity>
  </document>
</dataConfig>


There are no errors, but the indexed PDF content is converted to numbers:
200 1 202 1 203 1 212 1 222 1 236 1 242 1 244 1 254 1 255
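
(One possible direction, sketched here on the assumption that your DIH build ships
FieldStreamDataSource -- it may not exist in 3.1: the raw bytes of the attachment
column are being mapped straight into the index, which is why you see numbers;
streaming the BLOB through TikaEntityProcessor would extract the text instead.)

<dataConfig>
  <dataSource type="JdbcDataSource"
              name="bsds"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
              user="username"
              password="pw"/>
  <!-- reads a binary column of the outer entity and hands it to Tika as a stream -->
  <dataSource name="blobds" type="FieldStreamDataSource"/>

  <document name="docs">
    <entity name="doc" dataSource="bsds"
            query="select id,filename,attachment from attachment where ext='pdf'">
      <field column="id" name="id"/>
      <field column="filename" name="title"/>
      <entity name="tika" dataSource="blobds" processor="TikaEntityProcessor"
              dataField="doc.attachment" format="text">
        <!-- Tika exposes the extracted body under the "text" column -->
        <field column="text" name="bs_attachment"/>
      </entity>
    </entity>
  </document>
</dataConfig>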
-- 
Best Regards,
Roy Liu


On Mon, Apr 11, 2011 at 2:02 PM, Roy Liu liuchua...@gmail.com wrote:

 Hi, all
 Thank you very much for your kind help.

 1. I have upgraded from Solr 1.4 to Solr 3.1.
 2. Changed data-config-sql.xml:

 <dataConfig>
   <dataSource type="JdbcDataSource"
               name="bsds"
               driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
               url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
               user="username"
               password="pw"/>
   <dataSource name="docds" type="BinURLDataSource" />

   <document name="docs">
     <entity name="doc" dataSource="bsds"
             query="select id,attachment,filename from attachment where ext='pdf' and id30001030">
       <field column="id" name="id" />
       <entity dataSource="docds" processor="TikaEntityProcessor"
               url="${doc.attachment}" format="text">
         <field column="attachment" name="bs_attachment" />
       </entity>
       <field column="filename" name="title" />
     </entity>
   </document>
 </dataConfig>

 3. solrconfig.xml and schema.xml are NOT changed.

 However, when I access

 http://localhost:8080/solr/dataimport?command=full-import

 It still has errors:
 Full Import
 failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
 Unable to execute query:[B@ae1393 Processing Document # 1

 Could you give me some advice? This problem is really bothering me.
 Thanks.

 --
 Best Regards,
 Roy Liu



 On Mon, Apr 11, 2011 at 5:16 AM, Lance Norskog goks...@gmail.com wrote:

 You have to upgrade completely to the Apache Solr 3.1 release. It is
 worth the effort. You cannot copy any jars between Solr releases.
 Also, you cannot copy over jars from newer Tika releases.

 On Fri, Apr 8, 2011 at 10:47 AM, Darx Oman darxo...@gmail.com wrote:
  Hi again
  what you are missing is field mapping
   <field column="id" name="id" />
  
 
 
  no need for TikaEntityProcessor  since you are not accessing pdf files
 



 --
 Lance Norskog
 goks...@gmail.com





Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Roy Liu
\apache-solr-3.1.0\contrib\extraction\lib\tika*.jar
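
Those jars have to end up on Solr's classpath. Copying them into the webapp's
WEB-INF\lib works; alternatively, a sketch of solrconfig.xml lib directives
(the dir paths are relative to the core's instanceDir and will differ per install):

  <lib dir="../../contrib/extraction/lib" />
  <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />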

-- 
Best Regards,
Roy Liu


On Mon, Apr 11, 2011 at 3:10 PM, Mike satish01sud...@gmail.com wrote:

 Hi All,

 I have the same issue. I have installed a Solr instance on Tomcat 6. When I try
 to index a PDF I run into the exception below:

 11 Apr, 2011 12:11:55 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.NoClassDefFoundError:
 org/apache/tika/exception/TikaException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at

 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at
 org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
at

 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
at

 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at

 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at

 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at

 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at

 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at

 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at

 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.tika.exception.TikaException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 22 more

 I could not find any Tika jar file.
 Could you please help me fix the above issue.

 Thanks,
 Mike

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805615.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread karsten-solr
Hi Lance,

your are right:
XPathEntityProcessor has the attribut xsl, so I can use xslt to generate a 
xml-File in the form of the standard Solr update schema.
I will check the performance of this.


Best regards
  Karsten


btw, flatten is an attribute of the field tag, not of XPathEntityProcessor
(as wrongly specified in the wiki)


 Lance
 There is an option somewhere to use the full XML DOM implementation
 for using xpaths. The purpose of the XPathEP is to be as simple and
 dumb as possible and handle most cases: RSS feeds and other open
 standards.
 
 Search for xsl(optional)
 
 http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
 
 Karsten
 On Sat, Apr 9, 2011 at 5:32 AM
  Hi Folks,
 
   has anyone improved DIH XPathRecordReader to deal with nested xpaths?
   e.g. data-config.xml with
    <entity .. processor="XPathEntityProcessor" ..>
      <field column="title" xpath="//body/h1"/>
      <field column="alltext" xpath="//body" flatten="true"/>
   and the XML stream contains
    /html/body/h1...
   will only fill field "alltext" but field "title" will be empty.
  
   This is a known issue from 2009
  
  https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
  
   So three questions:
   1. How to fill a "search over all" field without nested xpaths?
     (schema.xml <copyField source="*" dest="alltext"/> will not help,
  because we lose the original token order)
   2. Has anyone tried to improve XPathRecordReader to deal with nested
  xpaths?
   3. Does anyone else need this feature?
 
 
  Best regards
   Karsten
 
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html


RE: Solr under Tomcat

2011-04-11 Thread Mike
Hi All,

I have installed a Solr instance on Tomcat 6. When I tried to index the PDF
file I was able to see this response:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">479</int>
  </lst>
</response>


Query:
http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

But when I tried to search the content in the PDF I could not get any
results:



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">struts</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>




 
Could you please let me know if I am doing anything wrong. It works fine
when I try it with the default Jetty server, prior to integrating with Tomcat 6.

I have followed installation steps from
http://wiki.apache.org/solr/SolrTomcat
(Tomcat on Windows Single Solr app).

Thanks,
Mike



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-under-Tomcat-tp2613501p2805970.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Mike
Hi Roy,

Thank you for the quick reply. When I tried to index the PDF file I was able
to see this response:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">479</int>
  </lst>
</response>



Query:
http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

But when I tried to search the content in the PDF I could not get any
results:



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">struts</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>




 
Could you please let me know if I am doing anything wrong. It works fine
when I try it with the default Jetty server, prior to integrating with Tomcat 6.

I have followed installation steps from
http://wiki.apache.org/solr/SolrTomcat
(Tomcat on Windows Single Solr app).

Thanks,
Mike



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805974.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Gary Taylor

Jayendra,

Thanks for the info - been keeping an eye on this list in case this 
topic cropped up again.  It's currently a background task for me, so 
I'll try and take a look at the patches and re-test soon.


Joey - glad you brought this issue up again.  I haven't progressed any 
further with it.  I've not yet moved to Solr 3.1 but it's on my to-do 
list, as is testing out the patches referenced by Jayendra.  I'll post 
my findings on this thread - if you manage to test the patches before 
me, let me know how you get on.


Thanks and kind regards,
Gary.


On 11/04/2011 05:02, Jayendra Patil wrote:

The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches. (Solr
Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzelphan...@nearinfinity.com  wrote:

Hi Gary,

I have been experiencing the same problem... Unable to extract content from
archive file formats.  I just tried again with a clean install of Solr 3.1.0
(using Tika 0.8) and continue to experience the same results.  Did you have
any success with this problem with Solr 1.4.1 or 3.1.0 ?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true"
-H "application/octet-stream" -F myfile=@data.zip

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first message,
some people have been able to get this functionality to work as desired.






--
Gary Taylor
INOVEM

Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE



Spellchecker with synonyms

2011-04-11 Thread royr
Hello,

I have some synonyms for city names. Sometimes there are multiple names for
one city, for example:

newyork, newyork city, big apple

I search for "big apple" and get results with "new york" (a synonym).
If somebody searches for "big aple" I want a spelling suggestion like "big
apple". How can I make these synonyms available to the spellchecker?









--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecker-with-synonyms-tp2806028p2806028.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellchecker with synonyms

2011-04-11 Thread lboutros
Did you configure synonyms for your field at query time ?

Ludovic.

2011/4/11 royr [via Lucene] ml-node+2806028-1349039134-383...@n3.nabble.com


 Hello,

 I have some synonyms for city names. Sometimes there are multiple names for
 one city, example:.

 newyork, newyork city, big apple

 I search for big apple and get results with new york(synonym)
 If somebody search for big aple i want a spelling suggestion like: big
 apple. How can i fix that synonyms
 are available for the spellchecker?









 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://lucene.472066.n3.nabble.com/Spellchecker-with-synonyms-tp2806028p2806028.html
  To start a new topic under Solr - User, email
 ml-node+472068-1765922688-383...@n3.nabble.com
 To unsubscribe from Solr - User, click 
 herehttp://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=472068code=Ym91dHJvc2xAZ21haWwuY29tfDQ3MjA2OHw0Mzk2MDUxNjE=.




-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecker-with-synonyms-tp2806028p2806113.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Spellchecker with synonyms

2011-04-11 Thread royr
Yes, it looks like this:

[field type definition omitted; the XML did not survive the mail archive]

It will work at query and index time, I think.
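
For reference, a minimal sketch of a spellcheck source field type that applies
synonyms at index time, so the synonym terms end up in the dictionary the
spellchecker is built from (field and file names are only examples):

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expand synonyms while indexing so they become candidate suggestions -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>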

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecker-with-synonyms-tp2806028p2806157.html
Sent from the Solr - User mailing list archive at Nabble.com.


XML not coming through from nabble to Gmail

2011-04-11 Thread Erick Erickson
All:

Lately I've been seeing a lot of posts where people paste in parts of their
schema.xml or solrconfig.xml and the results are...er...disappointing. None
of the less-than or greater-than symbols show and the formatting is all over
the map.

Since some mails would come through with the XML formatted and some would be
wonky, at first I thought it was the sender, but then a pretty high
percentage came through this way. So I poked around, and it seems the XML is
only wonkified (tm) when it comes to Gmail from Nabble; the original post on
Nabble has the markup and displays fine. Behavior is the same in Chrome and
Firefox, BTW.

Does anyone have any insight into this? Time to complain to the nabble
folks? Do others see this with non-Gmail relays?

Thanks,
Erick


Can I set up a config-based distributed search

2011-04-11 Thread Ran Peled
In the Distributed Search page (
http://wiki.apache.org/solr/DistributedSearch), it is documented that in
order to perform a distributed search over a sharded index, I should use the
shards request parameter, listing the shards to participate in the search
(e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
new, pretty large index (1B+ items).  Say I have 100 shards; specifying the
shards on the request URL becomes unrealistic due to the length of the URL.  It is
also redundant to do that on every request.

Is there a way to specify the list of shards in a configuration file,
instead of on the query URL?  I have seen references to relevant config in
SolrCloud, but as I understand it planned to be released only in Solr 4.0.

Thanks,
Ran


Re: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Michael McCandless
Tom,

I think I see where this may be -- it looks like another >2B terms
bug in Lucene (we are using an int instead of a long in the
TermInfoAndOrd class inside TermInfosReader.java), only present in
3.1.

I'm also mad that Test2BTerms fails to catch this!!  I will go fix
that test and confirm it sees this bug.

Can you build from source?  If so, try this patch:

Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
===
--- lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(revision
1089906)
+++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(working copy)
@@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
-final int termOrd;
-public TermInfoAndOrd(TermInfo ti, int termOrd) {
+final long termOrd;
+public TermInfoAndOrd(TermInfo ti, long termOrd) {
   super(ti);
   this.termOrd = termOrd;
 }
@@ -245,7 +245,7 @@
 // wipe out the cache when they iterate over a large numbers
 // of terms in order
 if (tiOrd == null) {
-  termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+  termsCache.put(cacheKey, new TermInfoAndOrd(ti,
enumerator.position));
 } else {
   assert sameTermInfo(ti, tiOrd, enumerator);
   assert (int) enumerator.position == tiOrd.termOrd;
@@ -262,7 +262,7 @@
 // random-access: must seek
 final int indexPos;
 if (tiOrd != null) {
-  indexPos = tiOrd.termOrd / totalIndexInterval;
+  indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
 } else {
   // Must do binary search:
   indexPos = getIndexOffset(term);
@@ -274,7 +274,7 @@
 if (enumerator.term() != null  term.compareTo(enumerator.term()) == 0) {
   ti = enumerator.termInfo();
   if (tiOrd == null) {
-termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
   } else {
 assert sameTermInfo(ti, tiOrd, enumerator);
 assert (int) enumerator.position == tiOrd.termOrd;

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The query below results in an array out of bounds exception:
 select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

 Here is the exception:
  Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

 We are using a dev version of Solr/Lucene:

 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
 Lucene Specification Version: 3.1-SNAPSHOT
 Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

 Just before the exception we see this entry in our tomcat logs:

 Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
 INFO: UnInverted multi-valued field 
 {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
 Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute

 Is this a known bug?  Can anyone provide a clue as to how we can determine 
 what the problem is?

 Tom Burton-West


 Appended Below is the exception stack trace:

 SEVERE: Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
        at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:271)
        at 
 org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:338)
        at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:928)
        at 
 org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:1055)
        at 
 org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:659)
        at 
 org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
        at 
 org.apache.solr.request.NumberedTermEnum.skipTo(UnInvertedField.java:1018)
        at 
 org.apache.solr.request.UnInvertedField.getTermText(UnInvertedField.java:838)
        at 
 org.apache.solr.request.UnInvertedField.getCounts(UnInvertedField.java:617)
        at 
 org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:279)
        at 
 org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:312)
        at 
 org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:174)
        at 
 org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
        at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)

Re: Can I set up a config-based distributed search

2011-04-11 Thread lboutros
You can add the shards parameter to your search handler:

<requestHandler name="dist-search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">host1/solr,host2/solr</str>
  </lst>
</requestHandler>

Is this what you are looking for?

Ludovic.

2011/4/11 Ran Peled [via Lucene] 
ml-node+2806331-346788257-383...@n3.nabble.com

 In the Distributed Search page (
 http://wiki.apache.org/solr/DistributedSearch), it is documented that in
 order to perform a distributed search over a sharded index, I should use
 the
 shards request parameter, listing the shards to participate in the search

 (e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
 new pretty large index (1B+ items).  Say I have a 100 shards, specifying
 the
 shards on the request URL becomes unrealistic due to length of URL.  It is
 also redundant to do that on every request.

 Is there a way to specify the list of shards in a configuration file,
 instead of on the query URL?  I have seen references to relevant config in
 SolrCloud, but as I understand it planned to be released only in Solr 4.0.

 Thanks,
 Ran


 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://lucene.472066.n3.nabble.com/Can-I-set-up-a-config-based-distributed-search-tp2806331p2806331.html
  To start a new topic under Solr - User, email
 ml-node+472068-1765922688-383...@n3.nabble.com
 To unsubscribe from Solr - User, click 
 herehttp://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=472068code=Ym91dHJvc2xAZ21haWwuY29tfDQ3MjA2OHw0Mzk2MDUxNjE=.




-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-I-set-up-a-config-based-distributed-search-tp2806331p2806763.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Is there a way to create multiple doc using DIH and access the data pertaining to a particular doc name ?

2011-04-11 Thread Mike
Hi All,

I am new to Solr. I want to implement Solr search.

I have to implement two search buttons (1. books and 2. computers; both are in
the same data source) which are completely different; there is no relation
between them.
Could you please let me know how to define the entities in data-config.xml and
also in schema.xml.

Is it possible to do something like the following?

[example configuration omitted; the XML did not survive the mail archive]
Thanks,
Mike
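
A rough sketch of the kind of data-config.xml the question seems to describe --
two independent root entities under one document; the table and column names
here are made up, and the ids must stay unique across both entities:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="..." url="..." user="..." password="..."/>
  <document>
    <!-- each entity can be imported on its own with
         /dataimport?command=full-import&entity=book -->
    <entity name="book" query="select id, title, author from books">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="author" name="author"/>
    </entity>
    <entity name="computer" query="select id, model, brand from computers">
      <field column="id" name="id"/>
      <field column="model" name="model"/>
      <field column="brand" name="brand"/>
    </entity>
  </document>
</dataConfig>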
 
  


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-create-multiple-doc-using-DIH-and-access-the-data-pertaining-to-a-particular-doc-n-tp1877203p2806787.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Best Practice

2011-04-11 Thread Shaun Campbell
If it's of any help I've split the processing of PDF files from the
indexing. I put the PDF content into a text file (but I guess you could load
it into a database) and use that as part of the indexing.  My processing of
the PDF files also compares timestamps on the document and the text file so
that I'm only processing documents that have changed.

I am a newbie so perhaps there are more sophisticated approaches.

Hope that helps.
Shaun

On 11 April 2011 07:20, Darx Oman darxo...@gmail.com wrote:

 Hi guys

 I'm wondering how to best configure solr to fulfills my requirements.

 I'm indexing data from 2 data sources:
 1- Database
 2- PDF files (password encrypted)

 Every file has related information stored in the database.  Both the file
 content and the related database fields must be indexed as one document in
 solr.  Among the DB data is *per-user* permissions for every document.

 The file contents nearly never change, on the other hand, the DB data and
 especially the permissions change very frequently which require me to
 re-index everything for every modified document.

 My problem is in process of decrypting the PDF files before re-indexing
 them
 which takes too much time for a large number of documents, it could span to
 days in full re-indexing.

 What I'm trying to accomplish is eliminating the need to re-index the PDF
 content if not changed even if the DB data changed.  I know this is not
 possible in solr, because solr doesn't update documents.

 So how to best accomplish this:

 Can I use 2 indexes one for PDF contents and the other for DB data and have
 a common id field for both as a link between them, *and results are treated
 as one Document*?



Reloading synonyms.txt without downtime

2011-04-11 Thread Otis Gospodnetic
Hi,

Apparently, when one RELOADs a core, the synonyms file is not reloaded.  Is this
the expected behaviour?  Is it the desired behaviour?

Here's the use-case:
When one is doing purely query-time synonym expansion, ideally one would be able
to edit synonyms.txt and get it reloaded, so that the changes can start taking
effect immediately.

One might think that RELOADing a Solr core would achieve this, but apparently
this doesn't happen.  Should it?
Are there technical reasons why RELOADing a core should not reload the synonyms
file? (Other than if synonyms are used at index time, changing the synonyms
would mean that one has to reindex old docs in order for changes to synonyms to
apply to old docs.)

Issue https://issues.apache.org/jira/browse/SOLR-1307 mentions this a bit, but
doesn't go into a lot of depth.

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Can I set up a config-based distributed search

2011-04-11 Thread Jonathan Rochkind
I have not worked with shards/distributed, but I think you can probably 
specify them as defaults in your requesthandler in your solrconfig.xml 
instead.


Somewhere there is (or was) a wiki page on this I can't find right now. 
There's a way to specify (for a particular request handler) a default 
parameter value, such as for 'shards', that will be used if none were 
given with the request. There's also a way to specify an invariant that 
will always be used even if something else is passed in on the request.


Ah, found it: http://wiki.apache.org/solr/SearchHandler#Configuration
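
For example, something along these lines (a sketch; the handler name and host
names are placeholders):

<requestHandler name="/distsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- used whenever the request itself does not pass shards= -->
    <str name="shards">shard1:8983/solr,shard2:8983/solr</str>
  </lst>
  <!-- or put the parameter in <lst name="invariants"> to override
       whatever the request passes -->
</requestHandler>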

On 4/11/2011 8:31 AM, Ran Peled wrote:

In the Distributed Search page (
http://wiki.apache.org/solr/DistributedSearch), it is documented that in
order to perform a distributed search over a sharded index, I should use the
shards request parameter, listing the shards to participate in the search
(e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
new pretty large index (1B+ items).  Say I have a 100 shards, specifying the
shards on the request URL becomes unrealistic due to length of URL.  It is
also redundant to do that on every request.

Is there a way to specify the list of shards in a configuration file,
instead of on the query URL?  I have seen references to relevant config in
SolrCloud, but as I understand it planned to be released only in Solr 4.0.

Thanks,
Ran



Re: Performance with search terms starting and ending with wildcards

2011-04-11 Thread Otis Gospodnetic
Hi,

Perhaps you should give Lucene/Solr trunk a try and compare!  The Wildcard 
query 
in trunk should be much faster.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Ueland tor.henn...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Sun, April 10, 2011 10:44:46 AM
 Subject: Performance with search terms starting and ending with wildcards
 
 Hi!
 
 I have been doing some testing with Solr and wildcards. Queries like:
 
 - *foo
 - foo*
 
 complete quickly (1-2s) in a test index of about 40-50GB.
 
 But when I try to do a search for *foo*, the search time can easily
 climb to 30 seconds plus.
 
 Any ideas on how that issue can be worked around?
 
 One fix would be to change *foo* to (*foo or foo* or oof* or *oof) (is the
 reverse even needed?). But that will not give the same results as *foo*,
 logically enough.
 
 I have also tried to set maxTimeAllowed, but that is simply ignored. I guess
 that is related to either sorting or the wildcard search itself.
 
 --
 View this  message in context: 
http://lucene.472066.n3.nabble.com/Performance-with-search-terms-starting-and-ending-with-wildcards-tp2802561p2802561.html

 Sent  from the Solr - User mailing list archive at Nabble.com.
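
One workaround worth mentioning for the leading-wildcard part of the problem,
sketched on the assumption that a schema change and re-index are acceptable:
ReversedWildcardFilterFactory also indexes each token reversed, so a query like
*foo can be rewritten internally into a cheap trailing-wildcard query (it does
not help the double-wildcard *foo* case much):

<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index reversed tokens so leading wildcards become cheap -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>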
 


Clarifying fetchindex command

2011-04-11 Thread Otis Gospodnetic
Hi,

Can one actually *force* replication of the index from the master without a 
commit being issued on the master since the last replication?

I do see Force a fetchindex on slave from master command: 
http://slave_host:port/solr/replication?command=fetchindex; on 
http://wiki.apache.org/solr/SolrReplication#HTTP_API, but that feels more like 
force the replication *now* instead of waiting for the slave to poll the 
master than force the replication even if there is no new commit point and no 
new index version on the master.  Which one is it, really?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
Thanks Mike,

At first I thought this couldn't be related to the 2.1 Billion terms issue 
since the only place we have tons of terms is in the OCR field and this is not 
the OCR field. But then I remembered that the total number of terms in all 
fields is what matters. We've had no problems with regular searches against the 
index or with other facet queries.  Only with this facet.   Is TermInfoAndOrd 
only used for faceting?

I'll go ahead and build the patch and let you know.


Tom

p.s. Here is the field definition:
<field name="topicStr" type="string" indexed="true" stored="false"
       multiValued="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"
           omitNorms="true"/>


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, April 11, 2011 8:40 AM
To: solr-user@lucene.apache.org
Cc: Burton-West, Tom
Subject: Re: ArrayIndexOutOfBoundsException with facet query

Tom,

I think I see where this may be -- it looks like another >2B terms
bug in Lucene (we are using an int instead of a long in the
TermInfoAndOrd class inside TermInfosReader.java), only present in
3.1.

I'm also mad that Test2BTerms fails to catch this!!  I will go fix
that test and confirm it sees this bug.

Can you build from source?  If so, try this patch:

Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
===
--- lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(revision
1089906)
+++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(working copy)
@@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
-final int termOrd;
-public TermInfoAndOrd(TermInfo ti, int termOrd) {
+final long termOrd;
+public TermInfoAndOrd(TermInfo ti, long termOrd) {
   super(ti);
   this.termOrd = termOrd;
 }
@@ -245,7 +245,7 @@
 // wipe out the cache when they iterate over a large numbers
 // of terms in order
 if (tiOrd == null) {
-  termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+  termsCache.put(cacheKey, new TermInfoAndOrd(ti,
enumerator.position));
 } else {
   assert sameTermInfo(ti, tiOrd, enumerator);
   assert (int) enumerator.position == tiOrd.termOrd;
@@ -262,7 +262,7 @@
 // random-access: must seek
 final int indexPos;
 if (tiOrd != null) {
-  indexPos = tiOrd.termOrd / totalIndexInterval;
+  indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
 } else {
   // Must do binary search:
   indexPos = getIndexOffset(term);
@@ -274,7 +274,7 @@
 if (enumerator.term() != null  term.compareTo(enumerator.term()) == 0) {
   ti = enumerator.termInfo();
   if (tiOrd == null) {
-termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
   } else {
 assert sameTermInfo(ti, tiOrd, enumerator);
 assert (int) enumerator.position == tiOrd.termOrd;

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The query below results in an array out of bounds exception:
 select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

 Here is the exception:
  Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

 We are using a dev version of Solr/Lucene:

 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
 Lucene Specification Version: 3.1-SNAPSHOT
 Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

 Just before the exception we see this entry in our tomcat logs:

 Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
 INFO: UnInverted multi-valued field 
 {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
 Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute

 Is this a known bug?  Can anyone provide a clue as to how we can determine 
 what the problem is?

 Tom Burton-West


 Appended Below is the exception stack trace:

 SEVERE: Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
        at 
 org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:271)
        at 
 org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:338)
        at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:928)
        at 
 

RE: Problems indexing very large set of documents

2011-04-11 Thread Brandon Waterloo
I found a simpler command-line method to update the PDF files.  On some
documents it works perfectly: the result is a pixel-for-pixel match and none of
the OCR text (which is what all these PDFs are, newspaper articles that have
been passed through OCR) is lost.  However, on other documents the result is
considerably blurrier and some of the OCR text is lost.

We've decided to skip any documents that Tika cannot index for now.

As Lance stated, it's not specifically the version that causes the problem but
rather quirks introduced by different PDF writers; a few tests have confirmed
this, so we can't use the version to determine which documents should be skipped.
I'm examining the XML responses from the queries, and I cannot figure out how to
tell from the XML response whether or not a document was successfully indexed.
The status value seems to be 0 regardless of whether indexing was successful or
not.

So my question is, how can I tell from the response whether or not indexing was 
actually successful?

~Brandon Waterloo


From: Lance Norskog [goks...@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText. It parses and writes PDFs very very
well, and a simple program will let you do a batch conversion.  PDFs
are made by a wide range of programs, not just Adobe code. Many of
these do weird things and make small mistakes that Tika does not know
to handle. In other words there is dirty PDF just like dirty HTML.

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
brandon.water...@matrix.msu.edu wrote:
 I think I've finally found the problem.  The files that work are PDF version 
 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into 
 updating all the old documents to PDF 1.6.

 Thanks everyone!

 ~Brandon Waterloo
 
 From: Ezequiel Calderara [ezech...@gmail.com]
 Sent: Friday, April 08, 2011 11:35 AM
 To: solr-user@lucene.apache.org
 Cc: Brandon Waterloo
 Subject: Re: Problems indexing very large set of documents

 Maybe those files are created with a different Adobe Format version...

 See this: 
 http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

 On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo 
 brandon.water...@matrix.msu.edumailto:brandon.water...@matrix.msu.edu 
 wrote:
 A second test has revealed that it is something to do with the contents, and 
 not the literal filenames, of the second set of files.  I renamed one of the 
 second-format files and tested it and Solr still failed.  However, the 
 problem still only applies to those files of the second naming format.
 
 From: Brandon Waterloo 
 [brandon.water...@matrix.msu.edumailto:brandon.water...@matrix.msu.edu]
 Sent: Friday, April 08, 2011 10:40 AM
 To: solr-user@lucene.apache.orgmailto:solr-user@lucene.apache.org
 Subject: RE: Problems indexing very large set of documents

 I had some time to do some research into the problems.  From what I can tell, 
 it appears Solr is tripping up over the filename.  These are strictly 
 examples, but, Solr handles this filename fine:

 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

 However, it fails with either a parsing error or an EOF exception on this 
 filename:

 32-130-A08-84-al.sff.document.nusa197102.pdf

 The only significant difference is that the second filename contains multiple 
 periods.  As there are about 1700 files whose filenames are similar to the 
 second format it is simply not possible to change their filenames.  In 
 addition they are being used by other applications.

 Is there something I can change in Solr configs to fix this issue or am I 
 simply SOL until the Solr dev team can work on this? (assuming I put in a 
 ticket)

 Thanks again everyone,

 ~Brandon Waterloo


 
 From: Chris Hostetter 
 [hossman_luc...@fucit.orgmailto:hossman_luc...@fucit.org]
 Sent: Tuesday, April 05, 2011 3:03 PM
 To: solr-user@lucene.apache.orgmailto:solr-user@lucene.apache.org
 Subject: RE: Problems indexing very large set of documents

 : It wasn't just a single file, it was dozens of files all having problems
 : toward the end just before I killed the process.
   ...
 : That is by no means all the errors, that is just a sample of a few.
 : You can see they all threw HTTP 500 errors.  What is strange is, nearly
 : every file succeeded before about the 2200-files-mark, and nearly every
 : file after that failed.

 ..the root question is: do those files *only* fail if you have already
 indexed ~2200 files, or do they fail if you start up your server and index
 them first?

 there may be a resource issued 

Re: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Michael McCandless
Right, it's the total number of terms across all fields... unfortunately.

This class is used to enroll a term into the terms cache that wraps
the terms dictionary, so in theory you could also hit this issue
during normal searching when a term is looked up once,  and then
looked up again (the 2nd time will pull from the cache).

I've mod'd Test2BTerms and am running it now...

Mike

http://blog.mikemccandless.com

On Mon, Apr 11, 2011 at 12:51 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Mike,

 At first I thought this couldn't be related to the 2.1 Billion terms issue 
 since the only place we have tons of terms is in the OCR field and this is 
 not the OCR field. But then I remembered that the total number of terms in 
 all fields is what matters. We've had no problems with regular searches 
 against the index or with other facet queries.  Only with this facet.   Is 
 TermInfoAndOrd only used for faceting?

 I'll go ahead and build the patch and let you know.


 Tom

 p.s. Here is the field definition:
 <field name="topicStr" type="string" indexed="true" stored="false"
        multiValued="true"/>
 <fieldType name="string" class="solr.StrField" sortMissingLast="true"
            omitNorms="true"/>


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Monday, April 11, 2011 8:40 AM
 To: solr-user@lucene.apache.org
 Cc: Burton-West, Tom
 Subject: Re: ArrayIndexOutOfBoundsException with facet query

 Tom,

  I think I see where this may be -- it looks like another >2B terms
  bug in Lucene (we are using an int instead of a long in the
  TermInfoAndOrd class inside TermInfosReader.java), only present in
  3.1.

 I'm also mad that Test2BTerms fails to catch this!!  I will go fix
 that test and confirm it sees this bug.

 Can you build from source?  If so, try this patch:

 Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
 ===
 --- lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
 (revision
 1089906)
 +++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
 (working copy)
 @@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
 -    final int termOrd;
 -    public TermInfoAndOrd(TermInfo ti, int termOrd) {
 +    final long termOrd;
 +    public TermInfoAndOrd(TermInfo ti, long termOrd) {
       super(ti);
       this.termOrd = termOrd;
     }
 @@ -245,7 +245,7 @@
             // wipe out the cache when they iterate over a large numbers
             // of terms in order
             if (tiOrd == null) {
 -              termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
 enumerator.position));
 +              termsCache.put(cacheKey, new TermInfoAndOrd(ti,
 enumerator.position));
             } else {
               assert sameTermInfo(ti, tiOrd, enumerator);
               assert (int) enumerator.position == tiOrd.termOrd;
 @@ -262,7 +262,7 @@
     // random-access: must seek
     final int indexPos;
     if (tiOrd != null) {
 -      indexPos = tiOrd.termOrd / totalIndexInterval;
 +      indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
     } else {
       // Must do binary search:
       indexPos = getIndexOffset(term);
 @@ -274,7 +274,7 @@
     if (enumerator.term() != null  term.compareTo(enumerator.term()) == 0) {
       ti = enumerator.termInfo();
       if (tiOrd == null) {
 -        termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
 enumerator.position));
 +        termsCache.put(cacheKey, new TermInfoAndOrd(ti, 
 enumerator.position));
       } else {
         assert sameTermInfo(ti, tiOrd, enumerator);
         assert (int) enumerator.position == tiOrd.termOrd;

 Mike

 http://blog.mikemccandless.com

 On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The query below results in an array out of bounds exception:
 select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

 Here is the exception:
  Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

 We are using a dev version of Solr/Lucene:

 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 
 16:00:54
 Lucene Specification Version: 3.1-SNAPSHOT
 Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

 Just before the exception we see this entry in our tomcat logs:

 Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
 INFO: UnInverted multi-valued field 
 {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
 Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute

 Is this a known bug?  Can anyone provide a clue as to how we can determine 
 what the problem is?

 Tom Burton-West


 

RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
Thanks Mike,

With the unpatched version, the first time I run the facet query on topicStr it 
works fine, but the second time I get the ArrayIndexOutOfBoundsException.   If 
I try different facets such as language, I don't see the same symptoms.  Maybe 
the number of facet values needs to exceed some number to trigger the bug?

I rebuilt lucene-core-3.1-SNAPSHOT.jar  with your patch and it fixes the 
problem. 


Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, April 11, 2011 1:00 PM
To: Burton-West, Tom
Cc: solr-user@lucene.apache.org
Subject: Re: ArrayIndexOutOfBoundsException with facet query

Right, it's the total number of terms across all fields... unfortunately.

This class is used to enroll a term into the terms cache that wraps
the terms dictionary, so in theory you could also hit this issue
during normal searching when a term is looked up once,  and then
looked up again (the 2nd time will pull from the cache).

I've mod'd Test2BTerms and am running it now...

Mike

http://blog.mikemccandless.com

On Mon, Apr 11, 2011 at 12:51 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Mike,

 At first I thought this couldn't be related to the 2.1 Billion terms issue 
 since the only place we have tons of terms is in the OCR field and this is 
 not the OCR field. But then I remembered that the total number of terms in 
 all fields is what matters. We've had no problems with regular searches 
 against the index or with other facet queries.  Only with this facet.   Is 
 TermInfoAndOrd only used for faceting?

 I'll go ahead and build the patch and let you know.


 Tom

 p.s. Here is the field definition:
 <field name="topicStr" type="string" indexed="true" stored="false"
        multiValued="true"/>
 <fieldType name="string" class="solr.StrField" sortMissingLast="true"
        omitNorms="true"/>


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Monday, April 11, 2011 8:40 AM
 To: solr-user@lucene.apache.org
 Cc: Burton-West, Tom
 Subject: Re: ArrayIndexOutOfBoundsException with facet query

 Tom,

 I think I see where this may be -- it looks like another  2B terms
 bug in Lucene (we are using an int instead of a long in the
 TermInfoAndOrd class inside TermInfosReader.java), only present in
 3.1.

 I'm also mad that Test2BTerms fails to catch this!!  I will go fix
 that test and confirm it sees this bug.

 Can you build from source?  If so, try this patch:

 Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
 ===
 --- lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
 (revision
 1089906)
 +++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
 (working copy)
 @@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
 -    final int termOrd;
 -    public TermInfoAndOrd(TermInfo ti, int termOrd) {
 +    final long termOrd;
 +    public TermInfoAndOrd(TermInfo ti, long termOrd) {
       super(ti);
       this.termOrd = termOrd;
     }
 @@ -245,7 +245,7 @@
             // wipe out the cache when they iterate over a large numbers
             // of terms in order
             if (tiOrd == null) {
 -              termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
 enumerator.position));
 +              termsCache.put(cacheKey, new TermInfoAndOrd(ti,
 enumerator.position));
             } else {
               assert sameTermInfo(ti, tiOrd, enumerator);
               assert (int) enumerator.position == tiOrd.termOrd;
 @@ -262,7 +262,7 @@
     // random-access: must seek
     final int indexPos;
     if (tiOrd != null) {
 -      indexPos = tiOrd.termOrd / totalIndexInterval;
 +      indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
     } else {
       // Must do binary search:
       indexPos = getIndexOffset(term);
 @@ -274,7 +274,7 @@
     if (enumerator.term() != null  term.compareTo(enumerator.term()) == 0) {
       ti = enumerator.termInfo();
       if (tiOrd == null) {
 -        termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
 enumerator.position));
 +        termsCache.put(cacheKey, new TermInfoAndOrd(ti, 
 enumerator.position));
       } else {
         assert sameTermInfo(ti, tiOrd, enumerator);
         assert (int) enumerator.position == tiOrd.termOrd;

 Mike

 http://blog.mikemccandless.com

 On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom tburt...@umich.edu wrote:
 The query below results in an array out of bounds exception:
 select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

 Here is the exception:
  Exception during facet.field of 
 topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
        at 
 org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

 We are using a dev version of Solr/Lucene:

 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 

Lucene Revolution 2011 - Early Bird Ends April 18

2011-04-11 Thread Michael Bohlig

A quick reminder that there's one week left on special pricing for Lucene 
Revolution 2011. Sign up this week and save some serious cash:

- Conference Registration, now $545, a savings of $180 over the $725 late 
registration price
- Training Package with 2-day Training plus Conference Registration now 
$1695, a savings of $200 over the 
  $1895 late registration package price (and even more savings over the a 
la carte pricing)

What can you expect at the conference?

- Keynote presentations from The Guardian News and Media’s Stephen Dunn and 
Redmonk’s Stephen O’Grady
- Session track talks on use cases, tutorials and technology strategy at 
leading edge, innovative
  companies, including: Travelocity, eBay, eHarmony, EMC, Etsy, Trulia, 
Intuit, Careerbuilder, AT&T, The
  Ladders and more
- Deep internals and implementation guidance at talks by Apache Solr/Lucene 
committers including Grant 
  Ingersoll, Yonik Seeley, Andrzej Bialecki, Uwe Schindler, Simon 
Willnauer, Erik Hatcher, Otis 
  Gospodnetic, and more.

You will also have an unmatched opportunity to network with over 400 of your 
peers from the open source search ecosystem, in all sectors of government, 
universities, start-ups, Fortune 1000 companies, and the developer and user 
community. 

Register at: http://us.ootoweb.com/luceneregistration

P.S. There are also a few free tickets left for the San Francisco Giants vs. 
Florida Marlins game on May 24!


Michael Bohlig | Lucid Imagination 
Enterprise Marketing 
p +1 650 353 4057 x132 
m+1 650 703 8383 
www.lucidimagination.com 





Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread karsten-solr
Hi Lance,

I used XPathEntityProcessor with the xsl attribute and generated an XML file in the
form of the standard Solr update schema.
I lost a lot of performance; it is a pity that XPathEntityProcessor only uses one
thread.

My tests with a collection of 350T documents:
1. XPathRecordReader without xslt: 28min
2. XPathEntityProcessor with xslt (standard solr-war / Xalan): 44min
3. XPathEntityProcessor with saxon-xslt: 36min


Best regards
  Karsten
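
(For reference, the xsl route boils down to an entity definition along these lines;
the xsl and useSolrAddSchema attributes are documented on the DataImportHandler wiki
page quoted below, and the file paths here are only placeholders:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="docs"
            processor="XPathEntityProcessor"
            url="/data/input/docs.xml"
            useSolrAddSchema="true"
            xsl="xslt/to-solr-add.xsl"/>
  </document>
</dataConfig>
)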



 Lance 
 There is an option somewhere to use the full XML DOM implementation
 for using xpaths. The purpose of the XPathEP is to be as simple and
 dumb as possible and handle most cases: RSS feeds and other open
 standards.
 
 Search for xsl(optional)
 
 http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
 
--karsten
  Hi Folks,
 
  does anyone improve DIH XPathRecordReader to deal with nested xpaths?
  e.g.
  data-config.xml with
   <entity .. processor="XPathEntityProcessor" ..>
    <field column="title" xpath="//body/h1"/>
    <field column="alltext" xpath="//body" flatten="true"/>
  and the XML stream contains
   /html/body/h1...
  will only fill field “alltext” but field “title” will be empty.
 
  This is a known issue from 2009
 
 https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
 
  So three questions:
  1. How to fill a “search over all”-Field without nested xpaths?
    (schema.xml <copyField source="*" dest="alltext"/> will not help,
 because we lose the original token order)
  2. Does anyone try to improve XPathRecordReader to deal with nested
 xpaths?
  3. Does anyone else need this feature?
 
 
  Best regards
   Karsten
 

http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html


Will Slaves Pileup Replication Requests?

2011-04-11 Thread Parker Johnson

What is the slave replication behavior if a replication request to pull
indexes takes longer than the replication interval itself?

In other words, if my replication interval is set to every 30 seconds,
and my indexes are significantly large enough to take longer than 30
seconds to transfer, is the slave smart enough to not send another
replication request if one is already in progress?


-Parker




Re: Will Slaves Pileup Replication Requests?

2011-04-11 Thread Green, Larry (CMG - Digital)
Yes. It will wait whatever the replication interval is after the most recent 
replication completes before attempting again.
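
(The interval in question is the pollInterval on the slave's ReplicationHandler
config, roughly as follows; the master URL is a placeholder and the format is
HH:mm:ss:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:30</str>
  </lst>
</requestHandler>
)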

On Apr 11, 2011, at 2:42 PM, Parker Johnson wrote:

 
 What is the slave replication behavior if a replication request to pull
 indexes takes longer than the replication interval itself?
 
 In other words, if my replication interval is set to every 30 seconds,
 and my indexes are significantly large enough to take longer than 30
 seconds to transfer, is the slave smart enough to not send another
 replication request if one is already in progress?
 
 
 -Parker
 
 



Re: Exact match on a field with stemming

2011-04-11 Thread Otis Gospodnetic
Hi,

Using quotes means "use this as a phrase", not "use this as a literal". :)
I think copying to unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com
 To: solr-user@lucene.apache.org
 Sent: Mon, April 11, 2011 2:55:04 PM
 Subject: Exact match on a field with stemming
 
 Hi all,
 
 Is there a way to perform an exact match query on a field that  has stemming 
enable by using the standard /select handler?
 
 I thought  that putting word inside double-quotes would enable this behaviour 
but if I  query my field with a single word like “manager”
 I am receiving results  containing the word “management”
 
 I know I can use a CopyField with  different types but that would double the 
size of my index… Is there an  alternative?
 
 Thanks



Re: Will Slaves Pileup Replication Requests?

2011-04-11 Thread Parker Johnson

Thanks Larry.

-Parker

On 4/11/11 12:14 PM, Green, Larry (CMG - Digital)
larry.gr...@cmgdigital.com wrote:

Yes. It will wait whatever the replication interval is after the most
recent replication completes before attempting again.

On Apr 11, 2011, at 2:42 PM, Parker Johnson wrote:

 
 What is the slave replication behavior if a replication request to pull
 indexes takes longer than the replication interval itself?
 
  In other words, if my replication interval is set to every 30 seconds,
 and my indexes are significantly large enough to take longer than 30
 seconds to transfer, is the slave smart enough to not send another
 replication request if one is already in progress?
 
 
 -Parker
 
 






Question on Dismax plugin

2011-04-11 Thread Nemani, Raj
All,

I have a question on the Dismax plugin for the search handler.  I have
two test instances of Solr.  In one I am using the default search
handler.  In this case, the fields that I am working with (slug and
story) are indexed via the all_text field and the searches are done on
the all_text field.

For the other one I have configured a search handler using the dismax
plugin as shown below.

 

<requestHandler name="mydismax" class="solr.SearchHandler">

  <lst name="defaults">

    <str name="defType">dismax</str>

    <str name="echoParams">explicit</str>

    <float name="tie">0.01</float>

    <str name="qf">
      story^3.0 slug^0.2
    </str>

    <int name="ps">100</int>

    <str name="q.alt">*:*</str>

  </lst>

</requestHandler>

 

To make testing easier, I only have 4 (same) documents in both indexes
with the word Obama appearing inside as described below.

 

File 1:: The word Obama appears zero times in slug field and four
times in story field

File 2:: The word Obama appears zero times in slug field and thrice in
story field

File 3:: The word Obama appears zero times in slug field and two times
in story field

File 4:: The word Obama appears One time in slug field and one time in
story field

 

 

Here is the order of the documents in the order of decreasing scores
from the search results

 

Dismax Search Handler (steadily decreasing scores):

* File 1:: The word Obama appears zero times in slug field and
four times in story field

* File 4:: The word Obama appears One time in slug field and
one time in story field

* File 2:: The word Obama appears zero times in slug field and
thrice in story field

* File 3:: The word Obama appears zero times in slug field and
two times in story field

 

Standard Search handler:

* File 1:: The word Obama appears zero times in slug field and
four times in story field

* File 2:: The word Obama appears zero times in slug field and
thrice in story field (same score as File 4 score below)

* File 4:: The word Obama appears One time in slug field and
one time in story field (same score as File 2 score above)

* File 3:: The word Obama appears zero times in slug field and
two times in story field

 

 

My question, why is dismax showing File 4:: The word Obama appears One
time in slug field and one time in story field 

ahead of 

File 2:: The word Obama appears zero times in slug field and thrice
in story field given that I have boosted these fields as shown below.


 

<str name="qf">
  story^3.0 slug^0.2
</str>

 

I would have thought that the File 4:: The word Obama appears One time
in slug field and one time in story field would have gone all the
way down in the result list.

 

Any help is appreciated

Thanks much in advance

Raj

 

 

 

 

 

 

 

 



Re: Question on Dismax plugin

2011-04-11 Thread Otis Gospodnetic
Hi Raj,

I'm guessing your slug field is much shorter and thus a match in that field has 
more weight than a match in a much longer story field.  If you omit norms for 
those fields in the schema (and reindex), I believe you will see File 4 drop to 
position #4.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
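
(Concretely, omitting norms means declaring the fields with omitNorms="true" in
schema.xml and reindexing; a sketch only, with the field types assumed:

<field name="story" type="text" indexed="true" stored="true" omitNorms="true"/>
<field name="slug"  type="text" indexed="true" stored="true" omitNorms="true"/>
)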



- Original Message 
 From: Nemani, Raj raj.nem...@turner.com
 To: solr-user@lucene.apache.org
 Sent: Mon, April 11, 2011 4:12:52 PM
 Subject: Question on Dismax plugin
 
 All,
 
 I have a question on the Dismax plugin for the search handler.   I have
 two test instances of Solr.  In one I am using the default  search
 handler.  In this case, the fields that I am working with (slug  and
 story) are indexed via the all_text filed and the searches are done  on
 the all_text field.
 
 For the other one I have configured a search  handler using the dismax
 plugin as shown below.
 
 
 
 requestHandler name=mydismax class=solr.SearchHandler  
 
 lst name=defaults
 
   str name=defTypedismax/str
 
  str  name=echoParamsexplicit/str
 
  float  name=tie0.01/float
 
  str  name=qf
 
 story^3.0  slug^0.2
 
  /str
 
  int  name=ps100/int
 
  str  name=q.alt*:*/str
 
  /lst
 
/requestHandler
 
 
 
 To make testing easier, I only have 4  (same) documents in both indexes
 with the word Obama appearing inside as  described below.
 
 
 
 File 1:: The word Obama appears zero times in  slug field and four
 times in story field
 
 File 2:: The word Obama  appears zero times in slug field and thrice in
 story field
 
 File  3:: The word Obama appears zero times in slug field and two times
 in  story field
 
 File 4:: The word Obama appears One time in slug field  and one time in
 story field
 
 
 
 
 
 Here is the order of  the documents in the order of decreasing scores
 from the search  results
 
 
 
 Dismax Search Handler (steadily decreasing  scores):
 
 * File 1:: The word Obama appears  zero times in slug field and
 four times in story field
 
 *  File 4:: The word Obama appears One time in slug field  and
 one time in story field
 
 * File 2::  The word Obama appears zero times in slug field and
 thrice in story  field
 
 * File 3:: The word Obama appears zero  times in slug field and
 two times in story field
 
 
 
 Standard  Search handler:
 
 * File 1:: The word Obama  appears zero times in slug field and
 four times in story  field
 
 * File 2:: The word Obama appears zero  times in slug field and
 thrice in story field (same score as File 4 score  below)
 
 * File 4:: The word Obama appears One  time in slug field and
 one time in story field (same score as File 2  score above)
 
 * File 3:: The word Obama  appears zero times in slug field and
 two times in story field
 
 
 
 
 
 My question, why is dismax showing File 4:: The word Obama  appears One
 time in slug field and one time in story field 
 
 ahead  of 
 
 File 2:: The word Obama appears zero times in slug field and  thrice
 in story field given that I have boosted these fields as shown  below.
 
 
 
 
 str name=qf
 
  story^3.0 slug^0.2
 
 /str
 
 
 
 I  would have thought that the File 4:: The word Obama appears One time
 in  slug field and one time in story field would have gone all the
 way done  in the result list.
 
 
 
 Any help is appreciated
 
 Thanks much  in advance
 
 Raj
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 


Re: Mongo REST interface and full data import

2011-04-11 Thread andrew_s
Thank you guys for your answers.
I didn't realise it would be so easy to do, and the example from
http://wiki.apache.org/solr/UpdateJSON#Example works perfectly for me.

Regards,
Andrew
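
(For later readers: the wiki example referred to boils down to posting an array of
JSON documents to the JSON update handler, roughly as follows; the id and title
values are made up.

curl 'http://localhost:8983/solr/update/json?commit=true' \
     -H 'Content-type:application/json' \
     --data-binary '[{"id":"1","title":"Example doc"}]'
)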

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2808507.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MoreLikeThis match

2011-04-11 Thread Brian Lamb
Does anyone have any thoughts on this one?

On Fri, Apr 8, 2011 at 9:26 AM, Brian Lamb brian.l...@journalexperts.comwrote:

 I've looked at both wiki pages and none really clarify the difference
 between these two. If I copy and paste an existing index value for field and
 do an mlt search, it shows up under match but not results. What is the
 difference between these two?


 On Thu, Apr 7, 2011 at 2:24 PM, Brian Lamb 
 brian.l...@journalexperts.comwrote:

 Actually, what is the difference between match and response? It seems
 that match always returns one result but I've thrown a few cases at it where
 the score of the highest response is higher than the score of match. And
 then there are cases where the match score dwarfs the highest response
 score.


 On Thu, Apr 7, 2011 at 1:30 PM, Brian Lamb brian.l...@journalexperts.com
  wrote:

 Hi all,

 I've been using MoreLikeThis for a while through select:

 http://localhost:8983/solr/select/?q=field:more like
 this&mlt=true&mlt.fl=field&rows=100&fl=*,score

 I was looking over the wiki page today and saw that you can also do this:

 http://localhost:8983/solr/mlt/?q=field:more like
 this&mlt=true&mlt.fl=field&rows=100

 which seems to run faster and do a better job overall. When the results
 are returned, they are formatted like this:

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">1</int>
   </lst>
   <result name="match" numFound="24" start="0" maxScore="3.0438285">
     <doc>
       <float name="score">3.0438285</float>
       <str name="id">5</str>
     </doc>
   </result>
   <result name="response" numFound="4077" start="0" maxScore="0.12775186">
     <doc>
       <float name="score">0.1125823</float>
       <str name="id">3</str>
     </doc>
     <doc>
       <float name="score">0.10231556</float>
       <str name="id">8</str>
     </doc>
     ...
   </result>
 </response>

 It seems that it always returns just 1 response under match and response
 is set by the rows parameter. How can I get more than one result under
 match?

 What I'm trying to do here is whatever is set for field:, I would like to
 return the top 100 records that match that search based on more like this.

 Thanks,

 Brian Lamb






Too many open files exception related to solrj getServer too often?

2011-04-11 Thread cyang2010
Hi,

I get this solrj error in development environment.

org.apache.solr.client.solrj.SolrServerException: java.net.SocketException:
Too many open files

At the time there was no reindexing or any write to the index. There were
only different queries generated using solrj to hit the Solr server:

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
server.setSoTimeout(1000); // socket read timeout
server.setConnectionTimeout(1000);
server.setDefaultMaxConnectionsPerHost(100);
server.setMaxTotalConnections(100);
...
QueryResponse rsp = server.query(solrQuery);

I did NOT share a reference to the solrj CommonsHttpSolrServer among requests.
So every HTTP request obtains its own solrj server instance and runs the query
on it.

The question is:

1. Should solrj client share one instance of CommonHttpSolrServer?   Why? 
Is every CommonHttpSolrServer matched to one solr/lucene reader?  But from
the source code, it just shows it related to one apache http client.

2. Is TooManyOpenFiles exeption related to my possible wrong usage of
CommonHttpSolrServer?

3. server.query(solrQuery) throws SolrServerException.  How can concurrent
solr queries trigger the "Too many open files" exception?


Look forward to your input.  Thanks,



cy
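
(A note for later readers: the usual recommendation is yes to question 1.
CommonsHttpSolrServer is generally treated as thread-safe and wraps a pooled
HttpClient, so creating a new one per request can leak sockets and eventually
exhaust file descriptors, which shows up as "Too many open files". A minimal
sketch of sharing one instance, assuming solrj 1.4-style APIs and a placeholder
URL:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrClientHolder {
    // One shared instance per Solr URL for the life of the application.
    private static final CommonsHttpSolrServer SERVER;

    static {
        try {
            SERVER = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SERVER.setSoTimeout(1000);
            SERVER.setConnectionTimeout(1000);
            SERVER.setDefaultMaxConnectionsPerHost(100);
            SERVER.setMaxTotalConnections(100);
        } catch (java.net.MalformedURLException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static QueryResponse query(SolrQuery q) throws Exception {
        return SERVER.query(q);   // every request reuses the same instance
    }
}
)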

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Too-many-open-files-exception-related-to-solrj-getServer-too-often-tp2808718p2808718.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Exact match on a field with stemming

2011-04-11 Thread Jean-Sebastien Vachon
I'm curious to know why Solr is not respecting the phrase.
If it considers "manager" as a phrase... shouldn't it return only documents
containing that phrase?

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quoted means use this as a phrase, not use this as a literal. :) I 
think copying to unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem 
search :: http://search-lucene.com/



- Original Message 
 From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com
 To: solr-user@lucene.apache.org
 Sent: Mon, April 11, 2011 2:55:04 PM
 Subject: Exact match on a field with stemming
 
 Hi all,
 
 Is there a way to perform an exact match query on a field that  has 
stemming enable by using the standard /select handler?
 
 I thought  that putting word inside double-quotes would enable this 
behaviour but if I  query my field with a single word like “manager”
 I am receiving results  containing the word “management”
 
 I know I can use a CopyField with  different types but that would 
double the size of my index… Is there an  alternative?
 
 Thanks
 



FW: Exact match on a field with stemming

2011-04-11 Thread Jonathan Rochkind

 I'm curious to know why Solr is not respecting the phrase.
 If it consider manager as a phrase... shouldn't it return only document 
 containing that phrase?

A phrase means to solr (or rather to the lucene and dismax query parsers, which 
are what understand double-quoted phrases)  these tokens in exactly this order

So a phrase of one token manager, is exactly the same as if you didn't use 
the double quotes. It's only one token, so all the tokens in this phrase in 
exactly the order specified is, well, just the same as one token without 
phrase quotes. 

If you've set up a stemmed field at indexing time, then manager and 
management are stemmed IN THE INDEX, probably to something like manag.  
There is no longer any information in the index (at least in that field) on 
what the original literal was, it's been stemmed in the index.  So there's no 
way possible for it to only match certain un-stemmed versions -- at least using 
that field. And when you enter either 'manager' or 'management' at query time, 
it is analyzed and stemmed to match that stemmed something-like manag in the 
index either way. If it didn't analyze and stem at query time, then instead the 
query would just match NOTHING, because neither 'manager' nor 'management' are 
in the index at all, only the stemmed versions. 

So, yes, double quotes are interpreted as a phrase, and only documents 
containing that phrase are returned, you got it. 


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quoted means use this as a phrase, not use this as a literal. :) I 
think copying to unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem 
search :: http://search-lucene.com/



- Original Message 
 From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com
 To: solr-user@lucene.apache.org
 Sent: Mon, April 11, 2011 2:55:04 PM
 Subject: Exact match on a field with stemming

 Hi all,

 Is there a way to perform an exact match query on a field that  has
stemming enable by using the standard /select handler?

 I thought  that putting word inside double-quotes would enable this
behaviour but if I  query my field with a single word like “manager”
 I am receiving results  containing the word “management”

 I know I can use a CopyField with  different types but that would
double the size of my index… Is there an  alternative?

 Thanks




Re: when to change rows param?

2011-04-11 Thread Chris Hostetter

Paul: can you elaborate a little bit on what exactly your problem is?

 - what is the full component list you are using?
 - how are you changing the param value (ie: what does the code look like)
 - what isn't working the way you expect?

: I've been using my own QueryComponent (that extends the search one) 
: successfully to rewrite web-received parameters that are sent from the 
: (ExtJS-based) javascript client. This allows an amount of 
: query-rewriting, that's good. I tried to change the rows parameter there 
: (which is limit in the query, as per the underpinnings of ExtJS) but 
: it seems that this is not enough.
: 
: Which component should I subclass to change the rows parameter?

-Hoss
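
(As a point of reference while those details get sorted out: the common way to
rewrite a parameter such as rows inside a component is to wrap the request params
in a ModifiableSolrParams, set the new value, and put them back on the request in
prepare() before calling super. A rough sketch, with the client-side parameter
name "limit" assumed:

import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class RewritingQueryComponent extends QueryComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
    String limit = params.get("limit");        // name sent by the ExtJS client
    if (limit != null) {
      params.set(CommonParams.ROWS, limit);    // what the rest of the chain reads
    }
    rb.req.setParams(params);                  // replace params before defaults apply
    super.prepare(rb);
  }
}
)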


Re: Deduplication questions

2011-04-11 Thread Chris Hostetter

: Q1. Is is possible to pass *analyzed* content to the
: 
: public abstract class Signature {

No, analysis happens as the documents are being written to the lucene 
index, well after the UpdateProcessors have had a chance to interact with 
the values.

: Q2. Method calculate() is using concatenated fields from <str
: name="fields">name,features,cat</str>
: Is there any mechanism I could build  field dependant signatures?

At the moment the Signature API is fairly minimal, but it could definitely 
be improved by adding more methods (that have sensible defaults in the 
base class) that would give the impl more control over the resulting 
signature ... we just need people to propose good suggestions with example 
use cases.

: Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but
: would work)

I don't know whether what you describe is really intentional or not, but it
should work.


-Hoss
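
(For what it's worth, the two-chained-processors idea from Q2/Q3 would look roughly
like this in solrconfig.xml; the signature field names here are illustrative, one
per group of source fields:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig_name</str>
    <str name="fields">name</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <bool name="overwriteDupes">false</bool>
  </processor>
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig_features_cat</str>
    <str name="fields">features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <bool name="overwriteDupes">false</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
)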


Re: XML not coming through from nabble to Gmail

2011-04-11 Thread Michael Sokolov
I see the same problem (missing markup) in Thunderbird. Seems like 
Nabble might be the culprit?


-Mike

On 4/11/2011 8:13 AM, Erick Erickson wrote:

All:

Lately I've been seeing a lot of posts where people paste in parts of their
schema.xml or solrconfig.xml and the results are...er...disappointing. None
of the less-than or greater-than symbols show and the formatting is all over
the map.

Since some mails would come through with the XML formatted and some would be
wonky, at first I thought it was the sender, but then a pretty high
percentage came through this way. So I poked around and it seems to only be
the case that the XML is wonkified (tm) when it comes to Gmail from
nabble, the original post on nabble has the markup and displays fine.
Behavior is the same in Chrome and Firefox BTW.

Does anyone have any insight into this? Time to complain to the nabble
folks? Do others see this with non-Gmail relays?

Thanks,
Erick





Re: Solr 1.4.1 compatible with Lucene 3.0.1?

2011-04-11 Thread Otis Gospodnetic
Hi,

I only read the short story. :)
Note that you should post questions like this on solr-user@lucene list, which 
is 
where I'm replying now.

Since you are just starting with Solr, why not grab the recently released 3.1?  
That way you'll get the latest Lucene and the latest Solr.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: RichSimon richard_si...@hms.harvard.edu
 To: solr-...@lucene.apache.org
 Sent: Mon, April 11, 2011 10:36:46 AM
 Subject: Solr 1.4.1 compatible with Lucene 3.0.1?
 
 
 Short story: I am using Lucene 3.0.1, and I'm trying to run Solr 1.4.1.  I
 get an error starting the embedded Solr server that says it cannot find  the
 method FSDirectory.getDirectory. The release notes for Solr 1.4.1 says it  is
 compatible with Lucene 2.9.3, and I see that Lucene 3.0.1 does not have  the
 FSDirectory.getDirectory method any more. Downgrading Lucene to 2.9.x is
 not an option for me. What version of Solr should I use for Lucene  3.0.1?
 (We're just starting with Solr, so changing that version is not hard.)  Or,
 do I have to upgrade both Solr and  Lucene?
 
 Thanks,
 
 -Rich
 
 Here's the long story:
 I am using  Lucene 3.0.1, and I'm trying to run Solr 1.4.1. I have not used
 any other  version of Lucene. We have an existing project using Lucene 3.0.1,
 and we  want to start using Solr. When I try to initialize an embedded Solr
 server,  like so:
 
 
  String solrHome =  PATH_TO_SOLR_HOME;
 File  home = new File(solrHome);
  File solrXML = new File(home, solr.xml);
  
  coreContainer = new CoreContainer();
  coreContainer.load(solrHome, solrXML);

  embeddedSolr = new EmbeddedSolrServer(coreContainer,  SOLR_CORE);
 
 
 
  [04-08 11:48:39] ERROR CoreContainer [main]: java.lang.NoSuchMethodError:
  org.apache.lucene.store.FSDirectory.getDirectory(Ljava/lang/String;)Lorg/apache/lucene/store/FSDirectory;
      at org.apache.solr.spelling.AbstractLuceneSpellChecker.initIndex(AbstractLuceneSpellChecker.java:186)
      at org.apache.solr.spelling.AbstractLuceneSpellChecker.init(AbstractLuceneSpellChecker.java:101)
      at org.apache.solr.spelling.IndexBasedSpellChecker.init(IndexBasedSpellChecker.java:56)
      at org.apache.solr.handler.component.SpellCheckComponent.inform(SpellCheckComponent.java:274)
      at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:508)
      at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
      at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428)
      at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
 
 
 Looking at Google posts about this, it seemed that this can be caused by  a
 version mismatch between the Lucene version in use and the one Solr tries  to
 use. I noticed a Lucene version tag in the example solrconfig.xml that  I’m
 modifying:
 
    <luceneMatchVersion>LUCENE_40</luceneMatchVersion>
  
  I tried changing it to LUCENE_301, changing it to LUCENE_30, and commenting it
  out, but I still get the same error. Using
 LucenePackage.get().getImplementationVersion() shows this as the  Lucene
 version:
   
 Lucene version: 3.0.1 912433 -  2010-02-21 23:51:22
 
 I also printed my classpath and found the following  lucene  jars:
 lucene-analyzers-3.0.1.jar
 lucene-core-3.0.1.jar
 lucene-highlighter-3.0.1.jar
 lucene-memory-3.0.1.jar
 lucene-misc-2.9.3.jar
 lucene-queries-2.9.3.jar
 lucene-snowball-2.9.3.jar
 lucene-spellchecker-2.9.3.jar
 
 The  FSDirectory class is in lucene-core. I decompiled the class file in the
 jar,  and did not see a getDirectory method. Also, I used a ClassLoader
 statement  to get an instance of the FSDirectory class my code is using, and
 printed out  the methods; no getDirectory method.
 
 I gather from the Lucene Javadoc  that the getDirectory method is in
 FSDirectory for 2.4.0 and for 2.9.0, but  is gone in 3.0.1 (the version I'm
 using). 
 
 Is Lucene 3.0.1 completely  incompatible with Solr 1.4.1? Is there some way
 to use the luceneMatchVersion  tag to tell Solr what version I want to use?
 
 
 --
 View this message  in context: 
http://lucene.472066.n3.nabble.com/Solr-1-4-1-compatible-with-Lucene-3-0-1-tp2806828p2806828.html

 Sent  from the Solr - Dev mailing list archive at  Nabble.com.
 
 -
 To  unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For  additional commands, e-mail: dev-h...@lucene.apache.org
 



Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-11 Thread Chris Hostetter

: I have a core with 120+ segment files and I tried partial optimize specify
: maxNumSegments=10, after the optimize the segment files reduced to 64 files;

a) the option you want to specify is maxSegments .. not maxNumSegments

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22optimize.22

b) i can't reproduce this ... i just created an index with 200 segments 
and when i hit the example url from the wiki...

curl 
'http://localhost:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false'

...my index was correctly optimized down to 10 segments.

is it possible that you just didn't wait long enough and you were 
observing the number of segments while the optimize was still taking 
place?


-Hoss


Re: XML not coming through from nabble to Gmail

2011-04-11 Thread Chris Hostetter

: I see the same problem (missing markup) in Thunderbird. Seems like Nabble
: might be the culprit?

if someone can cite some specific examples (by email message-id, or 
subject, or date+sender, or url from nabble, or url from any public 
archive, or anything more specific than posts from nabble containing 
xml) we can check the official apache mail archive which contains the 
raw message as received by ezmlm, such as...

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201104.mbox/raw/%3cbanlktimcpthzalstrwhn3rtzpxdzkbo...@mail.gmail.com%3E



-Hoss


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Joey Hanzel
Awesome. Thanks Jayendra.  I hadn't caught these patches yet.

I applied SOLR-2416 patch to the solr-3.1 release tag. This resolved the
problem of archive files not being unpacked and indexed with Solr CELL.
Thanks for the FYI.
https://issues.apache.org/jira/browse/SOLR-2416

On Mon, Apr 11, 2011 at 12:02 AM, Jayendra Patil 
jayendra.patil@gmail.com wrote:

 The migration of Tika to the latest 0.8 version seems to have
 reintroduced the issue.

 I was able to get this working again with the following patches. (Solr
 Cell and Data Import handler)

 https://issues.apache.org/jira/browse/SOLR-2416
 https://issues.apache.org/jira/browse/SOLR-2332

 You can try these.

 Regards,
 Jayendra

 On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com
 wrote:
  Hi Gary,
 
  I have been experiencing the same problem... Unable to extract content
 from
  archive file formats.  I just tried again with a clean install of Solr
 3.1.0
  (using Tika 0.8) and continue to experience the same results.  Did you
 have
  any success with this problem with Solr 1.4.1 or 3.1.0 ?
 
  I'm using this curl command to send data to Solr.
  curl 
 
 http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true
 
  -H application/octet-stream -F  myfile=@data.zip
 
  No problem extracting single rich text documents, but archive files only
  result in the file names within the archive being indexed. Am I missing
  something else in my configuration? Solr doesn't seem to be unpacking the
  archive files. Based on the email chain associated with your first
 message,
  some people have been able to get this functionality to work as desired.
 
  On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor g...@inovem.com wrote:
 
  Can anyone shed any light on this, and whether it could be a config
 issue?
   I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
 
  When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt)
 to
  the ExtractingRequestHandler, I get the following log entry (formatted
 for
  ease of reading) :
 
  SolrInputDocument[
 {
 ignored_meta=ignored_meta(1.0)={
 [stream_source_info, file, stream_content_type,
  application/octet-stream, stream_size, 260, stream_name, solr1.zip,
  Content-Type, application/zip]
 },
 ignored_=ignored_(1.0)={
 [package-entry, package-entry]
 },
 ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
 
 
  
 ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
 
 ignored_stream_size=ignored_stream_size(1.0)={260},
 ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
 ignored_content_type=ignored_content_type(1.0)={application/zip},
 docid=docid(1.0)={74},
 type=type(1.0)={5},
 text=text(1.0)={  doc2.txtdoc1.txt}
 }
  ]
 
  So, the data coming back from Tika when parsing a ZIP file does not
 include
  the file contents, only the names of the files contained therein.  I've
  tried forcing stream.type=application/zip in the CURL string, but that
 makes
  no difference.  If I specify an invalid stream.type then I get an
 exception
  response, so I know it's being used.
 
  When I send one of those txt files individually to the
  ExtractingRequestHandler, I get:
 
  SolrInputDocument[
 {
 ignored_meta=ignored_meta(1.0)={
 [stream_source_info, file, stream_content_type, text/plain,
  stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
 },
 ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
 
 
  ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
 ignored_stream_size=ignored_stream_size(1.0)={30},
 ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
 ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
 docid=docid(1.0)={74},
 type=type(1.0)={5},
 text=text(1.0)={The quick brown fox  }
 }
  ]
 
  and we see the file contents in the text field.
 
  I'm using the following requestHandler definition in solrconfig.xml:
 
   <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
   <requestHandler name="/update/extract"
       class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
       startup="lazy">
     <lst name="defaults">
       <!-- All the main content goes into "text"... if you need to return
            the extracted text or do highlighting, use a stored field. -->
       <str name="fmap.content">text</str>
       <str name="lowernames">true</str>
       <str name="uprefix">ignored_</str>

       <!-- capture link hrefs but ignore div attributes -->
       <str name="captureAttr">true</str>
       <str name="fmap.a">links</str>
       <str name="fmap.div">ignored_</str>
     </lst>
   </requestHandler>
 
  Is there any further debug or diagnostic I can get out of Tika to help
 me
  work out why it's only returning the file names and not the file
 contents
  when parsing a ZIP file?
 
 
  Thanks and kind regards,
  

RE: Exact match on a field with stemming

2011-04-11 Thread Jean-Sebastien Vachon
Thanks for the clarification. This make sense.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: April-11-11 7:54 PM
To: solr-user@lucene.apache.org
Subject: FW: Exact match on a field with stemming


 I'm curious to know why Solr is not respecting the phrase.
 If it consider manager as a phrase... shouldn't it return only document
containing that phrase?

A phrase means to solr (or rather to the lucene and dismax query parsers,
which are what understand double-quoted phrases)  these tokens in exactly
this order

So a phrase of one token manager, is exactly the same as if you didn't use
the double quotes. It's only one token, so all the tokens in this phrase in
exactly the order specified is, well, just the same as one token without
phrase quotes. 

If you've set up a stemmed field at indexing time, then manager and
management are stemmed IN THE INDEX, probably to something like manag.
There is no longer any information in the index (at least in that field) on
what the original literal was, it's been stemmed in the index.  So there's
no way possible for it to only match certain un-stemmed versions -- at least
using that field. And when you enter either 'manager' or 'management' at
query time, it is analyzed and stemmed to match that stemmed something-like
manag in the index either way. If it didn't analyze and stem at query
time, then instead the query would just match NOTHING, because neither
'manager' nor 'management' are in the index at all, only the stemmed
versions. 

So, yes, double quotes are interpreted as a phrase, and only documents
containing that phrase are returned, you got it. 


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quoted means use this as a phrase, not use this as a literal. :) I
think copying to unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem
search :: http://search-lucene.com/



- Original Message 
 From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com
 To: solr-user@lucene.apache.org
 Sent: Mon, April 11, 2011 2:55:04 PM
 Subject: Exact match on a field with stemming

 Hi all,

 Is there a way to perform an exact match query on a field that  has 
stemming enable by using the standard /select handler?

 I thought  that putting word inside double-quotes would enable this 
behaviour but if I  query my field with a single word like manager
 I am receiving results  containing the word management

 I know I can use a CopyField with  different types but that would 
double the size of my index. Is there an  alternative?

 Thanks


=



Re: MoreLikeThis match

2011-04-11 Thread Mike Mattozzi
Match is the document that's the top result of the query (q param)
that you specify.

Response is the list of documents that are similar to the 'match' document.

-Mike
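
(Side note: if the underlying goal is the top N documents similar to an arbitrary
block of text rather than to one already-indexed document, the MLT handler also
accepts the text as a content stream, in which case there is no single "match"
document at all. Roughly:

http://localhost:8983/solr/mlt?mlt.fl=field&rows=100&fl=*,score&stream.body=whatever+text+to+match
)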

On Mon, Apr 11, 2011 at 4:55 PM, Brian Lamb
brian.l...@journalexperts.com wrote:
 Does anyone have any thoughts on this one?

 On Fri, Apr 8, 2011 at 9:26 AM, Brian Lamb 
 brian.l...@journalexperts.comwrote:

 I've looked at both wiki pages and none really clarify the difference
 between these two. If I copy and paste an existing index value for field and
 do an mlt search, it shows up under match but not results. What is the
 difference between these two?


 On Thu, Apr 7, 2011 at 2:24 PM, Brian Lamb 
 brian.l...@journalexperts.comwrote:

 Actually, what is the difference between match and response? It seems
 that match always returns one result but I've thrown a few cases at it where
 the score of the highest response is higher than the score of match. And
 then there are cases where the match score dwarfs the highest response
 score.


 On Thu, Apr 7, 2011 at 1:30 PM, Brian Lamb brian.l...@journalexperts.com
  wrote:

 Hi all,

 I've been using MoreLikeThis for a while through select:

 http://localhost:8983/solr/select/?q=field:more like
 thismlt=truemlt.fl=fieldrows=100fl=*,score

 I was looking over the wiki page today and saw that you can also do this:

 http://localhost:8983/solr/mlt/?q=field:more like
 thismlt=truemlt.fl=fieldrows=100

 which seems to run faster and do a better job overall. When the results
 are returned, they are formatted like this:

 response
   lst name=responseHeader
     int name=status0/int
     int name=QTime1/int
   /lst
   result name=match numFound=24 start=0 maxScore=3.0438285
     doc
       float name=score3.0438285/float
       str name=id5/str
     /doc
   /result
   result name=response numFound=4077 start=0
 maxScore=0.12775186
     doc
       float name=score0.1125823/float
       str name=id3/str
     /doc
     doc
       float name=score0.10231556/float
       str name=id8/str
     /doc
  ...
   /result
 /response

 It seems that it always returns just 1 response under match and response
 is set by the rows parameter. How can I get more than one result under
 match?

 What I'm trying to do here is whatever is set for field:, I would like to
 return the top 100 records that match that search based on more like this.

 Thanks,

 Brian Lamb







Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread Lance Norskog
The DIH has multi-threading. You can have one thread fetching files
and then give them to different threads.

On Mon, Apr 11, 2011 at 11:40 AM,  karsten-s...@gmx.de wrote:
 Hi Lance,

 I used XPathEntityProcessor with attribut xsl and generate a xml-File in 
 the form of the standard Solr update schema.
 I lost a lot of performance, it is a pity that XPathEntityProcessor does only 
 use one thread.

 My tests with a collection of 350T Document:
 1. use of XPathRecordReader without xslt:  28min
 2. use of XPathEntityProcessor with xslt (Standard solr-war / Xalan): 44min
 2. use of XPathEntityProcessor with saxon-xslt: 36min


 Best regards
  Karsten



  Lance
 There is an option somewhere to use the full XML DOM implementation
 for using xpaths. The purpose of the XPathEP is to be as simple and
 dumb as possible and handle most cases: RSS feeds and other open
 standards.

 Search for xsl(optional)

 http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

 --karsten
  Hi Folks,
 
  does anyone improve DIH XPathRecordReader to deal with nested xpaths?
  e.g.
  data-config.xml with
   entity .. processor=XPathEntityProcessor ..
   field column=title xpath=//body/h1/
   field column=alltext” xpath=//body flatten=true/
  and the XML stream contains
   /html/body/h1...
  will only fill field “alltext” but field “title” will be empty.
 
  This is a known issue from 2009
 
 https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
 
  So three questions:
  1. How to fill a “search over all”-Field without nested xpaths?
    (schema.xml  copyField source=* dest=alltext/ will not help,
 because we lose the original token order)
  2. Does anyone try to improve XPathRecordReader to deal with nested
 xpaths?
  3. Does anyone else need this feature?
 
 
  Best regards
   Karsten
 

 http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html




-- 
Lance Norskog
goks...@gmail.com


Re: Solr under Tomcat

2011-04-11 Thread Lance Norskog
Hi Mike-

Please start a new thread for this.

On Mon, Apr 11, 2011 at 2:47 AM, Mike satish01sud...@gmail.com wrote:
 Hi All,

 I have installed solr instance on tomcat6. When i tried to index the PDF
 file i was able to see the response:


 0
 479


 Query:
 http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

 But when i tried to search the content in the pdf i could not get any
 results:



 0
 2
 −

 on
 0
 struts
 10
 2.2





 Could you please let me know if I am doing anything wrong. It works fine
 when i tried with default jetty server prior to integrating on the tomcat6.

 I have followed installation steps from
 http://wiki.apache.org/solr/SolrTomcat
 (Tomcat on Windows Single Solr app).

 Thanks,
 Mike



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-under-Tomcat-tp2613501p2805970.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Indexing Best Practice

2011-04-11 Thread Lance Norskog
SOLR-1499 is a plug-in for the DIH that uses Solr as a DataSource.
This means that you can read the database and PDFs separately. You
could index all of the PDF content in one DIH script. Then, when
there's a database update, you have a separate DIH script that reads
the old row from Solr, and pulls the stripped text from the PDF, and
then re-indexes the whole thing. This would cut out the need to
reparse the PDF.

Lance

On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell
campbell.sh...@gmail.com wrote:
 If it's of any help I've split the processing of PDF files from the
 indexing. I put the PDF content into a text file (but I guess you could load
 it into a database) and use that as part of the indexing.  My processing of
 the PDF files also compares timestamps on the document and the text file so
 that I'm only processing documents that have changed.

 I am a newbie so perhaps there's more sophisticated approaches.

 Hope that helps.
 Shaun

 On 11 April 2011 07:20, Darx Oman darxo...@gmail.com wrote:

 Hi guys

 I'm wondering how to best configure solr to fulfills my requirements.

 I'm indexing data from 2 data sources:
 1- Database
 2- PDF files (password encrypted)

 Every file has related information stored in the database.  Both the file
 content and the related database fields must be indexed as one document in
 solr.  Among the DB data is *per-user* permissions for every document.

 The file contents nearly never change, on the other hand, the DB data and
 especially the permissions change very frequently which require me to
 re-index everything for every modified document.

 My problem is in process of decrypting the PDF files before re-indexing
 them
 which takes too much time for a large number of documents, it could span to
 days in full re-indexing.

 What I'm trying to accomplish is eliminating the need to re-index the PDF
 content if not changed even if the DB data changed.  I know this is not
 possible in solr, because solr doesn't update documents.

 So how to best accomplish this:

 Can I use 2 indexes one for PDF contents and the other for DB data and have
 a common id field for both as a link between them, *and results are treated
 as one Document*?





-- 
Lance Norskog
goks...@gmail.com


Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Lance Norskog
Ah! Did you set the UTF-8 parameter in Tomcat?
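
(That is, the URIEncoding attribute on the HTTP connector in Tomcat's
conf/server.xml, as described on the SolrTomcat wiki page; roughly:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
)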

On Mon, Apr 11, 2011 at 2:49 AM, Mike satish01sud...@gmail.com wrote:
 Hi Roy,

 Thank you for the quick reply. When i tried to index the PDF file i was able
 to see the response:


 0
 479



 Query:
  http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

 But when i tried to search the content in the pdf i could not get any
 results:



 0
 2
 −

 on
 0
 struts
 10
 2.2





 Could you please let me know if I am doing anything wrong. It works fine
 when i tried with default jetty server prior to integrating on the tomcat6.

 I have followed installation steps from
 http://wiki.apache.org/solr/SolrTomcat
 (Tomcat on Windows Single Solr app).

 Thanks,
 Mike



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805974.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Solr 3.1 performance compared to 1.4.1

2011-04-11 Thread Lance Norskog
Marius: I have copied the configuration from 1.4.1 to the 3.1.

Does the Directory implementation show up in the JMX beans? In
admin/statistics.jsp ? Or the Solr startup logs? (Sorry, don't have a
Solr available.)

Yonik:
 What platform are you on?  I believe the Lucene Directory
 implementation now tries to be smarter (compared to lucene 2.9) about
 picking the best default (but it may not be working out for you for
 some reason)

Lance

On Sun, Apr 10, 2011 at 12:46 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt
 pionw...@gmail.com wrote:
 Hello !

 I'm new to the list, have been using SOLR for roughly 6 months and love it.

 Currently i'm setting up a 3.1 installation, next to a 1.4.1 installation
 (Ubuntu server, same JVM params). I have copied the configuration from 1.4.1
 to the 3.1.
 Both version are running fine, but one thing ive noticed, is that the QTime
 on 3.1, is much slower for initial searches than on the (currently
 production) 1.4.1 installation.

 For example:

 Searching with 3.1; http://mysite:9983/solr/select?q=grasmaaier: QTime
 returns 371
 Searching with 1.4.1: http://mysite:8983/solr/select?q=grasmaaier: QTime
 returns 59

 Using debugQuery=true, i can see that the main time is spend in the query
 component itself (org.apache.solr.handler.component.QueryComponent).

 Can someone explain this, and how can i analyze this further ? Does it take
 time to build up a decent query, so could i switch to 3.1 without having to
 worry ?

 Thanks for the report... there's no reason that anything should really
 be much slower, so it would be great to get to the bottom of this!

 Is this using the same index as the 1.4.1 server, or did you rebuild it?

 Are there any other query parameters (that are perhaps added by
 default, like faceting or anything else that could take up time) or is
 this truly just a term query?

 What platform are you on?  I believe the Lucene Directory
 implementation now tries to be smarter (compared to lucene 2.9) about
 picking the best default (but it may not be working out for you for
 some reason).

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco




-- 
Lance Norskog
goks...@gmail.com


Indexing Flickr and Panaramio

2011-04-11 Thread Estrada Groups
Has anyone tried doing this? Got any tips for someone getting started?

Thanks,
Adam

Sent from my iPhone


Re: Clarifying fetchindex command

2011-04-11 Thread Mark Miller
Looking at the code, issuing a fetchindex will cause the fetch to occur right 
away, with no respect for polling.

- Mark

On Apr 11, 2011, at 12:37 PM, Otis Gospodnetic wrote:

 Hi,
 
 Can one actually *force* replication of the index from the master without a 
 commit being issued on the master since the last replication?
 
 I do see Force a fetchindex on slave from master command: 
 http://slave_host:port/solr/replication?command=fetchindex; on 
 http://wiki.apache.org/solr/SolrReplication#HTTP_API, but that feels more 
 like 
 force the replication *now* instead of waiting for the slave to poll the 
 master than force the replication even if there is no new commit point and 
 no 
 new index version on the master.  Which one is it, really?
 
 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org