Re: how to deal with virtual collection in solr?

2010-09-03 Thread Jan Høydahl / Cominvent
You did not supply an actual query. Try adding a q=foobar parameter. Also, 
you don't need an & before shards, since it comes right after the ?.
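
A minimal sketch of the request shape described here, assuming the core names and host from this thread. The helper name build_sharded_query is hypothetical and not part of Solr; it only assembles a URL that carries both q and shards.

```python
from urllib.parse import urlencode

def build_sharded_query(base_core_url, query, shards):
    """Return a /select URL that carries both q and shards."""
    if not query:
        # The thread attributes the NullPointerException to the missing q parameter.
        raise ValueError("q must not be empty")
    params = [("q", query), ("shards", ",".join(shards))]
    # safe="/:," keeps the shard list readable instead of percent-encoding it.
    return base_core_url.rstrip("/") + "/select?" + urlencode(params, safe="/:,")

url = build_sharded_query(
    "http://localhost:8983/solr/aapublic",
    "foobar",
    ["localhost:8983/solr/aaprivate", "localhost:8983/solr/aapublic"],
)
print(url)
```

The first parameter follows the ? directly and later ones are joined with &, which is why no & is needed before shards when it comes first.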
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com


RE: how to deal with virtual collection in solr?

2010-09-03 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help, Jan Høydahl.

Have a great weekend!
Xiaohui


RE: how to deal with virtual collection in solr?

2010-09-01 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thank you, Jan. Unfortunately, I got the following exception when I used 
http://localhost:8983/solr/aapublic/select?shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/

*
Aug 31, 2010 4:54:42 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at java.io.StringReader.<init>(StringReader.java:33)
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
        at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
        at org.apache.solr.search.QParser.getQuery(QParser.java:131)
        at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
*


Re: how to deal with virtual collection in solr?

2010-08-31 Thread Jan Høydahl / Cominvent
Hi,

If you have multiple cores defined in your solr.xml, you need to issue your 
queries to one of the cores. The query below seems to be missing the core name. 
Try instead:


http://localhost:8983/solr/aapublic/select?shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/

And as Lance pointed out, make sure your XML files conform to the Solr XML 
format (http://wiki.apache.org/solr/UpdateXmlMessages).
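
For reference, a rough sketch of what a message in that format looks like when generated programmatically. The add/doc/field layout follows the Solr update XML format; the helper name to_solr_add_xml and the title field are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def to_solr_add_xml(docs):
    """Serialize dicts of field name -> value (or list of values) as one <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            # A list or tuple becomes several <field> elements with the same name.
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")

message = to_solr_add_xml(
    [{"id": "doc1", "collection": "aaprivate", "title": "Example title"}]
)
print(message)
```

A file that does not already have this add/doc/field shape would need to be converted first, or loaded through a different handler.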

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 27. aug. 2010, at 15.04, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

 Thank you, Jan Høydahl. 
 
 I used 
 http://localhost:8983/solr/select?shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
 and got the error "Missing solr core name in path". I have aapublic and 
 aaprivate cores. I also got an error when I used 
 http://localhost:8983/solr/aapublic/select?shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
 : a java.lang.NullPointerException. 
 
 My collections are XML files. Please let me know if I can use the following 
 approach that you suggested:
 curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F fi...@myfile.xml
 
 Thanks so much as always!
 Xiaohui 
 
 

Re: how to deal with virtual collection in solr?

2010-08-28 Thread Lance Norskog
For XML files that are not in the Solr document upload format, you
would use the DataImportHandler.

http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor

Look for the wikipedia example. It shows how to read XML files from
disk. You give XPath expressions for different items in the XML.
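
To illustrate the idea of mapping XPath expressions to fields (this is not DataImportHandler configuration, only a stand-alone sketch), the snippet below pulls fields out of an XML file with path expressions. The sample document, its element names, and the extract_docs helper are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the element names are invented for illustration.
SAMPLE = """<records>
  <record><id>doc1</id><title>First</title><body>Some text</body></record>
  <record><id>doc2</id><title>Second</title><body>More text</body></record>
</records>"""

# Target field name -> path expression, evaluated relative to each record.
FIELD_PATHS = {"id": "id", "title": "title", "text": "body"}

def extract_docs(xml_text, record_path, field_paths):
    """Return one dict per record, filled from the given path expressions."""
    root = ET.fromstring(xml_text)
    docs = []
    for record in root.findall(record_path):
        docs.append({name: record.findtext(path) for name, path in field_paths.items()})
    return docs

docs = extract_docs(SAMPLE, "./record", FIELD_PATHS)
for d in docs:
    print(d)
```

In the DataImportHandler the same kind of mapping is declared in its configuration file rather than in code; the wiki page linked above has the actual syntax.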


Re: how to deal with virtual collection in solr?

2010-08-27 Thread Jan Høydahl / Cominvent
Hi,

Version 1.4.1 does not support SolrCloud-style sharding. In 1.4.1, please 
use this style:
shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/


However, since the schema is the same, I'd opt for one index with a collection 
field as the filter.

You can add that field to your schema, and then inject it as metadata on the 
ExtractingRequestHandler call:

curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F fi...@myfile.pdf
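
A small sketch of the single-index approach, assuming a field named collection has been added to the schema and that the index is served at the base URL used in this thread. The helper names extract_url and filtered_select_url are hypothetical; fq is Solr's standard filter-query parameter.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # base URL as used in this thread

def extract_url(doc_id, collection, commit=True):
    """URL for the extracting handler, tagging the document with its collection."""
    params = [
        ("literal.collection", collection),
        ("literal.id", doc_id),
        ("commit", "true" if commit else "false"),
    ]
    return SOLR + "/update/extract?" + urlencode(params)

def filtered_select_url(query, collection):
    """URL for a search restricted to one collection through a filter query."""
    params = [("q", query), ("fq", "collection:" + collection)]
    return SOLR + "/select?" + urlencode(params)

print(extract_url("doc1", "aaprivate"))
print(filtered_select_url("foobar", "aaprivate"))
```

With this layout, searching several "virtual collections" at once would come down to a different filter value rather than a shards list.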

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 26. aug. 2010, at 20.41, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

 Thanks so much for your help! I will try it.
 
 
 -Original Message-
 From: Thomas Joiner [mailto:thomas.b.joi...@gmail.com] 
 Sent: Thursday, August 26, 2010 2:36 PM
 To: solr-user@lucene.apache.org
 Subject: Re: how to deal with virtual collection in solr?
 
 I don't know about the shards, etc.
 
 However, I recently encountered that exception while indexing PDFs as well.
 I resolved it by upgrading to a nightly build of Solr. (You can find them
 at https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/.)
 
 The problem is that the version of Tika that 1.4.1 uses is very old and
 relies on an old version of PDFBox to do its parsing. (You might be able to
 fix the problem just by replacing the Tika jars. However, I don't know if
 there have been any API changes, so I can't really suggest that.)
 
 We didn't upgrade to trunk for that functionality, but it was nice
 that it started working. (The PDFs we'll be indexing won't be of later
 versions, but a test file was.)
 
 On Thu, Aug 26, 2010 at 1:27 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] 
 xiao...@mail.nlm.nih.gov wrote:
 
 Thanks so much for your help, Jan Høydahl!
 
 I made multiple cores (aa public, aa private, bb public and bb private). I
 know how to query them individually. Please tell me whether I can now query
 combinations of them through the shards parameter. I tried appending
 shards=aapub,bbpub to the query string, but unfortunately it didn't work.
 
 Actually, all of the content is the same. I don't have a collection field in
 the XML files. Please tell me how I can set up a collection field in the
 schema and simply search a collection through a filter.
 
 I used curl to index PDF files with Solr 1.4.1. I got the following error
 when I indexed PDFs of versions 1.5 and 1.6.
 
 *
 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
 <title>Error 500 </title>
 </head>
 <body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException:
 Unexpected RuntimeException from
 org.apache.tika.parser.pdf.pdfpar...@134ae32
 
 org.apache.solr.common.SolrException:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from
 org.apache.tika.parser.pdf.pdfpar...@134ae32
   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 Caused by: org.apache.tika.exception.TikaException: Unexpected
 

RE: how to deal with virtual collection in solr?

2010-08-27 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much, I really appreciate your help!
Have a great weekend!
Xiaohui 

-Original Message-
From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com] 
Sent: Friday, August 27, 2010 7:42 AM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr?

Hi,

Version 1.4.1 does not support the SolrCloud style sharding. In 1.4.1, please 
use this style:
shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/


However, since schema is the same, I'd opt for one index with a collections 
field as the filter.

You can add that field to your schema, and then inject it as metadata on the 
ExtractingRequestHandler call:

curl 
http://localhost:8983/solr/update/extract?literal.collection=aaprivateliteral.id=doc1commit=true;
 -F fi...@myfile.pdf

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com


RE: how to deal with virtual collection in solr?

2010-08-27 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thank you, Jan Høydahl. 

I used 
http://localhost:8983/solr/select?shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
and got the error "Missing solr core name in path". I have aapublic and 
aaprivate cores. I also got an error when I used 
http://localhost:8983/solr/aapublic/select?shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
This time it was a java.lang.NullPointerException. 

My collections are xml files. Please let me know if I can use the approach you 
suggested:
curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F fi...@myfile.xml

Thanks so much as always!
Xiaohui 


-----Original Message-----
From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com] 
Sent: Friday, August 27, 2010 7:42 AM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr?

Hi,

Version 1.4.1 does not support the SolrCloud style sharding. In 1.4.1, please 
use this style:
shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/


However, since schema is the same, I'd opt for one index with a collections 
field as the filter.

You can add that field to your schema, and then inject it as metadata on the 
ExtractingRequestHandler call:

curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F fi...@myfile.pdf

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com


RE: how to deal with virtual collection in solr?

2010-08-26 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help, Jan Høydahl!

I made multiple cores (aa public, aa private, bb public and bb private) and I 
know how to query them individually. Can I now query them in combination 
through the shards parameter? I tried appending shards=aapub,bbpub to the 
query string, but it didn't work.

Actually, all of the content is the same. I don't have a collection field in 
the xml files. Please tell me how I can set a collection field in the schema 
and simply select a collection through a filter.

I used curl to index the pdf files. I use Solr 1.4.1. I got the following 
error when indexing pdf files of version 1.5 and 1.6.

*
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@134ae32
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    ... 22 more
Caused by: java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    ... 24 more
</pre>
<p>RequestURI=/solr/lhcpdf/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>
<br/>
***



Re: how to deal with virtual collection in solr?

2010-08-26 Thread Thomas Joiner
I don't know about the shards, etc.

However, I recently encountered that exception while indexing PDFs as well.
The way I resolved it was to upgrade to a nightly build of Solr. (You
can find them at https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/.)

The problem is that the version of Tika that 1.4.1 uses is very old, and it
uses an old version of PDFBox to do its parsing. (You might be able to fix
the problem just by replacing the Tika jars; however, I don't know if there
have been any API changes, so I can't really suggest that.)

We didn't upgrade to trunk for that functionality, but it was nice
that it started working. (The PDFs we'll be indexing won't be of later
versions, but a test file was.)


RE: how to deal with virtual collection in solr?

2010-08-26 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help! I will try it.



Re: how to deal with virtual collection in solr?

2010-08-25 Thread Walter Underwood
On Aug 25, 2010, at 12:18 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

 I just started to investigate Solr several weeks ago. Our current project 
 uses Verity search engine which is commercial product and the company is out 
 of business. 


Verity is not out of business. They were acquired by Autonomy.

wunder
--
Walter Underwood





RE: how to deal with virtual collection in solr?

2010-08-25 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thank you for letting me know. Does Autonomy still support the Verity search 
engine? 






Re: how to deal with virtual collection in solr?

2010-08-25 Thread Jan Høydahl / Cominvent
 1. Currently we use Verity and have more than 20 collections, each collection 
 has a index for public items and a index for private items. So there are 
 virtual collections which point to each collection and a virtual collection 
 which points to all. For example, we have AA and BB collections.
 
 AA virtual collection -- (AA index for public items and AA index for private 
 items).
 BB virtual collection -- (BB index for public items and BB index for private 
 items).
 All virtual collection -- (AA index for public items and AA index for 
 private items, BB index for public items and BB index for private items).
 
 Would you please tell me what I should do for this if I use Solr?

There are multiple ways to solve this, depending on the nature of your 
collections. If they have somewhat different schemas, a natural choice would be 
to make multiple cores: AA-private, AA-public, BB-private, BB-public. You can 
then query them individually or in combination through the shards parameter. 
From the next Solr version you can use virtual collections in the shards 
parameter, e.g. shards=AA,BB etc. (See 
http://wiki.apache.org/solr/SolrCloud#Distributed_Requests)
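
As an illustration of the multi-core option, here is a minimal Python sketch that builds a distributed query URL across several cores. The helper name distributed_query_url and the query "foobar" are assumptions; the shard entries follow the host:port/solr/core style used elsewhere in this thread, and a q parameter is included because the request needs an actual query.

```python
from urllib.parse import urlencode

def distributed_query_url(solr_base, entry_core, shard_cores, query):
    """Build a /select URL that fans out over several cores via shards=.

    distributed_query_url is a hypothetical helper. Shard entries are
    written without the http:// scheme, e.g. localhost:8983/solr/AA-private.
    """
    host_and_path = solr_base.split("://", 1)[1].rstrip("/")
    shards = ",".join(f"{host_and_path}/{core}" for core in shard_cores)
    # Keep ':', '/' and ',' readable instead of percent-encoding them.
    params = urlencode({"q": query, "shards": shards}, safe=":/,")
    return f"{solr_base.rstrip('/')}/{entry_core}/select?{params}"

url = distributed_query_url(
    "http://localhost:8983/solr",
    "AA-public",
    ["AA-private", "AA-public"],
    "foobar",
)
print(url)
```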

If all your content is (roughly) the same kind of data, you could also solve 
your virtual collection issue through a collection field in your schema, and 
simply select a collection through filters: fq=collection:AA. You could even 
write a Search Component which translates a collection= parameter in the 
request into the correct filters if you want to hide this implementation from 
the front ends.
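
To make the filter idea concrete, here is a minimal client-side Python sketch of the translation step. It is not a Solr Search Component (those are written in Java); the mapping table, the "All" name and the function names are assumptions for illustration only.

```python
# Hypothetical mapping from virtual collection names to the values stored
# in the "collection" field of each document.
VIRTUAL_COLLECTIONS = {
    "AA": ["AA"],
    "BB": ["BB"],
    "All": ["AA", "BB"],
}

def collection_filter(virtual_name, field="collection"):
    """Return the fq value that selects one virtual collection."""
    members = VIRTUAL_COLLECTIONS[virtual_name]
    if len(members) == 1:
        return f"{field}:{members[0]}"
    return f"{field}:(" + " OR ".join(members) + ")"

def rewrite_params(params):
    """Replace a collection= parameter with an equivalent fq= filter."""
    rewritten = {k: v for k, v in params.items() if k != "collection"}
    if "collection" in params:
        rewritten["fq"] = collection_filter(params["collection"])
    return rewritten

print(rewrite_params({"q": "foobar", "collection": "All"}))
```

With this approach the front end only ever sends collection=AA, collection=BB or collection=All, and the filter syntax stays in one place.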

 2. Our project has different kind format files I need index them. For 
 example, xml files, pdf files and text files. Is it possible for Solr to 
 return a search result from all?

Sure. PDF and text files can be indexed through the ExtractingRequestHandler. 
XML can be indexed through the XmlUpdateRequestHandler (if it is in Solr's 
update XML format) or the DataImportHandler. Solr uses Apache Tika internally 
to extract text from PDFs and other rich document formats.
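
A small Python sketch of how an indexing script might route files to the two handlers by extension. The /update path for XML is an assumption based on the stock example configuration; /update/extract is the path used elsewhere in this thread. The function name is hypothetical, and XML sent to /update is assumed to already be in Solr's update format.

```python
import os

EXTRACT_PATH = "/update/extract"  # ExtractingRequestHandler (PDF, text, ...)
XML_UPDATE_PATH = "/update"       # XML update handler; path is an assumption

def handler_path_for(filename):
    """Pick a request handler path from the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".xml":
        return XML_UPDATE_PATH
    # Everything else goes through Tika-based extraction.
    return EXTRACT_PATH

for name in ["docs.xml", "report.PDF", "notes.txt"]:
    print(name, "->", handler_path_for(name))
```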

 
 3. I got a error when I index pdf files which are version 1.5 or 1.6. Would 
 you please tell me if there is a patch to fix it?

How did you try to index these PDFs? What version of Solr are you using? 
Exactly what error message did you get?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com