Unknown query parser 'terms' with TermsComponent defined
Hi,

We've encountered a strange situation and I'm hoping someone might be able to shed some light. We're using Solr 4.9 deployed in Tomcat 7. We build a query that has these params:

'params'={
  'fl'='id',
  'sort'='system_create_dtsi asc',
  'indent'='true',
  'start'='0',
  'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms f=id}ft849m81z)',
  'qt'='standard',
  'wt'='ruby',
  'rows'=['1', '1000']}},

and it responds with an error message:

'error'={
  'msg'='Unknown query parser \'terms\'',
  'code'=400}}

The terms component is defined in solrconfig.xml:

<searchComponent name="termsComponent" class="solr.TermsComponent" />

<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>

And the standard request handler is defined:

<requestHandler name="standard" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">lucene</str>
  </lst>
</requestHandler>

In case it's useful, we have <luceneMatchVersion>4.9</luceneMatchVersion>.

Why would we be getting the "Unknown query parser 'terms'" error?

Thanks,
Tricia
Re: Unknown query parser 'terms' with TermsComponent defined
Thanks Hoss! It's obvious what the problem(s) are when you lay it all out that way.

On Tue, Aug 25, 2015 at 12:14 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

1) The terms query parser (TermsQParser) has nothing to do with the TermsComponent (the first is for querying many distinct terms, the latter is for requesting info about low-level terms in your index).
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

2) TermsQParser (which is what you are trying to use with the {!terms... query syntax) was not added to Solr until 4.10.

3) Based on your example query, I'm pretty sure what you want is the TermQParser: "term" (singular, no "s") ...
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser

{!term f=id}ft849m81z

: We've encountered a strange situation, I'm hoping someone might be able to
: shed some light. We're using Solr 4.9 deployed in Tomcat 7.
...
: 'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms f=id}ft849m81z)',
...
: 'msg'='Unknown query parser \'terms\'',
: 'code'=400}}
...
: The terms component is defined in solrconfig.xml:
:
: <searchComponent name="termsComponent" class="solr.TermsComponent" />

-Hoss
http://www.lucidworks.com/
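For illustration, a minimal SolrJ sketch of the corrected query using the TermQParser (singular) that Hoss points to. The host, port, and core name are placeholders, and quoting the nested _query_ clause is assumed to be needed so the lucene parser treats it as one unit:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermQueryExample {
  public static void main(String[] args) throws Exception {
    // placeholder URL for the Solr 4.9 instance running in Tomcat
    HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");
    SolrQuery q = new SolrQuery();
    // {!term} (no "s") exists in 4.9; {!terms} only arrived in 4.10
    q.setQuery("_query_:\"{!raw f=has_model_ssim}Batch\" AND ({!term f=id}ft849m81z)");
    q.setFields("id");
    q.setSort("system_create_dtsi", SolrQuery.ORDER.asc);
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getResults().getNumFound());
  }
}
```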
Re: Advice on highlighting
Hi Craig,

Have you seen SOLR-4722 (https://issues.apache.org/jira/browse/SOLR-4722)? This was my attempt at something similar.

Regards,
Tricia

On Fri, Sep 12, 2014 at 2:23 PM, Craig Longman clong...@iconect.com wrote:

In order to take our Solr usage to the next step, we really need to improve its highlighting abilities. What I'm trying to do is to be able to write a new component that can return the fields that matched the search (including numeric fields) and the start/end positions for the alphanumeric matches. I see three different approaches to take; any of them will require making some modifications to the lucene/solr parts, as it just does not appear to be doable as a completely stand-alone component.

1) At initial search time. This seemed like a good approach. I can follow IndexSearcher creating the TermContext that parses through AtomicReaderContexts to see if it contains a match and then adds it to the contexts available for later. However, at this point, inside SegmentTermsEnum.seekExact() it seems like Solr is not really looking for matching terms as such, it's just scanning what looks like the raw index. So I don't think I can easily extract term positions at this point.

2) Write a modified HighlighterComponent. We have managed to get phrases to highlight properly, but it seems like getting the full field matches would be more difficult in this module. Because it does its highlighting oblivious to any other criteria, we can't use it as is. For example, this search:

(body:large+AND+user_id:7)+OR+user_id:346

will highlight "large" in records that have user_id = 346 when technically (for our purposes at least) it should not be considered a hit, because the "large" was accompanied by the user_id = 7 criteria. It's not immediately clear to me how difficult it would be to change this.

3) Make a modified DebugComponent and enhance the existing explain() methods (in the query types we require it, at least) to include more information such as the start/end positions of the term that was hit. I'm exploring this now, but I don't easily see how I can figure out what those positions might be from the explain() information. Any pointers on how, at the point that TermQuery.explain() is being called, I can figure out which indexed token was the actual hit?

Craig Longman
C++ Developer
iCONECT Development, LLC
519-645-1663
How to sync lib directory in SolrCloud?
Hi,

I have an existing collection that I'm trying to add to a new SolrCloud. This collection has all the normal files in conf but also has a lib directory to support the filters schema.xml uses.

wget https://github.com/projectblacklight/blacklight-jetty/archive/v4.9.0.zip
unzip v4.9.0.zip

I add the configuration to Zookeeper:

cd /solr-4.9.0/example/scripts
cloud-scripts/zkcli.sh -cmd upconfig -confname blacklight -zkhost zk1:2181,zk2:2181,zk3:2181 -confdir ~/blacklight-jetty-4.9.0/solr/blacklight-core/conf/

I try to create the collection:

curl http://solr1:8080/solr/admin/collections?action=CREATE&name=blacklight&numShards=3&collection.configName=blacklight&replicationFactor=2&maxShardsPerNode=2

but it looks like the jars in the lib directory aren't available and this is what is causing my collection creation to fail. I guess this makes sense because it's not one of the files that I added to Zookeeper to share. How do I share the lib directory via Zookeeper?

Thanks,
Tricia

[pjenkins@solr1 scripts]$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost zk1:2181,zk2:2181,zk3:2181 -confdir ~/blacklight-jetty-4.9.0/solr/blacklight-core/conf/ -confname blacklight
INFO - 2014-07-31 09:28:06.289; org.apache.zookeeper.Environment; Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
INFO - 2014-07-31 09:28:06.292; org.apache.zookeeper.Environment; Client environment:host.name=solr1.library.ualberta.ca
INFO - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client environment:java.version=1.7.0_65
INFO - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client environment:java.vendor=Oracle Corporation
INFO - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client environment:java.home=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
INFO - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client
Re: Changing Cache Properties after Indexing
You're both completely right. There isn't any issue with indexing with large cache settings. I ran the same indexing job five times, twice with large caches and twice with the default values. I threw out the first job because, no matter whether it's cached or uncached, the first run is ~2x slower. This must have been the observation I based my incorrect caching notion on. I unloaded (with delete of the data directory) and reloaded the core each time. I'm using DIH with the FileEntityProcessor and PlainTextEntityProcessor to index ~11000 fulltext books.

w/ cache: 0:13:14.823, 0:12:33.910
w/o cache: 0:12:13.186, 0:15:56.566

There is variation, but not anything that could be explained by the cache settings. Doh!

Thanks,
Tricia

On Mon, Jan 13, 2014 at 6:08 PM, Shawn Heisey s...@elyograg.org wrote:

On 1/13/2014 4:44 PM, Erick Erickson wrote:
On the face of it, it's somewhat unusual to have the cache settings affect indexing performance. What are you seeing and how are you indexing?

I think this is probably an indirect problem. Cache settings don't directly affect indexing speed, but when autoWarm values are high and NRT indexing is happening, new searchers are requested frequently and the autoWarm makes that happen slowly with a lot of resources consumed.

Thanks,
Shawn
Changing Cache Properties after Indexing
Hi,

I've gone through steps for tuning my cache sizes and I'm very happy with the results of load testing. Unfortunately the cache settings for querying are not optimal for indexing - and in fact slow it down quite a bit. I've made the caches small by default for the indexing stage and then want to override the values using properties when used for querying. That's easy enough to do and described in SolrConfigXml (http://wiki.apache.org/solr/SolrConfigXml). I store these properties in a solrcore-querying.properties file. When indexing is complete I could unload the Solr core, move (mv) this file to conf/solrcore.properties and then load the Solr core and it would pick up the new properties. The only problem with that is in production I won't have access to the machine to make changes to the file system. I need to be able to do this using the Core Admin API.

I can see that I can specify individual properties with the CREATE command, for instance property.solr.filterCache.size=2003232. Great! So this is possible but I still have two questions:

1. Is there a way to specify a conf/solrcore-querying.properties file to the admin/cores handler instead of each property individually?
2. The same functionality doesn't seem to be available when I call the RELOAD command. Is this expected behaviour? Should it be?

Is there a better way?

Thanks,
Tricia
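For what it's worth, a rough sketch of what a Core Admin CREATE call with a per-core property could look like from Java. The host, core name, and instanceDir are placeholders, and the property.* parameter only takes effect where solrconfig.xml references it, e.g. a cache declared with size="${solr.filterCache.size:512}":

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;

public class CreateQueryCore {
  public static void main(String[] args) throws Exception {
    // placeholder host/core/instanceDir; property.* overrides ${...} placeholders in solrconfig.xml
    String url = "http://localhost:8983/solr/admin/cores?action=CREATE"
        + "&name=" + URLEncoder.encode("core-querying", "UTF-8")
        + "&instanceDir=" + URLEncoder.encode("core-querying", "UTF-8")
        + "&property.solr.filterCache.size=2003232";
    try (InputStream in = new URL(url).openStream()) {
      // reading the response executes the request; a real client would parse the returned status
      while (in.read() != -1) { }
    }
  }
}
```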
Re: DataImport Handler, writing a new EntityProcessor
Hi Mathias,

I'd recommend testing one thing at a time. See if you can get it to work for one image before you try a directory of images. Also try testing using the solr-test-framework with your IDE (I use Eclipse) to debug, rather than your browser/print statements. Hopefully that will give you some more specific knowledge of what's happening around your plugin. I also wrote an EntityProcessor plugin to read from a properties file (https://issues.apache.org/jira/browse/SOLR-3928). Hopefully that'll give you some insight about this kind of Solr plugin and how to test them.

Cheers,
Tricia

On Wed, Dec 18, 2013 at 3:03 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote:

Hi all!

I've got a question regarding writing a new EntityProcessor, in the same sense as the Tika one. My EntityProcessor should analyze jpg images and create document fields to be used with the LIRE Solr plugin (https://bitbucket.org/dermotte/liresolr). Basically I've taken the same approach as the TikaEntityProcessor, but my setup just indexes the first of 1000 images. I'm using a FileListEntityProcessor to get all JPEGs from a directory and then I'm handing them over (see [2]). My code for the EntityProcessor is at [1]. I've tried to use the DataSource as well as the filePath attribute, but it ends up all the same. However, the FileListEntityProcessor is able to read all the files according to the debug output, but I'm missing the link from the FileListEntityProcessor to the LireEntityProcessor.

I'd appreciate any pointer or help :)

cheers,
Mathias

[1] LireEntityProcessor http://pastebin.com/JFajkNtf
[2] dataConfig http://pastebin.com/vSHucatJ

--
Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec
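As a rough illustration of the kind of test meant here (the class name, config paths, and the inlined data-config are placeholders; runFullImport and assertQ come from the DIH test base class):

```java
import org.apache.solr.handler.dataimport.AbstractDataImportHandlerTestCase;
import org.junit.BeforeClass;
import org.junit.Test;

public class TestLireEntityProcessor extends AbstractDataImportHandlerTestCase {

  // placeholder: inline the data-config that wires FileListEntityProcessor to the LireEntityProcessor
  private static final String DATA_CONFIG = "<dataConfig>...</dataConfig>";

  @BeforeClass
  public static void beforeClass() throws Exception {
    // placeholder solrconfig/schema/solr home for the test core
    initCore("solrconfig.xml", "schema.xml", getFile("solr").getAbsolutePath());
  }

  @Test
  public void testSingleImage() throws Exception {
    runFullImport(DATA_CONFIG);                 // run the import against the embedded test core
    assertQ(req("*:*"), "//*[@numFound='1']");  // start with one image, then grow the directory
  }
}
```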
Re: Using data-config.xml from DIH in SolrJ
Hi,

I just discovered UpdateProcessorFactory (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/package-summary.html) in a big way. How did this completely slip by me? Working on two ideas.

1. I have used the DIH in a local EmbeddedSolrServer previously. I could write a ForwardingUpdateProcessorFactory to take that local update and send it to a HttpSolrServer.

2. I have code which walks the file-system to compose rough documents but haven't yet written the part that handles the templated fields and cross-walking of the source(s) to the schema. I could configure the update handler on the Solr server side to do this with the RegexReplace (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html) and DefaultValue (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html) UpdateProcessorFactor(ies).

Any thoughts on the advantages/disadvantages of these approaches?

Thanks,
Tricia

On Thu, Nov 14, 2013 at 7:49 AM, Erick Erickson erickerick...@gmail.com wrote:

There's nothing that I know of that takes a DIH configuration and uses it through SolrJ. You can use Tika directly in SolrJ if you need to parse structured documents though, see:
http://searchhub.org/2012/02/14/indexing-with-solrj/

Yep, you're going to be kind of reinventing the wheel a bit I'm afraid.

Best,
Erick

On Wed, Nov 13, 2013 at 1:55 PM, P Williams williams.tricia.l...@gmail.com wrote:

Hi All,

I'm building a utility (Java jar) to create SolrInputDocuments and send them to a HttpSolrServer using the SolrJ API. The intention is to find an efficient way to create documents from a large directory of files (where multiple files make one Solr document) and be sent to a remote Solr instance for update and commit.

I've already solved the problem using the DataImportHandler (DIH) so I have a data-config.xml that describes the templated fields and cross-walking of the source(s) to the schema. The original data won't always be able to be co-located with the Solr server which is why I'm looking for another option.

I've also already solved the problem using ant and xslt to create a temporary (and unfortunately a potentially large) document which the UpdateHandler will accept. I couldn't think of a solution that took advantage of the XSLT support in the UpdateHandler because each document is created from multiple files. Our current dated Java based solution significantly outperforms this solution in terms of disk and time. I've rejected it based on that and gone back to the drawing board.

Does anyone have any suggestions on how I might be able to reuse my DIH configuration in the SolrJ context without re-inventing the wheel (or DIH in this case)? If I'm doing something ridiculous I hope you'll point that out too.

Thanks,
Tricia
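A very rough sketch of idea 1, just to make it concrete; the remote URL is a placeholder and error handling/batching are ignored:

```java
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ForwardingUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  // placeholder URL of the remote Solr that should receive the documents
  private final HttpSolrServer remote = new HttpSolrServer("http://remote-host:8983/solr/collection1");

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        try {
          remote.add(cmd.getSolrInputDocument());  // forward the document to the remote Solr
        } catch (SolrServerException e) {
          throw new IOException(e);
        }
        super.processAdd(cmd);                     // continue the local update chain
      }
    };
  }
}
```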
Using data-config.xml from DIH in SolrJ
Hi All, I'm building a utility (Java jar) to create SolrInputDocuments and send them to a HttpSolrServer using the SolrJ API. The intention is to find an efficient way to create documents from a large directory of files (where multiple files make one Solr document) and be sent to a remote Solr instance for update and commit. I've already solved the problem using the DataImportHandler (DIH) so I have a data-config.xml that describes the templated fields and cross-walking of the source(s) to the schema. The original data won't always be able to be co-located with the Solr server which is why I'm looking for another option. I've also already solved the problem using ant and xslt to create a temporary (and unfortunately a potentially large) document which the UpdateHandler will accept. I couldn't think of a solution that took advantage of the XSLT support in the UpdateHandler because each document is created from multiple files. Our current dated Java based solution significantly outperforms this solution in terms of disk and time. I've rejected it based on that and gone back to the drawing board. Does anyone have any suggestions on how I might be able to reuse my DIH configuration in the SolrJ context without re-inventing the wheel (or DIH in this case)? If I'm doing something ridiculous I hope you'll point that out too. Thanks, Tricia
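For reference, the bare-bones SolrJ flow described above looks roughly like this; the URL and field values are placeholders, and the real utility would fill the document from the group of files that makes up one record:

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RemoteIndexer {
  public static void main(String[] args) throws Exception {
    // placeholder URL of the remote Solr core receiving the documents
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "example-1");             // values would be cross-walked from the source files
    doc.addField("title", "Example title");
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}
```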
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
Hi Andreas,

When using XPathEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor) your DataSource must be of type DataSource<Reader>. You shouldn't be using BinURLDataSource; it's giving you the cast exception. Use URLDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/URLDataSource.html) or FileDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/FileDataSource.html) instead. I don't think you need to specify namespaces, at least you didn't used to. The other thing that I've noticed is that the "anywhere" xpath expression // doesn't always work in DIH. You might have to be more specific.

Cheers,
Tricia

On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen a...@conx.ch wrote:

how dum can you get. obviously quite dum... i would have to analyze the html-pages with a nested instance like this:

<entity name="rec" processor="XPathEntityProcessor" url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml" forEach="/docs/doc" dataSource="main">
  <entity name="htm" processor="XPathEntityProcessor" url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
    <field column="text" xpath="//content" />
    <field column="h_2" xpath="//body" />
    <field column="text_nohtml" xpath="//text" />
    <field column="h_1" xpath="//h:h1" />
  </entity>
</entity>

but i'm pretty sure the forEach and the xpath expressions are wrong. at the moment i'm getting the following error:

Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader

On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:

ok i see what you're getting at, but why doesn't the following work:

<field xpath="//h:h1" column="h_1" />
<field column="text" xpath="/xhtml:html/xhtml:body" />

i removed the tika-processor. what am i missing? i haven't found anything in the wiki.

On 28. Sep 2013, at 12:28 AM, P Williams wrote:

I spent some more time thinking about this. Do you really need to use the TikaEntityProcessor? It doesn't offer anything new to the document you are building that couldn't be accomplished by the XPathEntityProcessor alone from what I can tell. I also tried to get the Advanced Parsing example (http://wiki.apache.org/solr/TikaEntityProcessor) to work, without success. There are some obvious typos (document instead of /document) and an odd order to the pieces (dataSources is enclosed by document). It also looks like FieldStreamDataSource (http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html) is the one that is meant to work in this context. If Koji is still around maybe he could offer some help? Otherwise this bit of erroneous instruction should probably be removed from the wiki.
Cheers,
Tricia

$ svn diff
Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
===================================================================
--- solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (revision 1526990)
+++ solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (working copy)
@@ -99,13 +99,13 @@
     runFullImport(getConfigHTML("identity"));
     assertQ(req("*:*"), testsHTMLIdentity);
   }
-
+
   private String getConfigHTML(String htmlMapper) {
     return
         "<dataConfig>" +
         "  <dataSource type='BinFileDataSource'/>" +
         "  <document>" +
-        "    <entity name='Tika' format='xml' processor='TikaEntityProcessor' " +
+        "    <entity name='Tika' format='html' processor='TikaEntityProcessor' " +
         "       url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' " +
         ((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper + "'")) + ">" +
         "      <field column='text'/>" +
@@ -114,4 +114,36 @@
     "</dataConfig>";
   }
+
+  private String[] testsHTMLH1 = {
+      "//*[@numFound='1']"
+      , "//str[@name='h1'][contains(.,'H1 Header')]"
+  };
+
+  @Test
+  public void testTikaHTMLMapperSubEntity() throws Exception {
+    runFullImport(getConfigSubEntity("identity"));
+    assertQ(req("*:*"), testsHTMLH1);
+  }
+
+  private String getConfigSubEntity(String htmlMapper) {
+    return
+        "<dataConfig>" +
+        "  <dataSource type='BinFileDataSource' name='bin'/>" +
+        "  <dataSource type='FieldStreamDataSource' name='fld'/>" +
+        "  <document>" +
+        "    <entity name='tika' processor
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
$StatementRunner.run(ThreadLeakControl.java:358)
at java.lang.Thread.run(Thread.java:722)

On Fri, Sep 27, 2013 at 3:55 AM, Andreas Owen a...@conx.ch wrote:

i removed the FieldReaderDataSource and dataSource=fld but it didn't help. i get the following for each document:

DataImportHandlerException: Exception in invoking url null Processing Document # 9
nullpointerexception

On 26. Sep 2013, at 8:39 PM, P Williams wrote:

Hi,

Haven't tried this myself but maybe try leaving out the FieldReaderDataSource entirely. From my quick searching looks like it's tied to SQL. Did you try copying the http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example exactly? What happens when you leave out FieldReaderDataSource?

Cheers,
Tricia

On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor detail creates the error, the rest works fine. what am i doing wrong?

error:
ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
Hi,

Haven't tried this myself but maybe try leaving out the FieldReaderDataSource entirely. From my quick searching looks like it's tied to SQL. Did you try copying the http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example exactly? What happens when you leave out FieldReaderDataSource?

Cheers,
Tricia

On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:

i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages but i'm getting this error for each document. i have also tried dataField=tika.text and dataField=text to no avail. the nested XPathEntityProcessor detail creates the error, the rest works fine. what am i doing wrong?

error:
ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
ERROR - 2013-09-26 12:08:49.022;
Re: DIH field defaults or re-assigning field values
I discovered how to use the ScriptTransformer (http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer), which worked to solve my problem. I had to make use of context.setSessionAttribute(...,...,'global') to store a flag for the value in the file, because the script is only called if there are rows to transform and I needed to know when the default was appropriate to set in the root entity.

Thanks for your suggestions Alex.

Cheers,
Tricia

On Wed, Sep 18, 2013 at 1:19 PM, P Williams williams.tricia.l...@gmail.com wrote:

Hi All,

I'm using the DataImportHandler to import documents to my index. I assign one of my document's fields by using a sub-entity from the root to look for a value in a file. I've got this part working. If the value isn't in the file or the file doesn't exist I'd like the field to be assigned a default value. Is there a way to do this?

I think I'm looking for a way to re-assign the value of a field. If this is possible then I can assign the default value in the root entity and overwrite it if the value is found in the sub-entity.

Ideas?

Thanks,
Tricia
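The same flag-setting idea, written as a compiled Transformer instead of the ScriptTransformer actually used above, might look roughly like this (field and attribute names are placeholders):

```java
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class FlagTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    if (row.get("myfield") != null) {
      // remember, across rows and entities, that the file supplied a value,
      // so the root entity knows whether the default is still needed
      context.setSessionAttribute("myfield.found", Boolean.TRUE, Context.SCOPE_GLOBAL);
    }
    return row;
  }
}
```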
DIH field defaults or re-assigning field values
Hi All, I'm using the DataImportHandler to import documents to my index. I assign one of my document's fields by using a sub-entity from the root to look for a value in a file. I've got this part working. If the value isn't in the file or the file doesn't exist I'd like the field to be assigned a default value. Is there a way to do this? I think I'm looking for a way to re-assign the value of a field. If this is possible then I can assign the default value in the root entity and overwrite it if the value is found in the sub-entity. Ideas? Thanks, Tricia
Re: How to Manage RAM Usage at Heavy Indexing
Hi,

I've been seeing the same thing on CentOS: high physical memory use with low JVM-Memory use. I came to the conclusion that this was expected behaviour. Using top I noticed that my solr user's java process has virtual memory allocated of about twice the size of the index; actual is within the limits I set when jetty starts. I infer from this that 98% of physical memory is being used to cache the index. Walter, Erick and others are constantly reminding people on list to have RAM the size of the index available -- I think 98% physical memory use is exactly why.

Here is an excerpt from Uwe Schindler's well written piece (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) which explains in greater detail:

"Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don't need to do any syscalls, the processor's MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache. It is now just a native memory access, nothing more! We don't have to take care of paging in/out of buffers, all this is managed by the O/S kernel. Furthermore, we have no concurrency issue, the only overhead over a standard byte[] array is some wrapping caused by Java's ByteBuffer interface (it is still slower than a real byte[] array, but that is the only way to use mmap from Java and is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all Java GC issues described before."

Is it odd that my index is ~16GB but top shows 30GB in virtual memory? Would the extra be for the field and filter caches I've increased in size?

I went through a few Java tuning steps relating to OutOfMemoryErrors when using DataImportHandler with Solr. The first thing is that when using the FileEntityProcessor, for each file in the file system to be indexed an entry is made and stored in heap before any indexing actually occurs. When I started pointing this at very large directories I started running out of heap. One work-around is to divide the job up into smaller batches, but I was able to allocate more memory so that everything fit. The next thing is that with more memory allocated the limiting factor was too many open files. After allowing the solr user to open more files I was able to get past this as well. There was a sweet spot where indexing with just enough memory was slow enough that I didn't experience the too many open files error, but why go slow? Now I'm able to index ~4M documents (newspaper articles and fulltext monographs) in about 7 hours.

I hope someone will correct me if I'm wrong about anything I've said here and especially if there is a better way to do things.

Best of luck,
Tricia

On Wed, Aug 28, 2013 at 12:12 PM, Dan Davis dansm...@gmail.com wrote:

This could be an operating systems problem rather than a Solr problem.
CentOS 6.4 (linux kernel 2.6.32) may have some issues with page flushing and I would read up on that. The VM parameters can be tuned in /etc/sysctl.conf

On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Erick;

I wanted to get a quick answer, that's why I asked my question that way. Error is as follows:

INFO - 2013-08-21 22:01:30.978; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {add=[com.deviantart.reachmehere:http/gallery/, com.deviantart.reachstereo:http/, com.deviantart.reachstereo:http/art/SE-mods-313298903, com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/, com.deviantart.reachthegoddess:http/art/retouched-160219962, com.deviantart.reachthegoddess:http/badges/, com.deviantart.reachthegoddess:http/favourites/, com.deviantart.reachthetop:http/art/Blue-Jean-Baby-82204657 (1444006227844530177), com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException; java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early EOF
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at
Re: Total Term Frequency per ResultSet in Solr 4.3 ?
Hi Tony,

Have you seen the TermVectorComponent (http://wiki.apache.org/solr/TermVectorComponent)? It will return the TermVectors for the documents in your result set (note that the rows parameter matters if you want results for the whole set, the default is 10). TermVectors also must be stored for each field that you want term frequency returned for.

Suppose you have the query http://localhost:8983/solr/collection1/tvrh?q=cable&fl=includes&tv.tf=true on the example that comes packaged with Solr. Then part of the response is:

<lst name="termVectors">
  <str name="uniqueKeyFieldName">id</str>
  <lst name="IW-02">
    <str name="uniqueKey">IW-02</str>
  </lst>
  <lst name="9885A004">
    <str name="uniqueKey">9885A004</str>
    <lst name="includes">
      <lst name="32mb"><int name="tf">1</int></lst>
      <lst name="av"><int name="tf">1</int></lst>
      <lst name="battery"><int name="tf">1</int></lst>
      <lst name="cable"><int name="tf">2</int></lst>
      <lst name="card"><int name="tf">1</int></lst>
      <lst name="sd"><int name="tf">1</int></lst>
      <lst name="usb"><int name="tf">1</int></lst>
    </lst>
  </lst>
  <lst name="3007WFP">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cable"><int name="tf">1</int></lst>
      <lst name="usb"><int name="tf">1</int></lst>
    </lst>
  </lst>
  <lst name="MA147LL/A">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cable"><int name="tf">1</int></lst>
      <lst name="earbud"><int name="tf">1</int></lst>
      <lst name="headphones"><int name="tf">1</int></lst>
      <lst name="usb"><int name="tf">1</int></lst>
    </lst>
  </lst>
</lst>

Then you can use an XPath query like sum(//lst[@name='cable']/int[@name='tf']), where 'cable' was the term, to calculate the term frequency in the 'includes' field for the whole result set. You could extend this to get the term frequency across all fields for your result set with some alterations to the query and schema.xml configuration. Alternately you could get the response as json (wt=json) and use javascript to sum. I know this is not terribly efficient but, if I'm understanding your request correctly, it's possible.

Cheers,
Tricia

On Thu, Jul 4, 2013 at 10:24 AM, Tony Mullins tonymullins...@gmail.com wrote:

So what is the workaround for this problem? Can it be done without changing any source code?

Thanks,
Tony

On Thu, Jul 4, 2013 at 8:01 PM, Yonik Seeley yo...@lucidworks.com wrote:

Ah, sorry - I thought you were after docfreq, not termfreq.
-Yonik
http://lucidworks.com

On Thu, Jul 4, 2013 at 10:57 AM, Tony Mullins tonymullins...@gmail.com wrote:

Hi Yonik,

With facet it didn't work. Please see the result set doc below

http://localhost:8080/solr/collection2/select?fl=*,amazing_freq:termfreq%28product,%27amazing%27%29,spider_freq:termfreq%28product,%27spider%27%29&fq=id%3A27&q=spider&fl=*&df=product&wt=xml&indent=true&facet=true&facet.query=product:spider&facet.query=product:amazing&rows=20

<doc>
  <str name="id">27</str>
  <str name="type">Movies</str>
  <str name="format">dvd</str>
  <str name="product">The amazing spider man is amazing spider the spider</str>
  <int name="popularity">1</int>
  <long name="_version_">1439641369145507840</long>
  <int name="amazing_freq">2</int>
  <int name="spider_freq">3</int>
</doc>
</result>
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="product:spider">1</int>
    <int name="product:amazing">1</int>
  </lst>

As you can see facet is actually just returning the no. of docs found against those keywords, not the actual frequency. Actual frequency is returned by the fields 'amazing_freq' & 'spider_freq'!

So is there any workaround for this to get the total of term-frequency in resultset without any modification to Solr source code?

Thanks,
Tony

On Thu, Jul 4, 2013 at 7:05 PM, Yonik Seeley yo...@lucidworks.com wrote:

If you just want to retrieve those counts, this seems like simple faceting.
q=something
facet=true
facet.query=product:hunger
facet.query=product:games

-Yonik
http://lucidworks.com

On Thu, Jul 4, 2013 at 9:45 AM, Tony Mullins tonymullins...@gmail.com wrote:

Hi,

I have lots of crawled data, indexed in my Solr (4.3.0), and let's say a user creates a search criteria 'X1' and he/she wants to know the occurrence of a specific term in the result set of that 'X1' search criteria. And then again he/she creates another search criteria 'X2' and he/she wants to know the occurrence of that same term in the result set of that 'X2' search criteria.

At the moment if I give termfreq(field,term) then it gives me the term frequency per document, and if I use totaltermfreq(field,term), it gives me the total term frequency in the entire index, not in the result set of my search criteria.

So what I need is your help to find how to get the total occurrence of a term in a query's result set.

If this is my result set:

<doc>
  <str name="type">Movies</str>
  <str name="format">dvd</str>
  <str name="product">The Hunger Games</str>
</doc>
<doc>
  <str name="type">Books</str>
  <str name="format">paperback</str>
  <str name="product">The
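A small sketch of the XPath-summing approach Tricia describes earlier in this thread; the URL targets the stock Solr example, rows is an assumed value that must cover the whole result set, and only the tf entries for one term are summed:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class TermFreqSum {
  public static void main(String[] args) throws Exception {
    // tvrh query against the packaged example; rows is a placeholder sized to the result set
    String url = "http://localhost:8983/solr/collection1/tvrh?q=cable&fl=includes&tv.tf=true&rows=1000";
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(url);
    Double total = (Double) XPathFactory.newInstance().newXPath()
        .evaluate("sum(//lst[@name='cable']/int[@name='tf'])", doc, XPathConstants.NUMBER);
    System.out.println("total tf for 'cable' in the result set: " + total.intValue());
  }
}
```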
SolrEntityProcessor doesn't grok responseHeader tag in Ancient Solr 1.2 source
Hi,

I'd like to use the SolrEntityProcessor to partially migrate an old index to Solr 4.1. The source is pretty old (dated 2006-06-10 16:05:12Z)... maybe Solr 1.2? My data-config.xml is based on the SolrEntityProcessor example (http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor) and wt=xml. I'm getting an error from SolrJ complaining about <responseHeader><status>0</status><QTime>1</QTime></responseHeader> in the response. Does anyone know of a work-around?

Thanks,
Tricia

1734 T12 C0 oasc.SolrException.log SEVERE Exception while processing: sep document : SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException: org.apache.solr.common.SolrException: parsing error
Caused by: org.apache.solr.common.SolrException: parsing error
Caused by: java.lang.RuntimeException: this must be known type! not: responseHeader
at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:222)
at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:128)
... 43 more
Re: SolrEntityProcessor doesn't grok responseHeader tag in Ancient Solr 1.2 source
Thanks Erik. I remember Solr Flare :)

On Tue, Apr 23, 2013 at 11:56 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

You might be out of luck with the SolrEntityProcessor. I'd recommend writing a simple little script that pages through /select?q=*:* from the source Solr and writes to the destination Solr. Back in the day there was this fun little beast https://github.com/erikhatcher/solr-ruby-flare/blob/master/solr-ruby/lib/solr/importer/solr_source.rb where you could do something like this:

Solr::Indexer.new(SolrSource.new(...), mapping).index

Erik

On Apr 23, 2013, at 13:41, P Williams wrote:

Hi,

I'd like to use the SolrEntityProcessor to partially migrate an old index to Solr 4.1. The source is pretty old (dated 2006-06-10 16:05:12Z)... maybe Solr 1.2? My data-config.xml is based on the SolrEntityProcessor example (http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor) and wt=xml. I'm getting an error from SolrJ complaining about <responseHeader><status>0</status><QTime>1</QTime></responseHeader> in the response. Does anyone know of a work-around?

Thanks,
Tricia

1734 T12 C0 oasc.SolrException.log SEVERE Exception while processing: sep document : SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException: org.apache.solr.common.SolrException: parsing error
Caused by: org.apache.solr.common.SolrException: parsing error
Caused by: java.lang.RuntimeException: this must be known type! not: responseHeader
at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:222)
at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:128)
... 43 more
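A rough sketch of the kind of paging script Erik suggests, in Java rather than Ruby. Hostnames are placeholders, only single-valued stored fields are copied (multi-valued <arr> fields and type handling are ignored), and the old XML response is parsed directly with DOM since SolrJ's parser is what chokes on it in the first place:

```java
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MigrateOldIndex {
  public static void main(String[] args) throws Exception {
    String source = "http://old-host:8983/solr/select?q=*:*&wt=xml&fl=*";              // placeholder source (old Solr)
    HttpSolrServer dest = new HttpSolrServer("http://new-host:8983/solr/collection1"); // placeholder destination
    DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    int rows = 500;
    for (int start = 0; ; start += rows) {
      Document xml = db.parse(new URL(source + "&start=" + start + "&rows=" + rows).openStream());
      NodeList docs = xml.getElementsByTagName("doc");
      if (docs.getLength() == 0) break;          // no more pages
      for (int i = 0; i < docs.getLength(); i++) {
        SolrInputDocument out = new SolrInputDocument();
        NodeList fields = docs.item(i).getChildNodes();
        for (int j = 0; j < fields.getLength(); j++) {
          if (fields.item(j) instanceof Element) {
            Element f = (Element) fields.item(j);
            out.addField(f.getAttribute("name"), f.getTextContent()); // single-valued fields only
          }
        }
        dest.add(out);
      }
      dest.commit();
    }
    dest.shutdown();
  }
}
```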
Re: How do I recover the position and offset a highlight for solr (4.1/4.2)?
Hi,

It doesn't have the offset information, but check out my patch https://issues.apache.org/jira/browse/SOLR-4722 which outputs the position of each term that's been matched. I'm eager to get some feedback on this approach and any improvements that might be suggested.

Cheers,
Tricia

On Wed, Mar 27, 2013 at 8:28 AM, Skealler Nametic bchaillou...@gmail.com wrote:

Hi,

I would like to retrieve the position and offset of each highlighting found. I searched on the internet, but I have not found the exact solution to my problem...
Results Order When Performing Wildcard Query
Hi,

I wrote a test of my application which revealed a Solr oddity (I think). The test, which I wrote on Windows 7 and which makes use of the solr-test-framework (http://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html), fails under Ubuntu 12.04 because the Solr results I expected for a wildcard query of the test data are ordered differently under Ubuntu than Windows. On both Windows and Ubuntu all items in the result set have a score of 1.0 and appear to be ordered by docid (which looks like it corresponds to alphabetical unique id on Windows but not Ubuntu). I'm guessing that the root of my issue is that a different docid was assigned to the same document on each operating system. The data was imported using a DataImportHandler configuration during a @BeforeClass step in my JUnit test on both systems.

Any suggestions on how to ensure a consistently ordered wildcard query result set for testing?

Thanks,
Tricia
Re: Results Order When Performing Wildcard Query
Hey Shawn,

My gut says the difference in assignment of docids has to do with how the FileListEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor) works on the two operating systems. My guess is that the documents are updated/imported in a different order, but I haven't tested that theory. I still think it's kind of odd that there would be a difference.

Indexes are created from scratch in my test, so it's not that. java -version reports the same values on both machines:

java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) Client VM (build 23.7-b01, mixed mode)

The explicit (arbitrary non-score) sort parameter will work as a work-around to get my test to pass in both environments while I think about this some more. Thanks!

Cheers,
Tricia

On Tue, Apr 9, 2013 at 2:13 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/9/2013 12:08 PM, P Williams wrote:
I wrote a test of my application which revealed a Solr oddity (I think). The test which I wrote on Windows 7 and makes use of the solr-test-framework (http://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html) fails under Ubuntu 12.04 because the Solr results I expected for a wildcard query of the test data are ordered differently under Ubuntu than Windows. On both Windows and Ubuntu all items in the result set have a score of 1.0 and appear to be ordered by docid (which looks like it corresponds to alphabetical unique id on Windows but not Ubuntu). I'm guessing that the root of my issue is that a different docid was assigned to the same document on each operating system.

It might be due to differences in how Java works on the two platforms, or even something as simple as different Java versions. I don't know a lot about the underlying Lucene stuff, so this next sentence may not be correct: If you are not starting from an index where the actual index directory was deleted before the test started (rather than deleting all documents), that might produce different internal Lucene document ids.

The data was imported using a DataImportHandler configuration during a @BeforeClass step in my JUnit test on both systems. Any suggestions on how to ensure a consistently ordered wildcard query result set for testing?

Include an explicit sort parameter. That way it will depend on the data, not the internal Lucene representation.

Thanks,
Shawn
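The work-around in test form is just an explicit sort in the request; a minimal sketch (config paths, schema fields, fixture ids, and counts are made up):

```java
import org.apache.solr.SolrTestCaseJ4;
import org.junit.BeforeClass;
import org.junit.Test;

public class WildcardOrderTest extends SolrTestCaseJ4 {
  @BeforeClass
  public static void beforeClass() throws Exception {
    initCore("solrconfig.xml", "schema.xml");   // placeholder test config
  }

  @Test
  public void testDeterministicOrder() {
    // an arbitrary non-score sort makes the wildcard result order OS-independent
    assertQ("wildcard query with explicit sort",
        req("q", "id:*", "sort", "id asc"),
        "//result[@numFound='3']",
        "//doc[1]/str[@name='id'][.='a']");
  }
}
```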
Re: Highlighting data stored outside of Solr
Your problem seems really similar to "It should be possible to highlight external text" (https://issues.apache.org/jira/browse/SOLR-1397) in JIRA.

Tricia

On Tue, Dec 11, 2012 at 12:48 PM, Michael Ryan mr...@moreover.com wrote:

Has anyone ever attempted to highlight a field that is not stored in Solr? We have been considering not storing fields in Solr, but still would like to use Solr's built-in highlighting. On first glance, it looks like it would be fairly simple to modify DefaultSolrHighlighter to get the stored fields from an external source. We already do not use term vectors, so no concerns there. Any gotchas that I am not seeing?

-Michael
Re: Using
Hi,

Just wanted to update with a workaround.

<dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0" conf="test->default">
  <exclude type="orbit"/>
</dependency>

Works for me to test my configs and project code with SolrTestCaseJ4 using IVY as a dependency manager.

Does anyone else think it's odd that the directory structure solr.home/collection1 is hard coded into the test-framework?

Regards,
Tricia

On Mon, Oct 15, 2012 at 11:19 AM, P Williams williams.tricia.l...@gmail.com wrote:

Hi,

Thanks for the suggestions. Didn't work for me :(

I'm calling

<dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0" conf="test->default"/>

which depends on org.eclipse.jetty:jetty-server which depends on org.eclipse.jetty.orbit:jettty-servlet

I think I'm experiencing https://jira.codehaus.org/browse/JETTY-1493. The pom file for http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.pom contains <packaging>orbit</packaging>, so ivy looks for http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit rather than http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.jar, hence my troubles. I'm an IVY newbie so maybe there is something I'm missing here? Is there another 'conf' value other than 'default' I can use?

Thanks,
Tricia

On Fri, Oct 12, 2012 at 4:32 PM, P Williams williams.tricia.l...@gmail.com wrote:

Hi,

Has anyone tried using

<dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0" conf="test->default"/>

with Apache IVY in their project? rev 3.6.1 works but any of the 4.0.0 ALPHA, BETA and release result in:

[ivy:resolve] :: problems summary ::
[ivy:resolve] WARNINGS
[ivy:resolve] [FAILED ] org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit: (0ms)
[ivy:resolve] shared: tried
[ivy:resolve] C:\Users\pjenkins\.ant/shared/org.eclipse.jetty.orbit/javax.servlet/3.0.0.v201112011016/orbits/javax.servlet.orbit
[ivy:resolve] public: tried
[ivy:resolve] http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
[ivy:resolve] ::
[ivy:resolve] :: FAILED DOWNLOADS::
[ivy:resolve] :: ^ see resolution messages for details ^ ::
[ivy:resolve] ::
[ivy:resolve] :: org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

Can anybody point me to the source of this error or a workaround?

Thanks,
Tricia
Re: How does Solr know which relative paths to use?
Hi Dotan,

It seems that the examples now use Multiple Cores (http://wiki.apache.org/solr/CoreAdmin) by default. If your test server is based on the stock example, you should see a solr.xml file in your CWD path which is how Solr knows about the relative paths. There should also be a README.txt file that will tell you more about how the directory is expected to be organized.

Cheers,
Tricia

On Tue, Oct 16, 2012 at 3:50 PM, Dotan Cohen dotanco...@gmail.com wrote:

I have just installed Solr 4.0 on a test server. I start it like so:

$ pwd
/some/dir
$ java -jar start.jar

The Solr Instance now looks like this:

CWD       /some/dir
Instance  /some/dir/solr/collection1
Data      /some/dir/solr/collection1/data
Index     /some/dir/solr/collection1/data/index

From where did the additional relative paths 'collection1', 'collection1/data', and 'collection1/data/index' come from? I know that I can change the value of CWD with the -Dsolr.solr.home flag, but what affects the relative paths mentioned?

Thanks.

--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
Re: Using
Hi,

Thanks for the suggestions. Didn't work for me :(

I'm calling

<dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0" conf="test->default"/>

which depends on org.eclipse.jetty:jetty-server which depends on org.eclipse.jetty.orbit:jettty-servlet

I think I'm experiencing https://jira.codehaus.org/browse/JETTY-1493. The pom file for http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.pom contains <packaging>orbit</packaging>, so ivy looks for http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit rather than http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.jar, hence my troubles. I'm an IVY newbie so maybe there is something I'm missing here? Is there another 'conf' value other than 'default' I can use?

Thanks,
Tricia

On Fri, Oct 12, 2012 at 4:32 PM, P Williams williams.tricia.l...@gmail.com wrote:

Hi,

Has anyone tried using

<dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0" conf="test->default"/>

with Apache IVY in their project? rev 3.6.1 works but any of the 4.0.0 ALPHA, BETA and release result in:

[ivy:resolve] :: problems summary ::
[ivy:resolve] WARNINGS
[ivy:resolve] [FAILED ] org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit: (0ms)
[ivy:resolve] shared: tried
[ivy:resolve] C:\Users\pjenkins\.ant/shared/org.eclipse.jetty.orbit/javax.servlet/3.0.0.v201112011016/orbits/javax.servlet.orbit
[ivy:resolve] public: tried
[ivy:resolve] http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
[ivy:resolve] ::
[ivy:resolve] :: FAILED DOWNLOADS::
[ivy:resolve] :: ^ see resolution messages for details ^ ::
[ivy:resolve] ::
[ivy:resolve] :: org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

Can anybody point me to the source of this error or a workaround?

Thanks,
Tricia
Re: Using
Apologies, there was a typo in my last message. org.eclipse.jetty.orbit:jettty-servlet should have been org.eclipse.jetty.orbit:javax.servlet On Mon, Oct 15, 2012 at 11:19 AM, P Williams williams.tricia.l...@gmail.com wrote: Hi, Thanks for the suggestions. Didn't work for me :( I'm calling dependency org=org.apache.solr name=solr-test-framework rev=4.0.0 conf=test-default/ which depends on org.eclipse.jetty:jetty-server which depends on org.eclipse.jetty.orbit:jettty-servlet I think I'm experiencing https://jira.codehaus.org/browse/JETTY-1493. The pom file for http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.pom contains packagingorbit/packaging, so ivy looks for http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit rather than http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.jar hence my troubles. I'm an IVY newbie so maybe there is something I'm missing here? Is there another 'conf' value other than 'default' I can use? Thanks, Tricia On Fri, Oct 12, 2012 at 4:32 PM, P Williams williams.tricia.l...@gmail.com wrote: Hi, Has anyone tried using dependency org=org.apache.solr name=solr-test-framework rev=4.0.0 conf=test-default/ with Apache IVY in their project? rev 3.6.1 works but any of the 4.0.0 ALPHA, BETA and release result in: [ivy:resolve] :: problems summary :: [ivy:resolve] WARNINGS [ivy:resolve] [FAILED ] org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit: (0ms) [ivy:resolve] shared: tried [ivy:resolve] C:\Users\pjenkins\.ant/shared/org.eclipse.jetty.orbit/javax.servlet/3.0.0.v201112011016/orbits/javax.servlet.orbit [ivy:resolve] public: tried [ivy:resolve] http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit [ivy:resolve] :: [ivy:resolve] :: FAILED DOWNLOADS:: [ivy:resolve] :: ^ see resolution messages for details ^ :: [ivy:resolve] :: [ivy:resolve] :: org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit [ivy:resolve] :: [ivy:resolve] [ivy:resolve] [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS Can anybody point me to the source of this error or a workaround? Thanks, Tricia
Re: Solr - Tika(?) memory leak
Hi,

I'm not sure which version of Solr/Tika you're using, but I had a similar experience which turned out to be the result of a design change to PDFBox. https://issues.apache.org/jira/browse/SOLR-2886

Tricia

On Sat, Jan 14, 2012 at 12:53 AM, Wayne W waynemailingli...@gmail.com wrote:

Hi,

we're using Solr running on Tomcat with 1GB in production, and of late we've been having a huge number of OutOfMemory issues. From what I can tell this is coming from the Tika extraction of the content. I've processed the Java dump file using a memory analyzer and it's pretty clear, at least, which class is involved. It seems like a leak to me, as we don't parse any files larger than 20M, yet these objects are taking up ~700M. I've attached 2 screen shots from the tool (not sure if you receive attachments). But to summarize (class, number of objects, used heap size, retained heap size):

org.apache.xmlbeans.impl.store.Xobj$ElementXobj: 838,993 objects, 80,533,728 used, 604,606,040 retained
org.apache.poi.openxml4j.opc.ZipPackage: 2 objects, 112 used, 87,009,848 retained
char[]: 58732,216,960 38,216,950

We're really desperate to find a solution to this - any ideas or help is greatly appreciated.

Wayne
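If the PDFBox/Tika parse itself is what blows up the heap inside Solr, one way to take the pressure off the server (a sketch only, not a fix for the underlying PDFBox behaviour) is to run Tika on the client, cap how much text it may buffer, and send just the extracted text to Solr with SolrJ. The field names and the 10 MB limit below are made up for illustration; HttpSolrServer is the SolrJ client class in the 3.6/4.x line:

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;
  import org.xml.sax.SAXException;

  public class ClientSideExtract {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
      AutoDetectParser parser = new AutoDetectParser();

      for (String path : args) {
        // Cap extracted text at ~10 MB so one oversized or broken document cannot exhaust the heap.
        BodyContentHandler text = new BodyContentHandler(10 * 1024 * 1024);
        Metadata meta = new Metadata();
        InputStream in = new FileInputStream(new File(path));
        try {
          parser.parse(in, text, meta, new ParseContext());
        } catch (SAXException e) {
          // Thrown when the write limit is reached; keep whatever text was extracted so far.
        } finally {
          in.close(); // the stream is closed even if Tika throws
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path);
        doc.addField("fulltext", text.toString());
        solr.add(doc);
      }
      solr.commit();
    }
  }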
Re: avoid overwrite in DataImportHandler
Ah. Thanks Erick. I see now that my question is different from sabman's. Is there a way to use the DataImportHandler's full-import command so that it does not delete the existing material before it begins?

Thanks,
Tricia

On Thu, Dec 8, 2011 at 6:35 AM, Erick Erickson erickerick...@gmail.com wrote:

This is all controlled by Solr via the uniqueKey field in your schema. Just remove that entry. But then it's all up to you to handle the fact that there will be multiple documents with the same ID all returned as a result of querying. And it won't matter what program adds data, *nothing* will be overwritten; DIH has no part in that decision.

Deduplication is about defining some fields in your record and avoiding adding another document if the contents are close, where close is a slippery concept. I don't think it's related to your problem at all.

Best
Erick

On Wed, Dec 7, 2011 at 3:27 PM, P Williams williams.tricia.l...@gmail.com wrote:

Hi,

I've wondered the same thing myself. I feel like the clean parameter has something to do with it, but it doesn't work as I'd expect either. Thanks in advance to anyone who can answer this question.

clean: (default 'true'). Tells whether to clean up the index before the indexing is started.

Tricia

On Wed, Dec 7, 2011 at 12:49 PM, sabman sab...@gmail.com wrote:

I have a unique ID defined for the documents I am indexing. I want to avoid overwriting the documents that have already been indexed. I am using XPathEntityProcessor and TikaEntityProcessor to process the documents. The DataImportHandler does not seem to have the option to set overwrite=false. I have read in some other forums to use deduplication instead, but I don't see how it is related to my problem. Any help on this (or an explanation of how deduplication would apply to my problem) would be great.

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
Sent from the Solr - User mailing list archive at Nabble.com.
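For the archives: the clean parameter is the switch for this. Passing clean=false on the full-import request stops DIH from deleting the index before it starts; documents whose uniqueKey matches an incoming record are still overwritten by the normal update semantics. A sketch, assuming the handler is registered at /dataimport as in the example solrconfig.xml:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true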
Re: avoid overwrite in DataImportHandler
Hi,

I've wondered the same thing myself. I feel like the clean parameter has something to do with it, but it doesn't work as I'd expect either. Thanks in advance to anyone who can answer this question.

clean: (default 'true'). Tells whether to clean up the index before the indexing is started.

Tricia

On Wed, Dec 7, 2011 at 12:49 PM, sabman sab...@gmail.com wrote:

I have a unique ID defined for the documents I am indexing. I want to avoid overwriting the documents that have already been indexed. I am using XPathEntityProcessor and TikaEntityProcessor to process the documents. The DataImportHandler does not seem to have the option to set overwrite=false. I have read in some other forums to use deduplication instead, but I don't see how it is related to my problem. Any help on this (or an explanation of how deduplication would apply to my problem) would be great.

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH doesn't handle bound namespaces?
Hi Gary,

From http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource:

"It does not support namespaces, but it can handle xmls with namespaces. When you provide the xpath, just drop the namespace and give the rest (eg if the tag is 'dc:subject' the mapping should just contain 'subject'). Easy, isn't it? And you didn't need to write one line of code! Enjoy"

You should be able to use xpath="//titleInfo/title" without making any modifications (removing the namespace) to your XML. I hope that answers your question.

Regards,
Tricia

On Mon, Oct 31, 2011 at 9:24 AM, Moore, Gary gary.mo...@ars.usda.gov wrote:

I'm trying to import some MODS XML using DIH. The XML uses bound namespacing:

  <mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:mods="http://www.loc.gov/mods/v3"
        xmlns:xlink="http://www.w3.org/1999/xlink"
        xmlns="http://www.loc.gov/mods/v3"
        xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd"
        version="3.4">
    <mods:titleInfo>
      <mods:title>Malus domestica: Arnold</mods:title>
    </mods:titleInfo>
  </mods>

However, XPathEntityProcessor doesn't seem to handle xpaths of the type xpath="//mods:titleInfo/mods:title". If I remove the namespaces from the source XML:

  <mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:mods="http://www.loc.gov/mods/v3"
        xmlns:xlink="http://www.w3.org/1999/xlink"
        xmlns="http://www.loc.gov/mods/v3"
        xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd"
        version="3.4">
    <titleInfo>
      <title>Malus domestica: Arnold</title>
    </titleInfo>
  </mods>

then xpath="//titleInfo/title" works just fine. Can anyone confirm that this is the case and, if so, recommend a solution?

Thanks
Gary

Gary Moore
Technical Lead
LCA Digital Commons Project
NAL/ARS/USDA
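If it helps, a data-config.xml along these lines should pick the title out of the namespaced MODS file without editing the source document. This is only a sketch: the entity name, file url and Solr field name are placeholders, and the key point is simply that the xpath drops the mods: prefixes:

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8"/>
    <document>
      <entity name="mods"
              processor="XPathEntityProcessor"
              url="/path/to/record.xml"
              forEach="/mods">
        <!-- namespace prefixes dropped, as the wiki advises -->
        <field column="title" xpath="//titleInfo/title"/>
      </entity>
    </document>
  </dataConfig>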
Re: Stream still in memory after tika exception? Possible memoryleak?
Hi All,

I'm experiencing a similar problem to the others in the thread. I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to apache-solr-4.0-2011-10-14_08-56-59.war and then apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 PDFs, of various sizes, using the TikaEntityProcessor.

My indexing would run to completion and was completely successful under the June build. The only error was readability of the fulltext in highlighting. This was fixed in Tika 0.10 (TIKA-611). I chose to use the October 14 build of Solr because Tika 0.10 had recently been included (SOLR-2372).

On the same machine, without changing any memory settings, my initial problem is a PermGen error. Fine, I increase the PermGen space. I've set the onError parameter to skip for the TikaEntityProcessor. Now I get several (6)

SEVERE: Exception thrown while getting data
java.net.SocketTimeoutException: Read timed out
SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url url removed # 2975

pairs. And after ~3881 documents, with auto commit set unreasonably frequently, I consistently get an Out of Memory Error:

SEVERE: Exception while processing: f document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space

The stack trace points to org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151) and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718). The October 30 build performs identically. Funny thing is that monitoring via JConsole doesn't reveal any memory issues.

Because the Out of Memory error did not occur in June, this leads me to believe that a bug has been introduced to the code since then. Should I open an issue in JIRA?

Thanks,
Tricia

On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs jacob...@gmail.com wrote:

Hi Erick,

I am using Solr 3.3.0, but with 1.4.1 the same problems. The connector is a homemade program in the C# programming language and is posting via http remote streaming (i.e. http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1). I'm using Tika to extract the content (comes with the Solr Cell). A possible problem is that the filestream needs to be closed by the client application after extracting, but it seems that something goes wrong while getting a Tika exception: the stream never leaves memory. At least that is my assumption.

What is the common way to extract content from office files (pdf, doc, rtf, xls etc) and index them? To write a content extractor / validator yourself? Or is it possible to do this with the Solr Cell without getting a huge memory consumption? Please let me know. Thanks in advance.

Marc

2011/8/30 Erick Erickson erickerick...@gmail.com

What version of Solr are you using, and how are you indexing? DIH? SolrJ? I'm guessing you're using Tika, but how?

Best
Erick

On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs jacob...@gmail.com wrote:

Hi all,

Currently I'm testing Solr's indexing performance, but unfortunately I'm running into memory problems. It looks like Solr is not closing the filestream after an exception, but I'm not really sure. The current system I'm using has 150GB of memory, and while I'm indexing the memory consumption is growing and growing (eventually more than 50GB). In the attached graph I indexed about 70k office documents (pdf, doc, xls etc) and between 1 and 2 percent throw an exception.

The commits are after 64MB, 60 seconds or after a job (there are 6 evenly divided jobs). After indexing the memory consumption isn't dropping. Even after an optimize command it's still there. What am I doing wrong? I can't imagine I'm the only one with this problem. Thanks in advance!

Kind regards,

Marc
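A note for anyone debugging the same symptom: when JConsole looks flat but the process still dies with an OutOfMemoryError, a heap dump usually settles where the memory went. Assuming a Sun/Oracle JDK and access to the Tomcat process id, something along these lines works (paths and the pid are placeholders), and the resulting .hprof file can be opened in Eclipse MAT or a similar analyzer:

  # dump the live heap of the running Tomcat/Solr process
  jmap -dump:live,format=b,file=/var/tmp/solr-heap.hprof <tomcat-pid>

  # or let the JVM write a dump automatically at the moment the OOM occurs
  JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp"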
JSON and DataImportHandler
Hi All, Has anyone gotten the DataImportHandler to work with json as input? Is there an even easier alternative to DIH? Could you show me an example? Many thanks, Tricia
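As far as I know DIH never shipped a JSON entity processor, so the usual answer is to skip DIH and post JSON straight to Solr's JSON update handler (available since Solr 3.1, registered at /update/json in the example solrconfig.xml). A sketch with made-up field values:

  curl 'http://localhost:8983/solr/update/json?commit=true' \
       -H 'Content-type:application/json' \
       -d '[{"id": "doc1", "title": "First document"},
            {"id": "doc2", "title": "Second document"}]'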