facets on external field
Hi, I am using an external file field for the price field since it changes frequently. Can I generate facets using an external field, and if so, how? I understand that faceting requires indexed values, and external file fields are not actually indexed. Is there any solution for this problem? -- Thanks Regards, Jainam Vora
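For context, an ExternalFileField is declared in the schema roughly like this (a hedged sketch; the type and field names are illustrative, not from the original question):

    <fieldType name="pfile" class="solr.ExternalFileField" keyField="id" defVal="0" valType="pfloat" />
    <field name="price" type="pfile" indexed="false" stored="false" />

Because the values live in an external_<fieldname> file next to the index rather than in the index itself, regular field faceting cannot see them. One workaround that is sometimes suggested (an assumption on my part, not confirmed in this thread) is to emulate range buckets with facet.query plus the frange parser, since function queries can read external fields:

    facet=true
    facet.query={!frange l=0 u=100}field(price)
    facet.query={!frange l=100 u=500}field(price)

Each facet.query then returns a count for its price bucket without the field ever being indexed.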
Re: Information regarding "This conf directory is not valid" SolrException.
I opened SOLR-7408 to track that. Shai

On Mon, Apr 13, 2015 at 3:31 PM, Bar Weiner weiner@gmail.com wrote: After some additional debugging, I think that this issue is caused by a possible race condition introduced to ZkController in Solr-5.0.0. My concerns are around the unregister(...) function in ZkController. In the current code, all cores are traversed, and if one of the cores is using configLocation, the configLocation variable is cleared so that it is not removed from confDirectoryListeners. A possible issue can occur if, after the list of cores is fetched, a new core is added. If this new core uses the same config, then traversing the (now stale) list of cores will not find that the configuration is used by another core, and it will be removed from confDirectoryListeners even though it is still needed. In addition, when adding a watch to a configuration in the watchZKConfDir(...) function, no lock is taken on confDirectoryListeners, unlike every other place where this map is accessed. A possible solution for this issue:
- Add synchronized (confDirectoryListeners) to watchZKConfDir(...).
- In the unregister(...) function, traverse the list of cores twice. Before the first loop, obtain a lock on confDirectoryListeners, then check whether any core is using configLocation, then remove configLocation from confDirectoryListeners if needed, and release the lock. The second loop is used for the rest of the code.
I will be glad for any input: is this a real issue, or did I miss something? Is the suggested solution valid? Thanks, Bar

2015-04-01 18:16 GMT+03:00 Bar Weiner weiner@gmail.com: Hi, I'm working on upgrading a project from solr-4.10.3 to solr-5.0.0. As part of our JUnit tests we have a few tests for deleting/creating collections. Each test creates/deletes a collection with a different name, but they all share the same config in ZK.
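A minimal sketch of the locking pattern Bar proposes, assuming the names described above (this illustrates the idea only; getCoreDescriptors() and getConfigSetLocation() are placeholder accessors, not the actual ZkController source):

    // Inside unregister(...): re-check usage and remove under one lock, so a
    // concurrently added core cannot slip in between the check and the removal.
    synchronized (confDirectoryListeners) {
        boolean stillInUse = false;
        for (CoreDescriptor cd : getCoreDescriptors()) {            // placeholder accessor
            if (configLocation.equals(cd.getConfigSetLocation())) { // placeholder accessor
                stillInUse = true;
                break;
            }
        }
        if (!stillInUse) {
            confDirectoryListeners.remove(configLocation);
        }
    }
    // ... the rest of unregister(...) continues outside the lock ...

The same synchronized (confDirectoryListeners) block would wrap the map access in watchZKConfDir(...).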
When running these tests in Eclipse everything works fine, but when running the same tests through Maven we get the following error, so I suspect this is a timing-related issue:

INFO org.apache.solr.rest.ManagedResourceStorage - Setting up ZooKeeper-based storage for the RestManager with znodeBase: /configs/SIMPLE_CONFIG
INFO org.apache.solr.rest.ManagedResourceStorage - Configured ZooKeeperStorageIO with znodeBase: /configs/SIMPLE_CONFIG
INFO org.apache.solr.rest.RestManager - Initializing RestManager with initArgs: {}
INFO org.apache.solr.rest.ManagedResourceStorage - Reading _rest_managed.json using ZooKeeperStorageIO:path=/configs/SIMPLE_CONFIG
INFO org.apache.solr.rest.ManagedResourceStorage - No data found for znode /configs/SIMPLE_CONFIG/_rest_managed.json
INFO org.apache.solr.rest.ManagedResourceStorage - Loaded null at path _rest_managed.json using ZooKeeperStorageIO:path=/configs/SIMPLE_CONFIG
INFO org.apache.solr.rest.RestManager - Initializing 0 registered ManagedResources
INFO org.apache.solr.handler.ReplicationHandler - Commits will be reserved for 1
INFO org.apache.solr.core.SolrCore - [mycollection1] Registered new searcher Searcher@3208a6c4[mycollection1] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
ERROR org.apache.solr.core.CoreContainer - Error creating core [mycollection1]: This conf directory is not valid
org.apache.solr.common.SolrException: This conf directory is not valid
    at org.apache.solr.cloud.ZkController.registerConfListenerForCore(ZkController.java:2229)
    at org.apache.solr.core.SolrCore.registerConfListener(SolrCore.java:2633)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:936)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:662)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:513)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:488)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:573)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:197)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:736)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at
RE: Indexing PDF and MS Office files
I entirely agree with Erick -- it is best to isolate Tika in its own JVM if you can -- bad things can happen if you don't [1] [2]. Erick's blog on SolrJ is fantastic. If you want to have Tika parse embedded documents/attachments, make sure to set the parser in the ParseContext before parsing:

    ParseContext context = new ParseContext();
    // add this line:
    context.set(Parser.class, _autoParser);
    InputStream input = new FileInputStream(file);

Tika 1.8 is soon to be released. If that doesn't fix your problems, please submit stack traces (and docs, if possible) to the Tika jira, and we'll try to make the fixes. Cheers, Tim [1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf [2] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
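Filling out that fragment: a self-contained sketch of how the ParseContext trick fits into a complete Tika parse (assumes a Tika 1.x jar on the classpath; variable names loosely follow the fragment above, and error handling is trimmed for brevity):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaParseExample {
        public static void main(String[] args) throws Exception {
            File file = new File(args[0]);
            AutoDetectParser autoParser = new AutoDetectParser();
            ParseContext context = new ParseContext();
            // Registering the parser in the context makes Tika recurse into
            // embedded documents/attachments instead of skipping them.
            context.set(Parser.class, autoParser);
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
            Metadata metadata = new Metadata();
            try (InputStream input = new FileInputStream(file)) {
                autoParser.parse(input, handler, metadata, context);
            }
            System.out.println("Title: " + metadata.get("title"));
            System.out.println(handler.toString());
        }
    }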
Merge indexes in MapReduce
Is there a ready-to-use tool to merge existing indexes in MapReduce? We have real-time search and want to merge (and optimize) its indexes into one, so we don't need to build the index in MapReduce, only merge it.
5.1 'unique' facet function / calcDistinct
Hello, We are looking at a couple of options for using Solr to dynamically calculate unique values per field. In testing out Solr 5.1, I've been using the unique() facet function: http://yonik.com/solr-facet-functions/ Overall, loving the JSON Facet API, especially the sub-faceting thus far. Here's my two-part question:

I. When I use the unique aggregation function on a string field (uniqueValues:'unique(myStringField)'), it works as expected and returns the number of unique values. However, when I pass in an int -- or date -- field (uniqueValues:'unique(myIntField)'), the resulting count is 0. The cause might be something else, but if it can be replicated by another user, it would be great to discuss the unique function further -- in our current use case, we have a field with under 20 unique values, but the values are ints.

II. Is there a way to use the stats.calcdistinct functionality and only return the countDistinct portion of the response, and not the full list of distinct values provided in the distinctValues portion of the response? In a field with high cardinality the response size becomes too large. If there is no such option, could someone point me in the right direction for implementing a custom solution? Thank you for your time, Levan
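For anyone trying to reproduce part I, the request shape being described looks roughly like this (a sketch using the poster's field names; the collection name and URL are illustrative):

    curl http://localhost:8983/solr/mycollection/query -d 'q=*:*&rows=0&json.facet={
      uniqueStrings : "unique(myStringField)",
      uniqueInts    : "unique(myIntField)"
    }'

Per the report, uniqueStrings comes back with the expected count while uniqueInts comes back as 0.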
RE: Indexing PDF and MS Office files
This sounds like a Tika issue, let's move discussion to that list. If you are still having problems after you upgrade to Tika 1.8, please at least submit the stack traces (if you can) to the Tika jira. We may be able to find a document that triggers that stack trace in govdocs1 or the slice of CommonCrawl that Julien Nioche contributed to our eval effort. Tika is not perfect and it will fail on some files, but we are always working to improve it. Best, Tim
RE: Indexing PDF and MS Office files
+1 :) PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs, and cater for that fact in your requirements :-)
No servers hosting shard.
Hi, I have a setup of a 5-node SolrCloud (Lucene/Solr version 5.1.0) without replicas. When I am executing complex and large queries with wildcards, after some time I get the following exceptions. The index size on each node is around 170GB and the memory is set to -Xms20g -Xmx24g on each node. Empty shard!

org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:214)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:184)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

There is no OutOfMemory or any other major lead for me to understand what caused it. Maybe I am missing something. There are the following other exceptions:

SEVERE: null:org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://server:8080/solr/collection
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:342)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:193)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:313)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

WARNING: listener throws error org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/collection/params.json
    at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:163)
    at org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:919)
    at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2500)
    at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2366)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/collection/params.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
    at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
    at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:153)
    ... 3 more

The Zookeeper session timeout is set to 3. In the log file I can see logs of the following pattern for all the queries I fired:

INFO: [collection] webapp=/solr path=/search_handler params={sort=score+desc&start=0&q=(ft:search term)} status=0 QTime=time

If I am not wrong, they are getting executed, but somehow, as the shard has gone down (which I can see in /clusterstate.json under the log), the search is
Re: Indexing PDF and MS Office files
Erick, I tried indexing both ways -- SolrJ with Tika's AutoParser, as well as SolrCell's ExtractingRequestHandler. The majority of the PDF and Word documents are getting parsed properly and indexed into Solr. However, a minority of them keep failing with either a PDFParser or OfficeParser error. Not sure if this behaviour can be modified so that all the documents can be indexed. The business requirement we have is to index all the documents. However, if a small percentage of them fails, not sure what other ways exist to index them. Any help please? Thanks Regards Vijay

On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote: There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137 But, I personally am not a huge fan of pushing all the work onto Solr; in a production environment the Solr server is responsible for indexing, parsing the docs through Tika, perhaps searching, etc. This doesn't scale all that well. So an alternative is to use SolrJ with Tika, which is totally independent of what version of Tika is on the Solr server. Here's an example. http://lucidworks.com/blog/indexing-with-solrj/ Best, Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Thanks everyone for the responses. Now I am able to index PDF documents successfully. I have implemented manual extraction using Tika's AutoParser and the PDF functionality is working fine. However, the error with some MS Office Word documents still persists. The error message is java.lang.IllegalArgumentException: This paragraph is not the first one in the table, which eventually results in Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser. Upon some reading, it looks like it's a bug with Tika 1.5 that seems to have been fixed in Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ). I am new to Solr / Tika and hence wondering whether I can change the Tika library alone to v1.6 without impacting any of the libraries within Solr 4.10.2? Please let me know your response and how to get around this issue. Many thanks in advance. Thanks Regards Vijay

On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote: Vijay, You could try different Excel files with different formats to rule out whether the issue is with the Tika version being used. Thanks Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com wrote: Perhaps the PDF is protected and the content cannot be extracted? I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may not support some/all Office 2013 document formats.

On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content as a bitmap image, so no text is extracted. -- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following issues. Please let me know what is going wrong with the indexing process. I am using Solr 4.10.2 with the default example server configuration that comes with the Solr distribution.

PDF files - Indexing as such works fine, and when I query using *:* in the Solr query console, metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks, etc. Whatever the PDF file, content is not being displayed.

MS Office files - For some Office files, everything works perfectly and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. *Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser*

I am using SolrJ to index the documents and below is the code snippet related to indexing. Please let me know where the issue is occurring.

    static String solrServerURL = "http://localhost:8983/solr";
    static SolrServer solrServer = new HttpSolrServer(solrServerURL);
    static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
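The snippet is cut off above; for reference, a complete SolrCell indexing call via SolrJ typically looks something like the following (a hedged sketch against the Solr 4.x API; the file name, literal.id value, and field mapping are illustrative, not taken from the original code):

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class SolrCellExample {
        public static void main(String[] args) throws Exception {
            SolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr");
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("sample.pdf"), "application/pdf"); // file plus MIME type
            req.setParam("literal.id", "sample-1");   // supply the uniqueKey explicitly
            req.setParam("fmap.content", "text");     // map Tika's extracted body to the "text" field
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // commit when done
            solrServer.request(req);
            solrServer.shutdown();
        }
    }

If the content field stays empty with this kind of request, adding extractOnly=true (as Jack suggests above) shows whether Tika extracted any text at all.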
Re: Indexing PDF and MS Office files
Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction:
* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)
If you start command-line tools from your JVM, please have a look at commons-exec :-) Cheers, Siegfried Goeschl PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs, and cater for that fact in your requirements :-)
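A minimal sketch of the commons-exec route Siegfried mentions (the file paths, flags, and 60-second timeout are illustrative assumptions; pdftotext must be installed and on the PATH):

    import org.apache.commons.exec.CommandLine;
    import org.apache.commons.exec.DefaultExecutor;
    import org.apache.commons.exec.ExecuteWatchdog;

    public class PdfToTextExample {
        public static void main(String[] args) throws Exception {
            CommandLine cmd = new CommandLine("pdftotext");
            cmd.addArgument("-enc");
            cmd.addArgument("UTF-8");        // force UTF-8 output
            cmd.addArgument("input.pdf");
            cmd.addArgument("output.txt");
            DefaultExecutor executor = new DefaultExecutor();
            // Kill conversions that hang on pathological PDFs.
            executor.setWatchdog(new ExecuteWatchdog(60 * 1000L));
            int exitCode = executor.execute(cmd); // throws ExecuteException on non-zero exit
            System.out.println("pdftotext exited with " + exitCode);
        }
    }

The watchdog is the main reason to prefer commons-exec over a bare Runtime.exec(): a stuck external parser gets reaped instead of blocking the indexing pipeline.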
check If I am Still Leader
Hi, I am using Solr 4.10.0 with Tomcat and embedded Zookeeper. I use SolrCloud in my system. Each shard machine tries to reach/connect with the other cluster machines in order to index the document; it just checks if it is still the leader. I don't use replication, so why does it have to check who is the leader? How can I bypass this constraint and make my SolrCloud not use ClusterStateUpdater.checkIfIamStillLeader when I am indexing? Thanks, Adir.
Escaping in update XML messages
Hi, I am trying to delete some documents from my index by posting XML messages to Solr. The unique key for the documents in my index is their URL. The XML messages look like this:

<delete><query>url:"http://example.com/path/file"</query></delete>

For simple URLs everything works fine, but if the URL contains an '&' like this:

<delete><query>url:"http://example.com/path/file?a=foo&b=bar"</query></delete>

an error occurs because the XML is not valid: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '=' (code 61); expected a semi-colon after the reference for entity 'b'. Escaping '&' by using '&amp;' does not help, because the query

<delete><query>url:"http://example.com/path/file?a=foo&amp;b=bar"</query></delete>

does not match the url in my index. How do I need to escape or encode the URL in the XML message? Thank you! Jens
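Not an answer from the thread, but one way to sidestep hand-built XML entirely is to let SolrJ construct the delete message and escape the Lucene query metacharacters in the URL (a sketch; the Solr URL is illustrative):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class DeleteByUrlExample {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            String url = "http://example.com/path/file?a=foo&b=bar";
            // escapeQueryChars handles ':', '?', '&' and friends in the Lucene query;
            // SolrJ then takes care of XML/entity encoding on the wire.
            server.deleteByQuery("url:" + ClientUtils.escapeQueryChars(url));
            server.commit();
        }
    }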
Re: Indexing PDF and MS Office files
Thanks Allison. I tried with the mentioned changes. But still no luck. I am using the code from lucidworks site provided by Erick and now included the changes mentioned by you. But still the issue persists with a small percentage of documents (both PDF and MS Office documents) failing. Unfortunately, these documents are proprietary and client-confidential and hence I am not sure whether they can be uploaded into Jira. These files normally open in Adobe Reader and MS Office tools. Thanks Regards Vijay
SolrCloud - Collection Browsing
Hi, I have set up a SolrCloud on 3 machines - machine1, machine2 and machine3. The DirectoryFactory used is HDFS, where the collection index data is stored in HDFS within a Hadoop cluster. SolrCloud has been set up successfully and everything looks fine so far. I have uploaded the default configuration, i.e. the conf folder under the example/collection1 folder in the Solr installation directory, into Zookeeper. Now when I log in to the Solr Admin using http://machine1:8983/solr/admin, I am able to see the Solr Admin page, and when I click on Cloud, I can see all the shards and replicas properly in the browser. However, the issue comes when I try to open the page http://machine1:8983/solr/mycollection/browse. I am seeing an HTTP 500 lazy loading error. This looks like a trivial mistake somewhere, as the collection is set up fine and everything else works normally. However, when I browse the collection, this error occurs. Even when I open http://machine1:8983/solr/mycollection/query I get the JSON response properly, with numFound as 0. I was expecting behavior similar to how the /browse request provides the Solritas page. Note: I haven't changed any of the configuration in the conf directory. Should I modify solrconfig.xml to have a RequestHandler for /mycollection/browse, or is the default one sufficient? Can someone provide some pointers please to get this issue resolved? Thanks Regards Vijay
Re: Indexing PDF and MS Office files
Thanks Tim. I shall raise a Jira with the stack trace information. Thanks Regards Vijay
Re: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1
For the record, what I finally did is place those words I want spellcheck to ignore in spellcheck.collateParam.fq and the words I'd like to be checked in spellcheck.q. The collationQuery uses spellcheck.collateParam.fq, so all did_you_mean queries return results containing the words in spellcheck.collateParam.fq. Best regards, Elisabeth

2015-04-14 17:05 GMT+02:00 elisabeth benoit elisaelisael...@gmail.com: Thanks for your answer! I didn't realize this was not supposed to be done (a conjunction of DirectSolrSpellChecker and FileBasedSpellChecker). I got this idea on the mailing list while searching for a solution to get a list of words to ignore for the DirectSolrSpellChecker. Well well well, I'll try removing the check and see what happens. I'm not a Java programmer, but if I can find a simple solution I'll let you know. Thanks again, Elisabeth

2015-04-14 16:29 GMT+02:00 Dyer, James james.d...@ingramcontent.com: Elisabeth, Currently ConjunctionSolrSpellChecker only supports adding WordBreakSolrSpellchecker to IndexBased-, FileBased- or DirectSolrSpellChecker. In the future, it would be great if it could handle other spell checker combinations. For instance, if you had a (e)dismax query that searches multiple fields, to have a separate spellchecker for each of them. But CSSC is not hardened for this more general usage, as hinted in the API doc. The check done to ensure all spellcheckers use the same StringDistance object, I believe, is a safeguard against using this class for functionality it is not able to correctly support. It looks to me that SOLR-6271 was opened to fix the bug in that it is comparing references on the StringDistance. This is not a problem with WBSSC because this one does not support string distance at all. What you're hoping for, however, is that the requirement for the string distances to be the same be removed entirely. You could try modifying the code by removing the check. However, beware that you might not get the results you desire! But should this happen, please go ahead and fix it for your use case and then donate the code. This is something I've personally wanted for a long time. James Dyer Ingram Content Group

-Original Message- From: elisabeth benoit [mailto:elisaelisael...@gmail.com] Sent: Tuesday, April 14, 2015 7:37 AM To: solr-user@lucene.apache.org Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

Hello, I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and FileBasedSpellchecker in the same request. I've applied the change from patch 135.patch (cf. SOLR-6271). I tried running the command patch -p1 -i 135.patch --dry-run, but it didn't work, maybe because the patch was a fix to Solr 4.9, so I just replaced this line in ConjunctionSolrSpellChecker

    else if (!stringDistance.equals(checker.getStringDistance())) {
      throw new IllegalArgumentException("All checkers need to use the same StringDistance.");
    }

with

    else if (!stringDistance.equals(checker.getStringDistance())) {
      throw new IllegalArgumentException("All checkers need to use the same StringDistance!!! 1: "
          + checker.getStringDistance() + " 2: " + stringDistance);
    }

as was done in the patch, but still, when I send a spellcheck request, I get the error message:

All checkers need to use the same StringDistance!!! 1: org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db3 2: org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08

From the error message I gather both spellcheckers use the same distance measure, LuceneLevenshteinDistance, but they're not the same instance of LuceneLevenshteinDistance. Is the condition right? What should be done to fix this properly? Thanks, Elisabeth
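Since LuceneLevenshteinDistance evidently does not override equals(), two separate instances of the same class never compare equal, so the instance-based check can never pass when each spellchecker builds its own distance object. Purely as an illustration of one possible relaxation (this is not the SOLR-6271 patch), the check could compare the distance types instead:

    // Hypothetical relaxation in ConjunctionSolrSpellChecker: accept any two
    // StringDistance implementations of the same class.
    } else if (stringDistance.getClass() != checker.getStringDistance().getClass()) {
      throw new IllegalArgumentException(
          "All checkers need to use the same StringDistance type, found "
              + stringDistance.getClass().getName() + " and "
              + checker.getStringDistance().getClass().getName());
    }

As James notes, relaxing the check only stops construction from failing; it does not guarantee the combined spellcheckers will produce sensible merged results.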
Re: Differentiating user search term in Solr
On 4/16/2015 7:09 AM, Steven White wrote: I cannot use the escapeQueryChars method because my app interacts with Solr via REST. The summary of your email is: clients must escape the search string to prevent Solr from failing. It would be a nice addition to Solr to provide a new query parameter that tells it to treat the query text as literal text. Doing so means you remove the burden placed on clients to understand and escape reserved Solr / Lucene tokens.

That's a good idea, although we might already have that. I wonder what happens if you include defType=term with your request? That works for edismax; it might work for other query parsers, at least on the q parameter. Thanks, Shawn
Re: check If I am Still Leader
On 4/16/2015 7:08 AM, Adir Ben Ami wrote: I am using Solr 4.10.0 with Tomcat and embedded Zookeeper. I use SolrCloud in my system. Each shard machine tries to reach/connect with the other cluster machines in order to index the document; it just checks if it is still the leader. I don't use replication, so why does it have to check who is the leader? How can I bypass this constraint and make my SolrCloud not use ClusterStateUpdater.checkIfIamStillLeader when I am indexing?

You might not need that functionality, but Solr must address the general case, which includes multiple replicas for each shard, where one of them will be the leader. I hope this is a test installation ... running in production without fault tolerance is a bad idea. Using the embedded zookeeper in production is another bad idea, for the same reason - fault tolerance. You can file an issue in Jira for a configuration mode where the leader check is disabled. I would oppose having that happen automatically ... another replica could be added to the cloud at any time. Thanks, Shawn
Re: check If I am Still Leader
On 4/16/2015 7:42 AM, Adir Ben Ami wrote: I have not mentioned before that the index is always routed to a specific machine. Is there a way to avoid connectivity from the node to all other nodes?

That capability has been added in Solr 5.1.0. https://issues.apache.org/jira/browse/SOLR-6832 Thanks, Shawn
Batch collecting in PostFilter
Hi all, I am implementing a PostFilter following this article: https://lucidworks.com/blog/custom-security-filtering-in-solr/ We have a requirement to call the external system only once for all the documents (max 200), so below is my change:
- don't call super.collect(docId) in the collect method of the PostFilter, but store all docIds in an internal map
- call the external system in finish(), then call super.collect(docId) for all the docs that pass the external filtering
The problem I have: docId exceeds maxDoc (docID must be >= 0 and < maxDoc=10 (got docID=123456)). I suspect I am storing local docIds, and when the reader is changed, docBase is also changed, so the global docId, which I believe is constructed in super.collect() using the docId parameter and docBase, becomes incorrect. Could anyone point me in the right direction? Thanks, -Ha
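The self-diagnosis at the end is correct: collect() receives segment-local ids, and replaying them later against a different reader context corrupts them. A sketch of the batching pattern, written against the Solr 5.x DelegatingCollector API (method names differ slightly in 4.x; externalSystemAllows(...) is a hypothetical stand-in for the single external call, and score propagation is omitted):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.solr.search.DelegatingCollector;

    public class BatchingCollector extends DelegatingCollector {

        private static final class Buffered {
            final LeafReaderContext ctx;
            final int globalDoc;               // docBase + local id
            Buffered(LeafReaderContext ctx, int globalDoc) {
                this.ctx = ctx;
                this.globalDoc = globalDoc;
            }
        }

        private final List<Buffered> buffer = new ArrayList<>();
        private LeafReaderContext currentContext;

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            currentContext = context;          // remember the segment and its docBase
            super.doSetNextReader(context);
        }

        @Override
        public void collect(int doc) throws IOException {
            // Buffer the GLOBAL id; nothing is passed to the delegate yet.
            buffer.add(new Buffered(currentContext, currentContext.docBase + doc));
        }

        @Override
        public void finish() throws IOException {
            Set<Integer> allowed = externalSystemAllows(buffer);   // ONE batched call
            LeafReaderContext last = null;
            for (Buffered b : buffer) {        // buffer is already in segment order
                if (!allowed.contains(b.globalDoc)) continue;
                if (b.ctx != last) {
                    super.doSetNextReader(b.ctx);  // point the delegate at the right segment
                    last = b.ctx;
                }
                super.collect(b.globalDoc - b.ctx.docBase);  // delegate expects a LOCAL id
            }
            super.finish();                    // let downstream DelegatingCollectors finish
        }

        // Hypothetical: ask the external system which of the buffered docs pass.
        private Set<Integer> externalSystemAllows(List<Buffered> docs) {
            throw new UnsupportedOperationException("call the external system here");
        }
    }

If anything downstream needs scores (e.g. sorting by score), a replay scorer has to be set on the delegate as well; that is left out here to keep the segment/docBase bookkeeping visible.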
Re: Differentiating user search term in Solr
On 4/16/2015 7:49 AM, Steven White wrote: defType didn't work: http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&defType=lucene Gave me error: org.apache.solr.search.SyntaxError: Expected identifier at pos 27 str='{!q.op=AND df=text solr sys' Is my use of defType correct?

If everything is at defaults and you don't have defType in the handler definition, then defType=lucene doesn't do anything - it specifically says use the lucene parser, which is the default. You want defType=term instead. Thanks, Shawn
Re: How can I temporarily detach node from SolrCloud?
On 4/16/2015 8:27 AM, Oded Sofer wrote: How can I detach a node from SolrCloud (temporarily, for maintenance and such, and attach it back after some time)? We are using SolrCloud 4.10.0; one collection, and one shard per node. The add-index is routed to a specific machine based on our customized routing logic (kind of hard-coded).

I assume this is just one replica out of multiple ... if that's the case, just shut the node down, do your maintenance, and bring it back online. SolrCloud will automatically make sure the index replica(s) on the node are brought up to date to match the others. If it's not one replica of multiple (that is, if it has the only copy of one or more shards), then shutting it down will either reduce your result set or cause queries to return an error, not sure which. Thanks, Shawn
Conditional Filter Queries
Hi, I want to filter my search results by different date fields based on content type. In other words: if contentType is A, filter out results that are older than 1 year; if contentType is B, filter out results that are older than 2 years; otherwise, date does not matter. Is that possible with fq parameters? Would it be something like fq=(contentType:A AND startDate:[NOW-1YEAR TO NOW]) OR (contentType:B AND startDate:[NOW-2YEARS TO NOW]) OR !contentType:(A OR B) Is there a better way to do this? Thanks, Jing
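Roughly, yes; two general Lucene/Solr query-syntax caveats apply here (these are not from a reply in this thread). A purely negative clause inside an OR usually needs an explicit *:* to subtract from, and rounding NOW (e.g. NOW/DAY) makes the filter cacheable across requests. One hedged way to write it, assuming no startDate values lie in the future:

    fq=(contentType:A AND startDate:[NOW/DAY-1YEAR TO *]) OR (contentType:B AND startDate:[NOW/DAY-2YEARS TO *]) OR (*:* -contentType:(A OR B))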
Re: Differentiating user search term in Solr
What is term in defType=term? Do you mean the raw word term or something else? Because I tried that too, in two different ways. Using correct Solr syntax:

http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text}%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&defType=term

This throws a NullPointerException:

java.lang.NullPointerException
    at org.apache.solr.schema.IndexSchema$DynamicReplacement$DynamicPattern$NameEndsWith.matches(IndexSchema.java:1033)
    at org.apache.solr.schema.IndexSchema$DynamicReplacement.matches(IndexSchema.java:1047)
    at org.apache.solr.schema.IndexSchema.dynFieldType(IndexSchema.java:1303)
    at org.apache.solr.schema.IndexSchema.getFieldTypeNoEx(IndexSchema.java:1280)
    at org.apache.solr.search.TermQParserPlugin$1.parse(TermQParserPlugin.java:56)
    at

And when I try it with invalid Solr search syntax:

http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&defType=term

This gives me the SyntaxError: org.apache.solr.search.SyntaxError: Expected identifier at pos 27 str='{!q.op=AND df=text solr sys' What am I missing? Steve
Re: Differentiating user search term in Solr
On 4/16/2015 9:37 AM, Steven White wrote: What is term in the defType=term, do you mean the raw word term or something else? Because I tried that too in two different ways: Oops. I forgot that the term query parser (that's what term means -- the name of the query parser) requires that you specify the field you are searching on, so that would be incomplete. Try also setting the f parameter to the field that you want to search. I will not be surprised if that doesn't work, though. Thanks, Shawn
Re: Merge indexes in MapReduce
You're stating two things that are somewhat antithetical:
1: We have real-time search, and
2: want to merge (and optimize) its indexes into one
Needing to merge indexes implies (to me at least) that you're not really doing NRT processing, as docs in the batch you're merging into your collection aren't searchable, and thus not NRT. I'm probably missing something obvious in your problem statement. The MapReduceIndexerTool probably doesn't quite do what you want, as its purpose is to add documents to the index and merge at the end... You might get some value from the core admin API MERGEINDEXES call: https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-MERGEINDEXES But you have to be careful in a sharded situation to merge exactly correctly. Plus, merging indexes does NOT replace documents with a particular uniqueKey that happen to be in both the source and dest indexes. I wouldn't worry too much about optimization; despite its name, it's largely irrelevant at this point unless you have a bunch of deleted documents in your index. Best, Erick

On Thu, Apr 16, 2015 at 4:14 AM, Norgorn lsunnyd...@mail.ru wrote: Is there a ready-to-use tool to merge existing indexes in map-reduce? We have real-time search and want to merge (and optimize) its indexes into one, so we don't need to build the index in Map-Reduce, but only merge it.
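For reference, the Core Admin MERGEINDEXES call looks like this (host, core names, and paths are illustrative; the target core must already exist, and the source indexes must not be receiving writes during the merge):

    http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=targetCore&indexDir=/data/index1&indexDir=/data/index2

or, merging from existing cores instead of raw index directories:

    http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=targetCore&srcCore=core1&srcCore=core2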
Re: Differentiating user search term in Solr
On 4/16/2015 10:10 AM, Steven White wrote: I don't follow what the f parameter is. Do you have a link where I can read more about it? I found this https://wiki.apache.org/solr/HighlightingParameters and https://wiki.apache.org/solr/SimpleFacetParameters but I'm not sure this is what you mean (I'm not doing highlighting or faceting).

It looks like this isn't going to work. I just tried it on my index. To see the reasoning behind what I was suggesting, click here: https://cwiki.apache.org/confluence/display/solr/Other+Parsers And then click on Term Query Parser in the third column of the list at the top of that page. The syntax for the localparams on this one is {!term f=field}querytext ... so I was hoping that f would work as a URL parameter, but from the test I just did on Solr 4.9.1, that's not the case. Thanks, Shawn
Re: 1:M connectivity
You say the SolrCloud API. Not entirely sure what that is, do you mean the post.jar tool? Because to get much more scalable throughput, you probably want to use SolrJ and the CloudSolrServer class. That class takes a connection to Zookeeper and does the right thing. Best, Erick On Thu, Apr 16, 2015 at 7:19 AM, Oded Sofer odedso...@yahoo.com.invalid wrote: Given that the index are always routed to specific machine, is there a way to avoid connectivity from the node to all other node. We are using Solr 4.10; the Add/Update Index uses SolrCloud API and always added to the node that get API request for add-index (i.e., we are sending the add index to the appropriate node that should get it).
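A minimal SolrJ sketch of what Erick describes, assuming Solr 4.10-era SolrJ (the ZooKeeper addresses, collection name, and field values are placeholders):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexingExample {
        public static void main(String[] args) throws Exception {
            // Point SolrJ at the ZooKeeper ensemble, not at any one Solr node
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "example");

            // CloudSolrServer routes the document to the correct shard leader
            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }

Because the client reads cluster state from ZooKeeper, it keeps working through leader changes without any hard-coded node list.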
Re: SolrCloud - Collection Browsing
Check that your config has a valid path to the velocity contrib. You should see something like <lib dir="${solr.install.dir:../../..}/contrib/velocity/lib" regex=".*\.jar" /> (from Solr 4.10), and you should also see the indicated file on each of your Solr nodes. What's the full stack BTW? I'm expecting something like a class-not-found error somewhere down in the stack. Best, Erick On Thu, Apr 16, 2015 at 3:21 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I have set up a SolrCloud on 3 machines - machine1, machine2 and machine3. The DirectoryFactory used is HDFS, where the collection index data is stored in HDFS within a Hadoop cluster. SolrCloud has been set up successfully and everything looks fine so far. I have uploaded the default configuration, i.e. the conf folder under the example/collection1 folder in the Solr installation directory, into Zookeeper. Essentially, I have uploaded the default configuration into Zookeeper. Now when I log in to Solr Admin using http://machine1:8983/solr/admin, I am able to see the SolrAdmin page and when I click on Cloud, I can see all the shards and replications properly in the browser. However, the issue comes when I try to open the page http://machine1:8983/solr/mycollection/browse. I am seeing an HTTP 500 lazy loading error. This looks like a trivial mistake somewhere, as the collection is set up fine and everything works normally. However, when I browse the collection, this error occurs. Even when I open http://machine1:8983/solr/mycollection/query I am getting the json response properly with numFound as 0. I was expecting similar behavior to how the /browse request provides the Solritas page. Note: I haven't changed any of the configuration in the conf directory. Should I modify solrconfig.xml to have a RequestHandler for /mycollection/browse, or is the default one sufficient? Can someone provide some pointers please to get this issue resolved? Thanks Regards Vijay -- The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail. The views expressed in this communication may not necessarily be the view held by WHISHWORKS.
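On the /browse question: the stock Solr 4.10 example solrconfig.xml already ships a handler definition along these lines (trimmed here, and exact defaults vary by release), so the config uploaded to ZooKeeper should contain something like it in addition to the velocity <lib> entries above:

    <requestHandler name="/browse" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="wt">velocity</str>
        <str name="v.template">browse</str>
        <str name="v.layout">layout</str>
        <str name="title">Solritas</str>
        <str name="defType">edismax</str>
        <str name="q.alt">*:*</str>
        <str name="rows">10</str>
        <str name="fl">*,score</str>
      </lst>
    </requestHandler>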
RE: Indexing PDF and MS Office files
If you use pdftotext with a simple fork/exec per document, you will get about 5 MB/s throughput on a single AMD x86_64. Much of that is because of the fork/exec. I suggest that you use HTML output and UTF-8 encoding for the PDF, because that way you can get title/keywords and such as http meta keywords. If you have the appetite for something truly great, try:
- Socket server listening for parsing requests
- pass off accept() sockets to pre-forked children
- in the children, use vfork rather than fork
- tmpfs for outputted HTML documents
- Tempting to implement using mod_perl and httpd, at least to me.
-Original Message- From: Siegfried Goeschl [mailto:sgoes...@gmx.at] Sent: Thursday, April 16, 2015 7:53 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction:
* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)
If you start command line tools from your JVM please have a look at commons-exec :-) Cheers, Siegfried Goeschl PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs and cater for that fact in your requirements :-) On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote: Erick, I tried indexing both ways - SolrJ / Tika's AutoParser as well as SolrCell's ExtractRequestHandler. The majority of the PDF and Word documents are getting parsed properly and indexed into Solr. However, a minority of them keep failing with either PDFParser or OfficeParser errors. Not sure if this behaviour can be modified so that all the documents can be indexed. The business requirement we have is to index all the documents. However, if a small percentage of them fails, not sure what other ways exist to index them. Any help please? Thanks Regards Vijay On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote: There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137 But, I personally am not a huge fan of pushing all the work onto Solr; in a production environment the Solr server is responsible for indexing, parsing the docs through Tika, perhaps searching etc. This doesn't scale all that well. So an alternative is to use SolrJ with Tika, which is totally independent of what version of Tika is on the Solr server. Here's an example: http://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Thanks everyone for the responses. Now I am able to index PDF documents successfully. I have implemented manual extraction using Tika's AutoParser and the PDF functionality is working fine. However, the error with some MS Office Word documents still persists. The error message is java.lang.IllegalArgumentException: This paragraph is not the first one in the table which will eventually result in Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser Upon some reading, it looks like it's a bug with Tika 1.5 and seems to have been fixed with Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ). I am new to Solr / Tika and hence wondering whether I can change the Tika library alone to v1.6 without impacting any of the libraries within Solr 4.10.2? Please let me know your response and how to get around this issue. Many thanks in advance.
Thanks Regards Vijay On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote: Vijay, You could try different Excel files with different formats to rule out whether the issue is with the Tika version being used. Thanks Murthy On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com wrote: Perhaps the PDF is protected and the content can not be extracted? I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may not support some/all Office 2013 document formats. On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/ Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content as a bitmap image, so no text is extracted. -- Jack Krupansky On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following issues. Request to please let me know what is going wrong with the indexing process. I am using Solr 4.10.2 and using the default example server configuration that
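A rough commons-exec sketch of the pdftotext fallback Siegfried mentions above (file paths are illustrative; -htmlmeta asks pdftotext for HTML output including meta title/keywords):

    import org.apache.commons.exec.CommandLine;
    import org.apache.commons.exec.DefaultExecutor;

    public class PdfToTextRunner {
        public static void main(String[] args) throws Exception {
            CommandLine cmd = new CommandLine("pdftotext");
            cmd.addArgument("-htmlmeta");             // HTML output, including <meta> title/keywords
            cmd.addArgument("-enc");
            cmd.addArgument("UTF-8");
            cmd.addArgument("/data/in/sample.pdf");   // illustrative input path
            cmd.addArgument("/data/out/sample.html"); // illustrative output path

            DefaultExecutor executor = new DefaultExecutor();
            // Throws ExecuteException on a non-zero exit value
            int exitValue = executor.execute(cmd);
            System.out.println("pdftotext exit value: " + exitValue);
        }
    }

Note this still pays the fork/exec cost per document that the message above warns about; the socket-server design it sketches is the way around that.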
RE: Indexing PDF and MS Office files
Indeed. Another solution is to purchase ABBYY or Nuance as a server, and have them do that work. You will even get OCR. Both offer a Linux SDK. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, April 16, 2015 7:56 AM To: solr-user@lucene.apache.org Subject: RE: Indexing PDF and MS Office files +1 :) PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs and cater for that fact in your requirements :-)
Re: Differentiating user search term in Solr
I don't follow what the f parameter is. Do you have a link where I can read more about it? I found this https://wiki.apache.org/solr/HighlightingParameters and https://wiki.apache.org/solr/SimpleFacetParameters but I'm not sure this is what you mean (I'm not doing highlighting or faceting). Thanks Steve On Thu, Apr 16, 2015 at 11:54 AM, Shawn Heisey apa...@elyograg.org wrote: On 4/16/2015 9:37 AM, Steven White wrote: What is term in the defType=term, do you mean the raw word term or something else? Because I tried that too in two different ways: Oops. I forgot that the term query parser (that's what term means -- the name of the query parser) requires that you specify the field you are searching on, so that would be incomplete. Try also setting the f parameter to the field that you want to search. I will not be surprised if that doesn't work, though. Thanks, Shawn
Re: check If I am Still Leader
bq: I don't use replication so why does it have to check who is the leader Because the doc must be routed to the correct shard, and the shard leader is the machine that coordinates the indexing for that shard. I really question whether this is a fruitful course for you to take. What specific problems are you trying to solve here? Because trying to take control at this level really shouldn't be done unless and until you have a problem that's causing you grief; it's just a waste of energy until then IMO. Best, Erick On Thu, Apr 16, 2015 at 7:59 AM, Shawn Heisey apa...@elyograg.org wrote: On 4/16/2015 7:42 AM, Adir Ben Ami wrote: I have not mentioned before that the index is always routed to a specific machine. Is there a way to avoid connectivity from the node to all other nodes? That capability has been added in Solr 5.1.0. https://issues.apache.org/jira/browse/SOLR-6832 Thanks, Shawn
Re: Differentiating user search term in Solr
On 4/16/2015 10:18 AM, Shawn Heisey wrote: On 4/16/2015 10:10 AM, Steven White wrote: I don't follow what the f parameter is. Do you have a link where I can read more about it? I found this https://wiki.apache.org/solr/HighlightingParameters and https://wiki.apache.org/solr/SimpleFacetParameters but I'm not sure this is what you mean (I'm not doing highlighting or faceting). It looks like this isn't going to work. I just tried it on my index. I filed an enhancement issue. It might never happen, but it's in the system. https://issues.apache.org/jira/browse/SOLR-7410 Thanks, Shawn
Re: Differentiating user search term in Solr
Thanks for trying, Shawn. Looks like I have to escape the string on my client side (this isn't a clean design and can lead to errors if not all reserved tokens are escaped). I hope folks from @dev are reading this and consider adding a parameter to tell Solr the text is raw text. Steve On Thu, Apr 16, 2015 at 12:18 PM, Shawn Heisey apa...@elyograg.org wrote: On 4/16/2015 10:10 AM, Steven White wrote: I don't follow what the f parameter is. Do you have a link where I can read more about it? I found this https://wiki.apache.org/solr/HighlightingParameters and https://wiki.apache.org/solr/SimpleFacetParameters but I'm not sure this is what you mean (I'm not doing highlighting or faceting). It looks like this isn't going to work. I just tried it on my index. To see the reasoning behind what I was suggesting, click here: https://cwiki.apache.org/confluence/display/solr/Other+Parsers And then click on Term Query Parser in the third column of the list at the top of that page. The syntax for the localparams on this one is {!term f=field}querytext ... so I was hoping that f would work as a URL parameter, but from the test I just did on Solr 4.9.1, that's not the case. Thanks, Shawn
Re: How can I temporarily detach node from SolrCloud?
bq: shutting it down will either reduce your result set or cause queries to return an error Setting shards.tolerant=true will reduce your result set. If you don't set that and all replicas of a shard are down, you'll get an error. And indexing won't work if all the replicas for a shard are down. Best, Erick On Thu, Apr 16, 2015 at 7:46 AM, Shawn Heisey apa...@elyograg.org wrote: On 4/16/2015 8:27 AM, Oded Sofer wrote: How can I detach a node from SolrCloud (temporarily, for maintenance and such, and attach it back after some time)? We are using SolrCloud 4.10.0; one collection, and a shard per node. The add-index is routed to a specific machine based on our custom routing logic (kind of hard-coded) I assume this is just one replica out of multiple ... if that's the case, just shut the node down, do your maintenance, and bring it back online. SolrCloud will automatically make sure the index replica(s) on the node are brought up to date to match the others. If it's not one replica of multiple (that is, if it has the only copy of one or more shards), then shutting it down will either reduce your result set or cause queries to return an error, not sure which. Thanks, Shawn
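For example (collection name is a placeholder): http://localhost:8983/solr/mycollection/select?q=*:*&shards.tolerant=true returns partial results from whatever shards are still up instead of an error.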
Re: Indexing PDF and MS Office files
For MS Word documents, one common pattern for all failed documents I noticed is that all of them contain embedded images (like scanned signature images embedded into the documents. These documents are much like some letterheads where someone scanned the signature image and then embedded it into the document along with the text) within the documents. For other documents which completed successfully, no images were present. Just wondering if these are causing the issue. Thanks Regards Vijay On 16 April 2015 at 12:58, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Thanks Tim. I shall raise a Jira with the stack trace information. Thanks Regards Vijay On 16 April 2015 at 12:54, Allison, Timothy B. talli...@mitre.org wrote: This sounds like a Tika issue, let's move discussion to that list. If you are still having problems after you upgrade to Tika 1.8, please at least submit the stack traces (if you can) to the Tika jira. We may be able to find a document that triggers that stack trace in govdocs1 or the slice of CommonCrawl that Julien Nioche contributed to our eval effort. Tika is not perfect and it will fail on some files, but we are always working to improve it. Best, Tim -Original Message- From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:44 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Thanks Allison. I tried with the mentioned changes. But still no luck. I am using the code from the lucidworks site provided by Erick and have now included the changes mentioned by you. But still the issue persists, with a small percentage of documents (both PDF and MS Office documents) failing. Unfortunately, these documents are proprietary and client-confidential and hence I am not sure whether they can be uploaded into Jira. These files normally open in Adobe Reader and MS Office tools. Thanks Regards Vijay On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote: I entirely agree with Erick -- it is best to isolate Tika in its own jvm if you can -- bad things can happen if you don't [1] [2]. Erick's blog on SolrJ is fantastic. If you want to have Tika parse embedded documents/attachments, make sure to set the parser in the ParseContext before parsing:

ParseContext context = new ParseContext();
// add this line:
context.set(Parser.class, _autoParser);
InputStream input = new FileInputStream(file);

Tika 1.8 is soon to be released. If that doesn't fix your problems, please submit stacktraces (and docs, if possible) to the Tika jira, and we'll try to make the fixes. Cheers, Tim [1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf [2] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf -Original Message- From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:10 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Erick, I tried indexing both ways - SolrJ / Tika's AutoParser as well as SolrCell's ExtractRequestHandler. The majority of the PDF and Word documents are getting parsed properly and indexed into Solr. However, a minority of them keep failing with either PDFParser or OfficeParser errors. Not sure if this behaviour can be modified so that all the documents can be indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways exist to index them. Any help please? Thanks Regards Vijay On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote: There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137 But, I personally am not a huge fan of pushing all the work onto Solr; in a production environment the Solr server is responsible for indexing, parsing the docs through Tika, perhaps searching etc. This doesn't scale all that well. So an alternative is to use SolrJ with Tika, which is totally independent of what version of Tika is on the Solr server. Here's an example: http://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Thanks everyone for the responses. Now I am able to index PDF documents successfully. I have implemented manual extraction using Tika's AutoParser and the PDF functionality is working fine. However, the error with some MS Office Word documents still persists. The error message is java.lang.IllegalArgumentException: This paragraph is not the first one in the table which will
Re: Indexing PDF and MS Office files
On 16/04/2015 12:53, Siegfried Goeschl wrote: Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction:
* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)
Here are some file extractors we built a while ago: https://github.com/flaxsearch/flaxcode/tree/master/flax_filters You might find them useful: they use a number of external programs including pdf2text and headless Open Office. Cheers Charlie If you start command line tools from your JVM please have a look at commons-exec :-) Cheers, Siegfried Goeschl PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs and cater for that fact in your requirements :-) On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote: Erick, I tried indexing both ways - SolrJ / Tika's AutoParser as well as SolrCell's ExtractRequestHandler. The majority of the PDF and Word documents are getting parsed properly and indexed into Solr. However, a minority of them keep failing with either PDFParser or OfficeParser errors. Not sure if this behaviour can be modified so that all the documents can be indexed. The business requirement we have is to index all the documents. However, if a small percentage of them fails, not sure what other ways exist to index them. Any help please? Thanks Regards Vijay On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote: There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137 But, I personally am not a huge fan of pushing all the work onto Solr; in a production environment the Solr server is responsible for indexing, parsing the docs through Tika, perhaps searching etc. This doesn't scale all that well. So an alternative is to use SolrJ with Tika, which is totally independent of what version of Tika is on the Solr server. Here's an example: http://lucidworks.com/blog/indexing-with-solrj/ Best, Erick On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Thanks everyone for the responses. Now I am able to index PDF documents successfully. I have implemented manual extraction using Tika's AutoParser and the PDF functionality is working fine. However, the error with some MS Office Word documents still persists. The error message is java.lang.IllegalArgumentException: This paragraph is not the first one in the table which will eventually result in Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser Upon some reading, it looks like it's a bug with Tika 1.5 and seems to have been fixed with Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ). I am new to Solr / Tika and hence wondering whether I can change the Tika library alone to v1.6 without impacting any of the libraries within Solr 4.10.2? Please let me know your response and how to get around this issue. Many thanks in advance. Thanks Regards Vijay On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote: Vijay, You could try different Excel files with different formats to rule out whether the issue is with the Tika version being used. Thanks Murthy On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com wrote: Perhaps the PDF is protected and the content can not be extracted? I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may not support some/all Office 2013 document formats.
On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/ Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content as a bitmap image, so no text is extracted. -- Jack Krupansky On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following issues. Request to please let me know what is going wrong with the indexing process. I am using Solr 4.10.2 and using the default example server configuration that comes with the Solr distribution. PDF Files - Indexing as such works fine, but when I query using *.* in the Solr Query console, metadata information is displayed properly. However, the PDF content field is empty. This is happening for all PDF files I have tried. I have tried with some proprietary files, PDF eBooks etc. Whatever the PDF file, the content is not being displayed. MS Office files - For some Office files, everything works perfectly and the extracted content is visible in the query console. However, for others, I see the below error message during the indexing process. *Exception in thread
How can I temporarily detach node from SolrCloud?
How can I detach a node from SolrCloud (temporarily, for maintenance and such, and attach it back after some time)? We are using SolrCloud 4.10.0; one collection, and a shard per node. The add-index is routed to a specific machine based on our custom routing logic (kind of hard-coded)
Re: Differentiating user search term in Solr
Thanks Shawn. I cannot use the escapeQueryChars method because my app interacts with Solr via REST. The summary of your email is: clients must escape the search string to prevent Solr from failing. It would be a nice addition to Solr to provide a new query parameter that tells it to treat the query text as literal text. Doing so means you remove the burden placed on clients to understand and escape reserved Solr / Lucene tokens. Steve On Wed, Apr 15, 2015 at 7:18 PM, Shawn Heisey apa...@elyograg.org wrote: On 4/15/2015 3:54 PM, Steven White wrote: Hi folks, If a user types in the search box (without quotes): {!q.op=AND df=text solr sys and I take that text and build the URL like so: http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true This will fail with Expected identifier because it is not valid Solr syntax. That isn't valid syntax for the lucene query parser ... the localparams are not closed (it would require a } character), and after the localparams there would need to be some additional text. My question is this: is there a flag I can send to Solr with the URL telling it to treat what's in q as raw text vs. having it process it as Solr syntax? If not, then it means I have to escape all Solr reserved characters and words. If so, where can I find the complete list? Also, what happens when a new reserved character or word is added to Solr down the road? It means I have to upgrade my application too, which is something I would like to avoid. One way to treat the entire input as literal text is to use the term query parser ... but that requires the localparams syntax, and I do not know exactly what is going to happen if you use a query string that itself is localparams syntax -- {! other params} ... so escaping is probably safer. https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser The other way to handle it is to escape every special character with a backslash. The escapeQueryChars method in SolrJ is always kept up to date, and can escape every special character. http://lucene.apache.org/solr/4_10_3/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29 The javadoc for that method points to the queryparser syntax for more info on characters that need escaping. Scroll to the very end of this page: http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true That page lists || and && rather than just the single characters | and & ... the escapeQueryChars method in SolrJ will escape both characters, as it only works at the character level, not the string level. If you want the *spaces* in your query to be treated literally also, you must escape them too. The escapeQueryChars method I've mentioned will NOT escape spaces. Note that this does not cover URL escaping -- the & character must be sent as %26 or the servlet container will treat it as a special character, before it even gets to Solr. Thanks, Shawn
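For clients that can use SolrJ, the escaping Shawn describes is a single call (a sketch; the input string is illustrative):

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeExample {
        public static void main(String[] args) {
            String userInput = "{!q.op=AND df=text} solr sys";
            // Backslash-escapes the query-syntax characters listed in the javadoc,
            // e.g. "{" becomes "\{" and ":" becomes "\:"
            String escaped = ClientUtils.escapeQueryChars(userInput);
            System.out.println(escaped);
        }
    }

Non-Java clients can port the short character list from that method; it works purely character by character.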
RE: check If I am Still Leader
I have not mentioned before that the index is always routed to a specific machine. Is there a way to avoid connectivity from the node to all other nodes? From: adi...@hotmail.com To: solr-user@lucene.apache.org Subject: check If I am Still Leader Date: Thu, 16 Apr 2015 16:08:15 +0300 Hi, I am using Solr 4.10.0 with Tomcat and embedded Zookeeper. I use SolrCloud in my system. Each shard machine tries to reach/connect with other cluster machines in order to index the document; it just checks if it is still the leader. I don't use replication, so why does it have to check who is the leader? How can I bypass this constraint and make my SolrCloud not use ClusterStateUpdater.checkIfIamStillLeader when I am indexing? Thanks, Adir.
1:M connectivity
Given that the index is always routed to a specific machine, is there a way to avoid connectivity from the node to all other nodes? We are using Solr 4.10; the Add/Update Index uses the SolrCloud API and is always added to the node that gets the API request for add-index (i.e., we are sending the add-index to the appropriate node that should get it).
Re: Differentiating user search term in Solr
defType didn't work: http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&defType=lucene Gave me error: org.apache.solr.search.SyntaxError: Expected identifier at pos 27 str='{!q.op=AND df=text solr sys' Is my use of defType correct? Steve On Thu, Apr 16, 2015 at 9:15 AM, Shawn Heisey apa...@elyograg.org wrote: On 4/16/2015 7:09 AM, Steven White wrote: I cannot use the escapeQueryChars method because my app interacts with Solr via REST. The summary of your email is: clients must escape the search string to prevent Solr from failing. It would be a nice addition to Solr to provide a new query parameter that tells it to treat the query text as literal text. Doing so means you remove the burden placed on clients to understand and escape reserved Solr / Lucene tokens. That's a good idea, although we might already have that. I wonder what happens if you include defType=term with your request? That works for edismax, it might work for other query parsers, at least on the q parameter. Thanks, Shawn
custom search component on solrcloud
Hi, Apologies for sending this again. I am trying to port my non-SolrCloud custom search handler to a SolrCloud one. I have read the WritingDistributedSearchComponents http://wiki.apache.org/solr/WritingDistributedSearchComponents wiki page and looked at the Terms and QueryComponent code, but the control flow of execution is still fuzzy (even given the "distributed algorithm" description). Concretely, I have a non-SolrCloud algorithm that, given a sequence of tokens T, would:
1- split T into single tokens
2- foreach token t_i, get all the DocList for t_i by executing rb.req.getSearcher().getDocList in the process() method of the custom search component
3- do some magic on the collection of doclists
My question is how can I:
1) do the splitting (step 1 above) in a single shard, and
2) distribute the getDocList for each token t_i to all shards
3) wait till I have all the doclists from all shards, then
4) do something with the results, in the original calling shard (step 1 above).
Thank you for your help
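The flow is easier to see as a skeleton. A very rough sketch of the SearchComponent hooks involved (the hook names are from the 4.x/5.x SearchComponent API; all actual logic is left out, and the comments map the hooks onto the numbered steps above):

    import java.io.IOException;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;
    import org.apache.solr.handler.component.ShardRequest;

    public class TokenDocListComponent extends SearchComponent {

        @Override
        public void prepare(ResponseBuilder rb) throws IOException {
            // Runs on every shard; cheap setup only.
        }

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            // Runs on each shard for the shard-local request: this is where
            // the per-token rb.req.getSearcher().getDocList(...) calls live
            // (step 2), with results packed into rb.rsp for the coordinator.
        }

        @Override
        public int distributedProcess(ResponseBuilder rb) throws IOException {
            // Runs only on the coordinating node. The splitting of T (step 1)
            // happens here; create ShardRequests to fan the tokens out.
            return ResponseBuilder.STAGE_DONE;
        }

        @Override
        public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
            // Coordinating node: called as each shard's response arrives;
            // accumulate the per-token doclists (step 3).
        }

        @Override
        public void finishStage(ResponseBuilder rb) {
            // Coordinating node: all shard responses for the stage are in;
            // do the "magic" over the combined doclists (step 4).
        }

        @Override
        public String getDescription() { return "per-token doclist component"; }

        @Override
        public String getSource() { return null; }
    }

The key point is that process() runs per shard, while distributedProcess(), handleResponses(), and finishStage() run only on the node that received the original request.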
Re: Spurious _version_ conflict?
: I notice that the expected value in the error message matches both what : I pass in and the index contents. But the actual value in the error : message is different only in the last (low order) two digits. : Consistently. what does your client code look like? Are you sure you aren't being bit by a JSON parsing library that can't handle long values and winds up truncating them? https://issues.apache.org/jira/browse/SOLR-6364 -Hoss http://www.lucidworks.com/
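If a double-based parser is the culprit, the precision loss is easy to demonstrate (the value below is the smallest long a double cannot represent exactly; real _version_ values, around 1.5e18, are far beyond that threshold):

    public class VersionPrecision {
        public static void main(String[] args) {
            long version = 9007199254740993L;     // 2^53 + 1
            double asDouble = version;            // what a double-based JSON parser stores
            System.out.println(version);          // 9007199254740993
            System.out.println((long) asDouble);  // 9007199254740992 -- low digit lost
        }
    }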
Re: Differentiating user search term in Solr
: The summary of your email is: clients must escape the search string to prevent : Solr from failing. : : It would be a nice addition to Solr to provide a new query parameter that : tells it to treat the query text as literal text. Doing so means you : remove the burden placed on clients to understand and escape reserved Solr : / Lucene tokens. i'm a little lost as to what exactly you want to do here -- but i'm going to focus on your thesis statement here, and assume that you want to search on a literal piece of text and you don't want to have to worry about escaping any characters and you don't want solr to treat any part of the query string as special. the only way something like that works is if you only want to search a single field -- searching multiple fields, searching multiple clauses, etc... none of those types of options make sense in this context. people have already mentioned the term parser -- which is fine if you want to search for exactly one literal term, but as a more general solution, what people usually want, is the field parser -- which works better with TextFields in general... https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FieldQueryParser Just like the comment you've seen about the term parser needing an f localparam to specify the field, the same is true for the field parser. but variable references make this trivial to specify -- instead of using the full {!field f=myfield}Foo Bar syntax in your q param, you can use an alternate param (qq is common in many examples) for the raw data from the user... q={!field f=myfield v=$qq} qq=whatever your user types https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries -Hoss http://www.lucidworks.com/
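Putting that together as a full request (collection and field names are placeholders): http://localhost:8983/solr/db/select?q={!field%20f=title%20v=$qq}&qq=Apache%3A%20Solr%20Notes&wt=xml - the colon in qq needs no backslash-escaping, because the field parser never treats the qq value as query syntax (only normal URL-encoding applies).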
Re: Solr 5.x deployment in production
Thanks Karl. In my case, I have to deploy Solr on Windows, AIX, and Linux (all server editions). We are a WebSphere shop; moving away from it means I have to deal with politics and culture. For Windows, I cannot use NSSM, so I have to figure out a solution for managing Solr (at least start-up and shutdown). If anyone has experience in this area (now that Solr is not in a WAS profile managed by Windows services) and can share your experience, please do. Thanks. Steve On Thu, Apr 16, 2015 at 3:49 PM, Karl Kildén karl.kil...@gmail.com wrote: I asked a very similar question recently. You should switch to using the package as is and forget that it contains a .war. The war is now an internal component. Also switch to the new script for startup etc. I have seen several disappointed users that disagree with this decision but I assume the project now has more freedom in the future and also more alignment and focus on one experience. I did my own thing with NSSM because we use windows and I am satisfied. On 16 April 2015 at 21:36, Steven White swhite4...@gmail.com wrote: Hi folks, With Solr 5.0, the WAR file is deprecated and I see Jetty is included with Solr. What if I have my own Web server into which I need to deploy Solr, how do I go about doing this correctly without messing things up and making sure Solr works? Or is this not recommended and Jetty is the way to go, no questions asked? Thanks Steve
Re: 1:M connectivity
Right, we are using that. The issue is the firewall setting needed for the cloud. We do not want to open all nodes to all others nodes. However, we found that add-index to a specific node tries to access all other nodes though we set it to index locally on that node only. On Apr 16, 2015 7:19 PM, Erick Erickson erickerick...@gmail.com wrote: You say the SolrCloud API. Not entirely sure what that is, do you mean the post.jar tool? Because to get much more scalable throughput, you probably want to use SolrJ and the CloudSolrServer class. That class takes a connection to Zookeeper and does the right thing. Best, Erick On Thu, Apr 16, 2015 at 7:19 AM, Oded Sofer odedso...@yahoo.com.invalid wrote: Given that the index are always routed to specific machine, is there a way to avoid connectivity from the node to all other node. We are using Solr 4.10; the Add/Update Index uses SolrCloud API and always added to the node that get API request for add-index (i.e., we are sending the add index to the appropriate node that should get it).
Spurious _version_ conflict?
Hi All, I have been getting intermittent 409 conflict responses to updates. I check and double-check that the _version_ I am passing in matches the current value in the index. I notice that the expected value in the error message matches both what I pass in and the index contents. But the actual value in the error message is different only in the last (low order) two digits. Consistently. I noticed a similar report a while back: http://lucene.472066.n3.nabble.com/Version-Conflict-on-Atomic-Update-td4083587.html Any thoughts? Thanks, Charlie * This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
Re: generate uuid/ id for table which do not have any primary key
Thanks Kaushik, Erick. Though I can populate uuid by using a combination of fields, I need to change the type to string, else it throws Invalid UUID String <field name="uuid" type="string" indexed="true" stored="true" required="true" multiValued="false"/> a) I will have ~80 million records and am wondering if performance might be an issue b) So, during update I can still use the combination of fields, i.e. uuid? On Thu, Apr 16, 2015 at 2:44 PM, Erick Erickson erickerick...@gmail.com wrote: This seems relevant: http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid Best, Erick On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote: You seem to have defined the field, but are not populating it in the query. Use a combination of fields to come up with a unique id that can be assigned to uuid. Does that make sense? Kaushik On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com wrote: How to generate a uuid/id (maybe in data-config.xml...) for a table which does not have any primary key? Scenario: Using DIH I need to import data from a database, but the table does not have any primary key. I do have uuid defined in schema.xml as <field name="uuid" type="uuid" indexed="true" stored="true" required="true" multiValued="false"/> <uniqueKey>uuid</uniqueKey> data-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource batchSize="2000" name="test" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@ldap:" user="myUser" password="pwd"/>
  <document>
    <entity name="test_entity" docRoot="true" dataSource="test" query="select name, age from test_user">
    </entity>
  </document>
</dataConfig>

Error: Document is missing mandatory uniqueKey field: uuid
Re: generate uuid/ id for table which do not have any primary key
Just wondering if there is a way to generate the uuid/id in data-config without using a combination of fields in the query... data-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource batchSize="2000" name="test" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@ldap:" user="myUser" password="pwd"/>
  <document>
    <entity name="test_entity" docRoot="true" dataSource="test" query="select name, age from test_user">
    </entity>
  </document>
</dataConfig>

On Thu, Apr 16, 2015 at 3:18 PM, Vishal Swaroop vishal@gmail.com wrote: Thanks Kaushik, Erick. Though I can populate uuid by using a combination of fields, I need to change the type to string, else it throws Invalid UUID String <field name="uuid" type="string" indexed="true" stored="true" required="true" multiValued="false"/> a) I will have ~80 million records and am wondering if performance might be an issue b) So, during update I can still use the combination of fields, i.e. uuid? On Thu, Apr 16, 2015 at 2:44 PM, Erick Erickson erickerick...@gmail.com wrote: This seems relevant: http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid Best, Erick On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote: You seem to have defined the field, but are not populating it in the query. Use a combination of fields to come up with a unique id that can be assigned to uuid. Does that make sense? Kaushik On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com wrote: How to generate a uuid/id (maybe in data-config.xml...) for a table which does not have any primary key? Scenario: Using DIH I need to import data from a database, but the table does not have any primary key. I do have uuid defined in schema.xml as <field name="uuid" type="uuid" indexed="true" stored="true" required="true" multiValued="false"/> <uniqueKey>uuid</uniqueKey> data-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource batchSize="2000" name="test" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@ldap:" user="myUser" password="pwd"/>
  <document>
    <entity name="test_entity" docRoot="true" dataSource="test" query="select name, age from test_user">
    </entity>
  </document>
</dataConfig>

Error: Document is missing mandatory uniqueKey field: uuid
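One approach that avoids deriving a key from the data entirely is to let Solr mint the UUID with an update processor chain. A sketch against Solr 4.x (the chain name is arbitrary, and the DIH handler must be pointed at the chain via update.chain):

<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">uuid</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>

Note the caveat from the Stack Overflow link Erick posted: a freshly generated UUID changes on every import, so re-imports create new documents rather than updating existing ones.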
Solr 5.x deployment in production
Hi folks, With Solr 5.0, the WAR file is deprecated and I see Jetty is included with Solr. What if I have my own Web server into which I need to deploy Solr, how do I go about doing this correctly without messing things up and making sure Solr works? Or is this not recommended and Jetty is the way to go, no questions asked? Thanks Steve
Re: Solr 5.x deployment in production
I asked a very similar question recently. You should switch to using the package as is and forget that it contains a .war. The war is now an internal component. Also switch to the new script for startup etc. I have seen several disappointed users that disagree with this decision but I assume the project now has more freedom in the future and also more alignment and focus on one experience. I did my own thing with NSSM because we use windows and I am satisfied. On 16 April 2015 at 21:36, Steven White swhite4...@gmail.com wrote: Hi folks, With Solr 5.0, the WAR file is deprecated and I see Jetty is included with Solr. What if I have my own Web server into which I need to deploy Solr, how do I go about doing this correctly without messing things up and making sure Solr works? Or is this not recommended and Jetty is the way to go, no questions asked? Thanks Steve
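For anyone else wiring this up on Windows, the NSSM route is roughly the following (service name and install path are placeholders; double-check the flags of bin\solr.cmd for your release, since the -f foreground flag is what keeps the service wrapper attached to the process):

    nssm install Solr5 "C:\solr-5.1.0\bin\solr.cmd" start -f -p 8983
    nssm start Solr5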
Re: Indexing PDF and MS Office files
Turning PDF back into a structured document is like trying to turn hamburger back into a cow. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. talli...@mitre.org wrote: +1 :) PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs and cater for that fact in your requirements :-)
SolrCloud Core Reload
Hi all, I have a SolrCloud cluster with 3 servers and there are many cores. Using the SolrCloud UI Admin Core, if I execute a core optimize (or reload), will all the cores in the cluster be optimized or reloaded, or only the selected core? Best regards, Vincenzo
Re: Range facets in sharded search
This looks like a bug. The logic to merge range facets from shards seems to only be merging counts, not the first level elements. Could you create a Jira? On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote: I am seeing some odd behavior with range facets across multiple shards. When querying each node directly with distrib=false the facet returned matches what is expected. When doing the same query against the collection and it spans the two shards, the facet after and between buckets are wrong. I can re-create a similar problem using the out-of-the-box example scripts and data. I am running on Windows and tested both Solr 5.0.0 and 5.1.0. These are the steps to reproduce: c:\solr-5.1.0\solr -e cloud These are the selections I made:
(specify 1-4 nodes) [2]: 2
Please enter the port for node1 [8983]: 8983
Please enter the port for node2 [7574]: 7574
Please provide a name for your new collection: [gettingstarted] gettingstarted
How many shards would you like to split gettingstarted into? [2] 2
How many replicas per shard would you like to create? [2] 1
Please choose a configuration ... [data_driven_schema_configs] sample_techproducts_configs
I then posted some of the sample XMLs: C:\solr-5.1.0\example\exampledocs java -Dc=gettingstarted -jar post.jar vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml, monitor2.xml, mp500.xml, sd500.xml This first query is against node1 with distrib=false: http://localhost:8983/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND There are 7 results (results omitted). facet_ranges:{ price:{ counts:[ 0.0,1, 20.0,0, 40.0,0, 60.0,0, 80.0,1], gap:20.0, start:0.0, end:100.0, before:0, after:5, between:2}}, This second query is against node2 with distrib=false: http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND 7 results (one product does not have a price): facet_ranges:{ price:{ counts:[ 0.0,1, 20.0,0, 40.0,0, 60.0,1, 80.0,0], gap:20.0, start:0.0, end:100.0, before:0, after:4, between:2}}, Finally, querying the entire collection: http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND 14 results (one without a price range): facet_ranges:{ price:{ counts:[ 0.0,2, 20.0,0, 40.0,0, 60.0,1, 80.0,1], gap:20.0, start:0.0, end:100.0, before:0, after:5, between:2}}, Notice that both the after and the between are wrong here. The actual buckets do correctly represent the right values but I would expect between to be 5 and after to be 13. There appears to be a recently fixed issue ( https://issues.apache.org/jira/browse/SOLR-6154 ) with range facets in distributed queries but it was related to buckets not always appearing with mincount=1 for the field. This looks like it is a different problem. Anyone have any suggestions or notice anything wrong with my query parameters? I can open a Jira ticket but wanted to run it by the larger audience first to see if I am missing anything obvious. Thanks, Will
SolrJ Exceptions
I'm trying to identify the difference between an exception when Solr is in a bad state/down vs. when it is up but an invalid request was made (maybe some bad data sent in). The JavaDoc for SolrRequest process() says:

@throws SolrServerException if there is an error on the Solr server
@throws IOException if there is a communication error

So I expected an IOException when Solr was down, but it looks like it actually throws a SolrServerException which has a cause of an IOException. I'm also not sure how SolrException fits into all of this... Is anyone familiar with when to generally expect these types of exceptions? I'm interested in both cloud and stand-alone scenarios, and using Solr 5.0 or 5.1. Thanks, Bryan
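From quick experiments (behavior may vary by version and client), a sketch of telling the cases apart with SolrJ 5.x; the base URL is a placeholder, and the getRootCause check is the pattern in question:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.SolrException;

    public class ExceptionProbe {
        public static void main(String[] args) {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/gettingstarted");
            try {
                new QueryRequest(new SolrQuery("*:*")).process(client);
            } catch (SolrServerException e) {
                // In practice this is what you see when Solr is down/unreachable;
                // the transport failure is wrapped, so inspect the root cause
                if (e.getRootCause() instanceof IOException) {
                    System.out.println("communication problem: " + e.getRootCause());
                }
            } catch (SolrException e) {
                // Unchecked; typically an HTTP error response from a live Solr,
                // e.g. 400 for a malformed request (RemoteSolrException extends this)
                System.out.println("Solr returned error code " + e.code());
            } catch (IOException e) {
                System.out.println("I/O error: " + e);
            }
        }
    }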
SolrCloud 4.8.0 upgrade
Hi All, I have a SolrCloud cluster with 3 servers. I would like to use stats.facet, but this feature is available only if I upgrade to 4.10. May I simply redeploy the new SolrCloud version in Tomcat, or should I reload all the documents? Are there other drawbacks? Best regards, Vincenzo
Re: 5.1 'unique' facet function / calcDistinct
Thanks for the feedback Levan! Could you open a JIRA issue for unique() on numeric/date fields? We don't yet have explicit numeric support for unique() and I think some changes in Lucene 5 broke treating these fields as strings (i.e. the ability to retrieve ords). -Yonik On Thu, Apr 16, 2015 at 7:46 AM, levanDev levandev9...@gmail.com wrote: Hello, We are looking at a couple of options for using Solr to dynamically calculate unique values per field. In testing out Solr 5.1, I've been using the unique() facet function: http://yonik.com/solr-facet-functions/ Overall, loving the JSON Facet API, especially the sub-faceting thus far. Here's my two part question: I. When I use the unique aggregation function on a string field (uniqueValues:'unique(myStringField)'), it works as expected, returns the number of unique values. However when I pass in an int -- or date -- field (uniqueValues:'unique(myIntField)') the resulting count is 0. The cause might be something else, but if it can be replicated by another user, it would be great to discuss the unique function further -- in our current use-case, we have a field where under 20 unique values are present but the values are ints. II. Is there a way to use the stats.calcdistinct functionality and only return the countDistinct portion of the response and not the full list of distinct values -- as provided in the distinctValues portion of the response? In a field with high cardinality the response size becomes too large. If there is no such option, could someone point me in the right direction for implementing a custom solution? Thank you for your time, Levan
Re: Differentiating user search term in Solr
Hi Hoss, Maybe I'm missing something, but I tried this and got 1 hit: http://localhost:8983/solr/db/select?q=title:(Apache%20Solr%20Notes)&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND Then I tried this and got 0 hits: http://localhost:8983/solr/db/select?q={!field%20f=title%20v=$qq}&qq=Apache%20Solr%20Notes&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND It looks to me like f with qq is doing a phrase search, and that's not what I want. The data in the field title is Apache Solr Release Notes I looked over the links you provided and tried out the examples; in each case, if the user-typed text contains any reserved characters, it will fail with a syntax error (the exception is when I used f and qq, but like I said, that gave me 0 hits). If you can give me a concrete example, please do. My need is to pass to Solr the text Apache: Solr Notes (without quotes) and get a hit as if I passed Apache\: Solr Notes ? Thanks Steve On Thu, Apr 16, 2015 at 5:49 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : The summary of your email is: clients must escape the search string to prevent : Solr from failing. : : It would be a nice addition to Solr to provide a new query parameter that : tells it to treat the query text as literal text. Doing so means you : remove the burden placed on clients to understand and escape reserved Solr : / Lucene tokens. i'm a little lost as to what exactly you want to do here -- but i'm going to focus on your thesis statement here, and assume that you want to search on a literal piece of text and you don't want to have to worry about escaping any characters and you don't want solr to treat any part of the query string as special. the only way something like that works is if you only want to search a single field -- searching multiple fields, searching multiple clauses, etc... none of those types of options make sense in this context. people have already mentioned the term parser -- which is fine if you want to search for exactly one literal term, but as a more general solution, what people usually want, is the field parser -- which works better with TextFields in general... https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FieldQueryParser Just like the comment you've seen about the term parser needing an f localparam to specify the field, the same is true for the field parser. but variable references make this trivial to specify -- instead of using the full {!field f=myfield}Foo Bar syntax in your q param, you can use an alternate param (qq is common in many examples) for the raw data from the user... q={!field f=myfield v=$qq} qq=whatever your user types https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries -Hoss http://www.lucidworks.com/
Re: Range facets in sharded search
Should be fixed in 5.2. See https://issues.apache.org/jira/browse/SOLR-7412 On Thu, Apr 16, 2015 at 3:18 PM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: This looks like a bug. The logic to merge range facets from shards seems to only be merging counts, not the first level elements. Could you create a Jira? On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote: I am seeing some odd behavior with range facets across multiple shards. When querying each node directly with distrib=false the facet returned matches what is expected. When doing the same query against the collection and it spans the two shards, the facet after and between buckets are wrong. I can re-create a similar problem using the out-of-the-box example scripts and data. I am running on Windows and tested both Solr 5.0.0 and 5.1.0. These are the steps to reproduce: c:\solr-5.1.0\solr -e cloud These are the selections I made:
(specify 1-4 nodes) [2]: 2
Please enter the port for node1 [8983]: 8983
Please enter the port for node2 [7574]: 7574
Please provide a name for your new collection: [gettingstarted] gettingstarted
How many shards would you like to split gettingstarted into? [2] 2
How many replicas per shard would you like to create? [2] 1
Please choose a configuration ... [data_driven_schema_configs] sample_techproducts_configs
I then posted some of the sample XMLs: C:\solr-5.1.0\example\exampledocs java -Dc=gettingstarted -jar post.jar vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml, monitor2.xml, mp500.xml, sd500.xml This first query is against node1 with distrib=false: http://localhost:8983/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND There are 7 results (results omitted). facet_ranges:{ price:{ counts:[ 0.0,1, 20.0,0, 40.0,0, 60.0,0, 80.0,1], gap:20.0, start:0.0, end:100.0, before:0, after:5, between:2}}, This second query is against node2 with distrib=false: http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND 7 results (one product does not have a price): facet_ranges:{ price:{ counts:[ 0.0,1, 20.0,0, 40.0,0, 60.0,1, 80.0,0], gap:20.0, start:0.0, end:100.0, before:0, after:4, between:2}}, Finally, querying the entire collection: http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND 14 results (one without a price range): facet_ranges:{ price:{ counts:[ 0.0,2, 20.0,0, 40.0,0, 60.0,1, 80.0,1], gap:20.0, start:0.0, end:100.0, before:0, after:5, between:2}}, Notice that both the after and the between are wrong here. The actual buckets do correctly represent the right values but I would expect between to be 5 and after to be 13. There appears to be a recently fixed issue ( https://issues.apache.org/jira/browse/SOLR-6154 ) with range facets in distributed queries but it was related to buckets not always appearing with mincount=1 for the field. This looks like it is a different problem. Anyone have any suggestions or notice anything wrong with my query parameters?
I can open a Jira ticket but wanted to run it by the larger audience first to see if I am missing anything obvious. Thanks, Will
generate uuid/ id for table which do not have any primary key
How to generate a uuid/id (maybe in data-config.xml...) for a table which does not have any primary key? Scenario: Using DIH I need to import data from a database, but the table does not have any primary key. I do have uuid defined in schema.xml as <field name="uuid" type="uuid" indexed="true" stored="true" required="true" multiValued="false"/> <uniqueKey>uuid</uniqueKey> data-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource batchSize="2000" name="test" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@ldap:" user="myUser" password="pwd"/>
  <document>
    <entity name="test_entity" docRoot="true" dataSource="test" query="select name, age from test_user">
    </entity>
  </document>
</dataConfig>

Error: Document is missing mandatory uniqueKey field: uuid
Re: generate uuid/ id for table which do not have any primary key
You seem to have defined the field, but not populating it in the query. Use a combination of fields to come up with a unique id that can be assigned to uuid. Does that make sense? Kaushik On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com wrote: How to generate uuid/ id (maybe in data-config.xml...) for table which do not have any primary key. Scenario : Using DIH I need to import data from database but table does not have any primary key I do have uuid defined in schema.xml and is field name=uuid type=uuid indexed=true stored=true required=true multiValued=false/ uniqueKeyuuid/uniqueKey data-config.xml ?xml version=1.0 encoding=UTF-8 ? dataConfig dataSource batchSize=2000 name=test type=JdbcDataSource driver=oracle.jdbc.OracleDriver url=jdbc:oracle:thin:@ldap: user=myUser password=pwd/ document entity name=test_entity docRoot=true dataSource=test query=select name, age from test_user /entity /document /dataConfig Error : Document is missing mandatory uniqueKey field: uuid
Re: generate uuid/ id for table which do not have any primary key
This seems relevant: http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid Best, Erick On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote: You seem to have defined the field, but not populating it in the query. Use a combination of fields to come up with a unique id that can be assigned to uuid. Does that make sense? Kaushik On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com wrote: How to generate uuid/ id (maybe in data-config.xml...) for table which do not have any primary key. Scenario : Using DIH I need to import data from database but table does not have any primary key I do have uuid defined in schema.xml and is field name=uuid type=uuid indexed=true stored=true required=true multiValued=false/ uniqueKeyuuid/uniqueKey data-config.xml ?xml version=1.0 encoding=UTF-8 ? dataConfig dataSource batchSize=2000 name=test type=JdbcDataSource driver=oracle.jdbc.OracleDriver url=jdbc:oracle:thin:@ldap: user=myUser password=pwd/ document entity name=test_entity docRoot=true dataSource=test query=select name, age from test_user /entity /document /dataConfig Error : Document is missing mandatory uniqueKey field: uuid