Re: [ANNOUNCE] Apache Solr 5.0.0 and Reference Guide for Solr 5.0 released
Awesome news. Thanks.

Sebastián Ramírez
Algorithm Designer
http://www.senseta.com
Tel: (+571) 795 7950 ext: 1012  Cel: (+57) 300 370 77 10
Calle 73 No 7 - 06 Piso 4
LinkedIn: co.linkedin.com/in/tiangolo/  Twitter: @tiangolo https://twitter.com/tiangolo
Email: sebastian.rami...@senseta.com
www.senseta.com

On Fri, Feb 20, 2015 at 3:55 PM, Anshum Gupta ans...@anshumgupta.net wrote:

20 February 2015, Apache Solr™ 5.0.0 and Reference Guide for Solr 5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 5.0.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault-tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 5.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 5.0 Release Highlights:

* Usability improvements that include improved bin scripts and new and restructured examples.
* Scripts to support installing and running Solr as a service on Linux.
* Distributed IDF is now supported and can be enabled via the config. Currently there are four supported implementations:
  * LocalStatsCache: local document stats.
  * ExactStatsCache: one-time-use aggregation.
  * ExactSharedStatsCache: stats shared across requests.
  * LRUStatsCache: stats shared in an LRU cache across requests.
* Solr will no longer ship a war file and will instead be a downloadable application.
* SolrJ now has first-class support for the Collections API.
* Implicit registration of replication, get and admin handlers.
* Config API that supports paramsets for easily configuring Solr parameters and configuring fields. This API also supports managing pre-existing request handlers and editing common solrconfig.xml settings via overlay.
* API for managing blobs that allows uploading request handler jars and registering them via the Config API.
* BALANCESHARDUNIQUE Collection API call that allows for even distribution of custom replica properties.
* There's now an option to not shuffle the nodeSet provided during collection creation.
* Option to configure bandwidth usage by the replication handler to prevent it from using up all the bandwidth.
* Splitting of clusterstate to per-collection enables scalability improvements in SolrCloud. This is also the default format for new collections created going forward.
* timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient request retry.
* pivot.facet results can now include nested stats.field results constrained by those pivots.
* stats.field can be used to generate stats over the results of arbitrary numeric functions. It also allows requesting statistics for pivot facets using tags.
* A new DateRangeField has been added for indexing date ranges, especially multi-valued ones.
* Spatial fields that used to require units=degrees now take distanceUnits=degrees/kilometers/miles instead.
* MoreLikeThis query parser allows requesting documents similar to an existing document and also works in SolrCloud mode.
* Logging improvements:
  * Transaction log replay status is now logged.
  * Optional logging of slow requests.

Solr 5.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release.

Detailed change log: http://lucene.apache.org/solr/5_0_0/changes/Changes.html

Also available is the Solr Reference Guide for Solr 5.0. This 535-page PDF serves as the definitive user's manual for Solr 5.0.
It can be downloaded from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html).

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.

--
Anshum Gupta
http://about.me/anshumgupta

--
*This e-mail transmission, including any attachments, is intended only for the named recipient(s) and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this transmission in error, or are not the named recipient(s), please notify Senseta immediately by return e-mail and permanently delete this transmission, including any attachments.*
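A note on the distributed IDF highlight above: the implementation is selected with a statsCache entry in solrconfig.xml. A minimal sketch (swap the class for any of the four implementations listed; leaving the element out gives you LocalStatsCache):

```xml
<!-- solrconfig.xml: enable distributed IDF by choosing a StatsCache
     implementation; omit the element entirely for the LocalStatsCache
     default. -->
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
```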
Re: [ANN] Lucidworks Fusion 1.0.0
It's good to know you'll talk about it at Lucene/Solr Revolution 2014 too.

Sebastián Ramírez

On Wed, Sep 24, 2014 at 6:13 AM, Grant Ingersoll gsing...@apache.org wrote:

Hi Thomas,

Thanks for the question. Yes, I give a brief demo of it in action during my talk and we will have demos at our booth. I will also give a demo during the webinar, which will be recorded. As others have said as well, you can simply download it and try it yourself.

Cheers,
Grant

On Sep 23, 2014, at 2:00 AM, Thomas Egense thomas.ege...@gmail.com wrote:

Hi Grant.
Will there be a Fusion demonstration/presentation at Lucene/Solr Revolution DC? (Not listed in the program yet.)

Thomas Egense

On Mon, Sep 22, 2014 at 3:45 PM, Grant Ingersoll gsing...@apache.org wrote:

Hi All,

We at Lucidworks are pleased to announce the release of Lucidworks Fusion 1.0. Fusion is built to overlay on top of Solr (in fact, you can manage multiple Solr clusters -- think QA, staging and production -- all from our Admin). In other words, if you already have Solr, simply point Fusion at your instance and get all kinds of goodies like Banana (https://github.com/LucidWorks/Banana -- our port of Kibana to Solr, plus a number of extensions that Kibana doesn't have), collaborative-filtering-style recommendations (without the need for Hadoop or Mahout!), a modern signal capture framework, analytics, NLP integration, boosting/blocking and other relevance tools, flexible index- and query-time pipelines, as well as a myriad of connectors ranging from Twitter to web crawling to SharePoint.

The best part of all this? It all leverages the infrastructure that you know and love: Solr. Want recommendations? Deploy more Solr. Want log analytics? Deploy more Solr. Want to track important system metrics? Deploy more Solr.
Fusion represents our commitment as a company to continue to contribute a large quantity of enhancements to the core of Solr, while complementing and extending those capabilities with value-adds that integrate a number of 3rd-party (e.g. connectors) and home-grown capabilities, like an all-new responsive UI built in AngularJS. Fusion is not a fork of Solr. We do not hide Solr in any way. In fact, our goal is that your existing applications will work out of the box with Fusion, allowing you to take advantage of new capabilities without overhauling your existing application.

If you want to learn more, please feel free to join our technical webinar on October 2: http://lucidworks.com/blog/say-hello-to-lucidworks-fusion/. If you'd like to download: http://lucidworks.com/product/fusion/.

Cheers,
Grant Ingersoll

Grant Ingersoll | CTO
gr...@lucidworks.com | @gsingers
http://www.lucidworks.com
Is it possible to cluster on search results but return only clusters?
I have this query / URL:

http://example.com:8983/solr/collection1/clustering?q=%28title:%22+Atlantis%22+~100+OR+content:%22+Atlantis%22+~100%29&rows=3001&carrot.snippet=content&carrot.title=title&wt=xml&indent=true&sort=date+DESC

With that, I get the results and also the clustering of those results. What I want is just the clusters of the results, not the results themselves, because returning the results is consuming too much bandwidth.

I know I can write a proxy script that gets the response from Solr, filters out the results and returns only the clusters, but I first want to check whether it's possible with just the parameters of Solr or Carrot.

Thanks in advance,

Sebastián Ramírez
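On the proxy-script idea mentioned in the question: if you do end up filtering the response outside Solr, the transformation itself is tiny. A sketch in Python (this assumes wt=json rather than the wt=xml used above, so the response parses into a dict; the key names follow the standard clustering response):

```python
def clusters_only(solr_response):
    """Strip the per-document results from a parsed Solr clustering
    response, keeping only the header and the clusters, so the payload
    returned to the client stays small."""
    return {
        "responseHeader": solr_response.get("responseHeader", {}),
        "clusters": solr_response.get("clusters", []),
    }

# Trimmed-down example of what Solr might return:
resp = {
    "responseHeader": {"status": 0, "QTime": 12},
    "response": {"numFound": 2, "docs": [{"id": "a"}, {"id": "b"}]},
    "clusters": [{"labels": ["Atlantis"], "docs": ["a", "b"]}],
}
print(clusters_only(resp))
# → {'responseHeader': {'status': 0, 'QTime': 12}, 'clusters': [{'labels': ['Atlantis'], 'docs': ['a', 'b']}]}
```

The bandwidth saving comes from dropping the "response" section, which carries the stored fields of every document and dwarfs the cluster labels.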
Re: Why do people want to deploy to Tomcat?
I agree with Doug. When I started, I had to spend some time figuring out what was just an example and what I would have to change in a production environment... until I found that the whole example was ready for production. Of course, you commonly have to change the settings, parameters, fields, etc. of your Solr system, but the example doesn't have anything that is not for production.

Sebastián Ramírez

On Tue, Nov 12, 2013 at 8:18 AM, Amit Aggarwal amit.aggarwa...@gmail.com wrote:

Agreed with Doug

On 12-Nov-2013 6:46 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

As an aside, I think one reason people feel compelled to deviate from the distributed jetty distribution is because the folder is named "example". I've had to explain to a few clients that this is a bit of a misnomer. The IT dept especially sees "example" and feels uncomfortable using that as a starting point for a jetty install. I wish it was called "default" or "bin" or something where it's more obviously the default jetty distribution of Solr.

On Tue, Nov 12, 2013 at 7:06 AM, Roland Everaert reveatw...@gmail.com wrote:

In my case, the first time I had to deploy and configure solr on tomcat (and jboss) it was a requirement to reuse as much as possible the application/web servers already in place. The next deployment I also used tomcat, because I was used to deploying on tomcat and I didn't know jetty at all. I could ask the same question with regard to jetty: why use/bundle (if not recommend) jetty with solr over other webserver solutions?

Regards,
Roland Everaert.

On Tue, Nov 12, 2013 at 12:33 PM, Alvaro Cabrerizo topor...@gmail.com wrote:

In my case, the selection of the servlet container has never been a hard requirement. I mean, some customers provide us a virtual machine configured with java/tomcat, others have a tomcat installed and want to share it with solr, others prefer jetty because their sysadmins are used to configuring it...
At least in the projects I've been working on, the selection of the servlet engine has not been a key factor in the project's success.

Regards.

On Tue, Nov 12, 2013 at 12:11 PM, Andre Bois-Crettez andre.b...@kelkoo.com wrote:

We are using Solr running on Tomcat. I think the top reasons for us are:
- we already have nagios monitoring plugins for tomcat that trace queries ok/error, http codes / response time etc. in access logs, number of threads, jvm memory usage etc.
- start, stop, watchdogs, logs: we also use our standard tools for that
- what about security filters? Is that possible with jetty?

André

On 11/12/2013 04:54 AM, Alexandre Rafalovitch wrote:

Hello,

I keep seeing here and on Stack Overflow people trying to deploy Solr to Tomcat. We don't usually ask why, we just help where we can. But the question happens often enough that I am curious. What is the actual business case? Is it because Tomcat is well known? Is it because other apps are running under Tomcat and it is ops' requirement? Is it because Tomcat gives something to Solr that Jetty does not?

It might be useful to know. Especially since the Solr team is considering making the server part into a black-box component. What use cases will that break?

So, if somebody runs Solr under Tomcat (or needed to and gave up), let's use this thread to collect this knowledge.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

--
André Bois-Crettez
Software Architect, Search Developer
http://www.kelkoo.com/

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

[This message and its attachments are confidential and intended solely for their addressees. If you are not the intended recipient of this message, please destroy it and notify the sender.]

--
Doug Turnbull
Search & Big Data Architect
OpenSource Connections
http://o19s.com
Re: Replica shards not updating their index when update is sent to them
I found how to solve the problem.

After sending a file to be indexed to a replica shard (node2):

curl 'http://node2:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary '<add><doc><field name="id">asdf</field><field name="content">big moth</field></doc></add>'

I can send a commit to the same shard and then it gets updated:

curl 'http://node2:8983/solr/update?commit=true'

Another option is to send, from the beginning, a commitWithin param with some milliseconds instead of a commit directly. That way, the commit happens at most the specified milliseconds after, but the changes get reflected in all shards, including the replica shard that received the update request:

curl 'http://node2:8983/solr/update?commitWithin=1'

As these emails get archived, I hope this may help someone in the future.

Sebastián Ramírez

On Mon, May 20, 2013 at 4:32 PM, Sebastián Ramírez sebastian.rami...@senseta.com wrote:

[quoted message trimmed; the full text appears elsewhere in this thread]

On Mon, May 20, 2013 at 3:30 PM, Yonik Seeley yo...@lucidworks.com wrote:

I've never seen that before. The replica that received the update isn't treated as special in any way by the code, so it's not clear how this could happen. What version of Solr is this (and does it happen with the latest version)? How easy is this to reproduce for you?

-Yonik
http://lucidworks.com
Replica shards not updating their index when update is sent to them
Hello,

I'm having a little problem with a test SolrCloud cluster. I've set up 3 nodes (SolrCores) to use an external Zookeeper. I use 1 shard and the other 2 SolrCores are being auto-assigned as replicas.

Let's say I have these 3 nodes: the leader shard A, the replica shard B, and the (other) replica shard C.

I can send queries to any node (A, B or C) and I get the results. I can send updates to the leader shard (A) and get correct (updated) results from any of the 3 shards (A, B, or C).

* Here is the problem: when I send an update to a non-leader (replica) shard (B), the updated results are reflected in the leader shard (A) and in the other replica shard (C), but not in the shard that received the update (B).

I can do this same process, sending the update to the other non-leader shard (C), and the same happens: I get the results in the leader (A) and in the other replica shard (B), but not in the shard that received the update (C).

Any suggestions? Thanks!

Sebastián Ramírez
Re: Replica shards not updating their index when update is sent to them
Yes, it's happening with the latest version, 4.2.1. Yes, it's easy to reproduce. It happened using 3 virtual machines and also happened using 3 physical nodes. Here are the details:

I installed Hortonworks (a Hadoop distribution) on the 3 nodes. That installs Zookeeper. I used the example directory and copied it to the 3 nodes. I start Zookeeper on the 3 nodes.

The first time, I run this command on each node to start Solr:

java -jar -Dbootstrap_conf=true -DzkHost='node1,node2,node3' start.jar

As I understand it, -Dbootstrap_conf=true uploads the configuration to Zookeeper, so I don't need to do that the following times that I start each SolrCore. So, the following times, I run this on each node:

java -jar -DzkHost='node0,node1,node2' start.jar

Because I ran that command on node0 first, that node became the leader shard.

I send an update to the leader shard (in this case node0). I run:

curl 'http://node0:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary '<add><doc><field name="id">asdf</field><field name="content">buggy</field></doc></add>'

When I query any shard I get the correct result. I run:

curl 'http://node0:8983/solr/select?q=id:asdf'
curl 'http://node1:8983/solr/select?q=id:asdf'
curl 'http://node2:8983/solr/select?q=id:asdf'

(i.e. I send the query to each node), and then I get the expected response:

... <doc><str name="id">asdf</str><arr name="content"><str>buggy</str></arr> ... </doc> ...

But when I send an update to a replica shard (node2), it is updated only in the leader shard (node0) and in the other replica (node1), not in the shard that received the update (node2). I send an update to the replica node2. I run:

curl 'http://node2:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary '<add><doc><field name="id">asdf</field><field name="content">big moth</field></doc></add>'

Then I query each node and I receive the updated results only from the leader shard (node0) and the other replica shard (node1).
I run (leader, node0): curl 'http://node0:8983/solr/select?q=id:asdf' and I get:

... <doc><str name="id">asdf</str><arr name="content"><str>big moth</str></arr> ... </doc> ...

I run (other replica, node1): curl 'http://node1:8983/solr/select?q=id:asdf' and I get:

... <doc><str name="id">asdf</str><arr name="content"><str>big moth</str></arr> ... </doc> ...

I run (first replica, the one that received the update, node2): curl 'http://node2:8983/solr/select?q=id:asdf' and I get (the old result):

... <doc><str name="id">asdf</str><arr name="content"><str>buggy</str></arr> ... </doc> ...

Thanks for your interest,

Sebastián Ramírez

On Mon, May 20, 2013 at 3:30 PM, Yonik Seeley yo...@lucidworks.com wrote:

On Mon, May 20, 2013 at 4:21 PM, Sebastián Ramírez sebastian.rami...@senseta.com wrote:

When I send an update to a non-leader (replica) shard (B), the updated results are reflected in the leader shard (A) and in the other replica shard (C), but not in the shard that received the update (B).

I've never seen that before. The replica that received the update isn't treated as special in any way by the code, so it's not clear how this could happen. What version of Solr is this (and does it happen with the latest version)? How easy is this to reproduce for you?

-Yonik
http://lucidworks.com
Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
Hello everyone,

I'm having a problem indexing content from OpenDocument format files, i.e. files created with OpenOffice and LibreOffice (odt, ods...). Tika is able to read the files but Solr is not indexing the content. It's not a problem of committing or something like that; after I post a file it is indexed and all the metadata is indexed/stored, but the content isn't there.

- I modified the solrconfig.xml file to catch everything:

<requestHandler name="/update/extract" ...>
  <!-- here is the interesting part -->
  <!-- <str name="uprefix">ignored_</str> -->
  <str name="defaultField">all_txt</str>

- Then I submitted the file to Solr:

curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' --data-binary @test_ods.ods

- Now when I do a search in Solr I get this result; there is something in the content, but it's not the actual content of the original file:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">newods</str>
    <arr name="all_txt">
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>0</str>
      <str>P0D</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>0</str>
      <str>application/ods</str>
      <str>0</str>
      <str>7322</str>
      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
      <str>2013-05-03T10:02:50.54</str>
    </arr>
    <date name="last_modified">2013-05-03T10:02:50Z</date>
    <arr name="content_type">
      <str>application/vnd.oasis.opendocument.spreadsheet</str>
    </arr>
    <arr name="content">
      <str> ??? Page ??? (???) 00/00/, 00:00:00 Page /</str>
    </arr>
    <long name="_version_">1434658995848609792</long>
  </doc>
</result>

- I ask Solr to show me the extracted content from Tika by doing this:

curl 'http://localhost:8983/solr/update/extract?extractOnly=true' -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' --data-binary @test_ods.ods

- And I get the XHTML extracted by Tika, including the original file contents and that final part that Solr is indeed indexing. So Tika is able to read the file, but Solr is not indexing the real content; it only indexes the rest:

<body>
  <table>
    <tr><td><p>test</p></td></tr>
    <tr><td><p>de</p></td></tr>
    <tr><td><p>ods</p></td></tr>
  </table>
  <p xmlns="http://www.w3.org/1999/xhtml">???</p>
  <p>Page</p>
  <p>??? (???)</p>
  <p>00/00/, 00:00:00</p>
  <p>Page / </p>
</body>

Do any of you know how to fix/work around this problem?

Thanks!

Sebastián Ramírez
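For what it's worth, the extractOnly output shown in this thread is plain XHTML, so it's easy to check programmatically what Tika actually extracted before blaming the indexing side. A sketch in Python with the stdlib parser (the sample string here is shortened from the response above):

```python
import xml.etree.ElementTree as ET

# Shortened version of the XHTML body Tika returns for the test spreadsheet.
xhtml = """<body xmlns="http://www.w3.org/1999/xhtml">
<table><tr><td><p>test</p></td></tr>
<tr><td><p>de</p></td></tr>
<tr><td><p>ods</p></td></tr></table>
<p>Page</p>
</body>"""

root = ET.fromstring(xhtml)
ns = "{http://www.w3.org/1999/xhtml}"
# The spreadsheet cell contents live in <p> elements, so collect their text.
texts = [p.text for p in root.iter(ns + "p") if p.text and p.text.strip()]
print(texts)  # → ['test', 'de', 'ods', 'Page']
```

If the cell text ("test", "de", "ods") shows up here but not in the indexed content field, the loss is happening between Tika's SAX events and Solr's field mapping, which matches what the thread concludes.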
Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
Thanks for your reply Jack!

First: LOL.

Second: I'm using the latest version of LibreOffice, but with the extractOnly param in the Solr request it shows the content of the file, so Tika is able to read and extract the data but Solr isn't indexing that data.

Third: I already did that with no luck. I tried application/vnd.oasis.opendocument.spreadsheet, application/ods and application/octet-stream, but always got the same result.

Following the documentation for ExtractingRequestHandler (http://wiki.apache.org/solr/ExtractingRequestHandler#Concepts), I see that Tika reads the file and feeds it to a SAX ContentHandler, and Solr then reacts to Tika's SAX events and creates the fields to index. I think that the problem might be somewhere in that process of feeding the SAX ContentHandler, or in the reaction of Solr to those SAX events. Do you (or anyone else) know how one could configure / debug that SAX ContentHandler?

Thanks,

Sebastián Ramírez

On Fri, May 10, 2013 at 10:57 AM, Jack Krupansky j...@basetechnology.com wrote:

Switching to Microsoft Office will probably solve your problem! Sorry, I couldn't resist.

Are you using a really new or really old version of the ODT/ODS software? I mean, maybe Tika doesn't have support for that version. Check the mime type that Tika generates - maybe you just need to override it to force Tika to use the proper format.

-- Jack Krupansky

-----Original Message----- From: Sebastián Ramírez
Sent: Friday, May 10, 2013 11:24 AM
To: solr-user@lucene.apache.org
Subject: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

[quoted original message trimmed; the full text appears earlier in this thread]
Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
Thanks Walter and Alex,

You are right, Walter. In fact, if I'm not wrong, Tika doesn't use an external parser for those formats as it does with MS Office files or PDFs; it uses Java ZIP and XML libraries to parse those files directly. I guess that would be my last resort, but I would certainly like to make Tika process my files without the overhead of building a somewhat complicated program that extracts the contents of the file when, maybe, Tika could do that for me.

I think that could be very related, Alex. I don't know exactly what the mapper does, but what you describe seems quite similar. I'm able to generate the XHTML from Tika with the original document content, but Solr doesn't index that content from the XHTML. So maybe it's a bug in Solr Cell / ExtractingRequestHandler / Tika, right?

Thanks,

Sebastián Ramírez

On Fri, May 10, 2013 at 1:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

On Fri, May 10, 2013 at 11:24 AM, Sebastián Ramírez sebastian.rami...@senseta.com wrote: Hello everyone, I'm having a problem indexing content from OpenDocument format files, i.e. the files created with OpenOffice and LibreOffice (odt, ods...).

I wonder if it is connected to https://issues.apache.org/jira/browse/SOLR-4530 where the default Tika mapper actually keeps very little of the XHTML it gets. I fixed it for DIH in 4.3, but haven't looked at the CELL yet.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch

- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

--
*This e-mail transmission, including any attachments, is intended only for the named recipient(s) and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this transmission in error, or are not the named recipient(s), please notify Senseta immediately by return e-mail and permanently delete this transmission, including any attachments.*
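The last-resort approach mentioned above, parsing the OpenDocument ZIP and its XML directly, is small enough to sketch. ODF files (.odt, .ods) are ZIP archives whose body text lives in `content.xml`. This is a hypothetical stand-alone extractor, not part of Solr or Tika; the function name and the simplistic whitespace handling are my own assumptions:

```python
import zipfile
import xml.etree.ElementTree as ET

def extract_odf_text(path):
    """Return the concatenated text content of an ODF document.

    `path` may be a filename or a file-like object, since
    zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(path) as z:
        # In ODF archives the document body is always in content.xml.
        xml_bytes = z.read("content.xml")
    root = ET.fromstring(xml_bytes)
    # itertext() walks every element and yields all text nodes,
    # regardless of which ODF namespaces the elements use.
    return " ".join(t.strip() for t in root.itertext() if t.strip())
```

This sidesteps Tika entirely, so it recovers cell/paragraph text but none of the metadata that Solr Cell was indexing correctly.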
Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
Many thanks Jack for your attention and effort on solving the problem.

Best,

Sebastián Ramírez

On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky j...@basetechnology.com wrote:

I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail to index the proper content, both for .ODP and .ODT files. If I do extractOnly=true&extractFormat=text, I see the extracted text clearly, in addition to the metadata. I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the problem. I just see spaces in both cases. But whether the problem is due to Solr or Tika is not apparent. In any case, a Jira is warranted.

-- Jack Krupansky

-----Original Message----- From: Sebastián Ramírez Sent: Friday, May 10, 2013 11:24 AM To: solr-user@lucene.apache.org Subject: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

Hello everyone, I'm having a problem indexing content from OpenDocument format files, i.e. the files created with OpenOffice and LibreOffice (odt, ods...). Tika is able to read the files but Solr is not indexing the content. It's not a problem of committing or something like that; after I post a file it is indexed and all the metadata is indexed/stored, but the content isn't there.

- I modified the solrconfig.xml file to catch everything:

<requestHandler name="/update/extract" ...>
  <!-- here is the interesting part -->
  <!-- <str name="uprefix">ignored_</str> -->
  <str name="defaultField">all_txt</str>

- Then I submitted the file to Solr:

curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' --data-binary @test_ods.ods

- Now when I do a search in Solr I get this result; there is something in the content, but that's not the actual content of the original file:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">newods</str>
    <arr name="all_txt">
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>0</str>
      <str>P0D</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>0</str>
      <str>application/ods</str>
      <str>0</str>
      <str>7322</str>
      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
      <str>2013-05-03T10:02:50.54</str>
    </arr>
    <date name="last_modified">2013-05-03T10:02:50Z</date>
    <arr name="content_type">
      <str>application/vnd.oasis.opendocument.spreadsheet</str>
    </arr>
    <arr name="content">
      <str> ??? Page ??? (???) 00/00/, 00:00:00 Page / </str>
    </arr>
    <long name="_version_">1434658995848609792</long>
  </doc>
</result>

- I ask Solr to show me the extracted content from Tika doing this:

curl 'http://localhost:8983/solr/update/extract?extractOnly=true' -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' --data-binary @test_ods.ods

- And I get the XHTML extracted from Tika, including the original file contents and that final part that Solr is indeed indexing. So Tika is able to read the file, but Solr is not indexing the real content; it only indexes the rest:

<body>
  <table>
    <tr><td><p>test</p></td></tr>
    <tr><td><p>de</p></td></tr>
    <tr><td><p>ods</p></td></tr>
  </table>
  <p xmlns="http://www.w3.org/1999/xhtml">???</p>
  <p>Page</p>
  <p>??? (???)</p>
  <p>00/00/, 00:00:00</p>
  <p>Page / </p>
</body>

Do any of you know how to fix/work around this problem?

Thanks!

Sebastián Ramírez
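The gap between what Tika extracts and what Solr keeps is easy to see by parsing that XHTML yourself. A minimal sketch, using an abridged copy of the XHTML from the message above (the namespace attribute is dropped here so stdlib XML parsing stays simple):

```python
import xml.etree.ElementTree as ET

# Abridged XHTML as returned by Tika for the sample .ods: the real
# document text sits inside the <table>, while only the trailing
# footer paragraphs made it into Solr's content field.
xhtml = """<body>
<table>
<tr><td><p>test</p></td></tr>
<tr><td><p>de</p></td></tr>
<tr><td><p>ods</p></td></tr>
</table>
<p>Page</p>
<p>00/00/, 00:00:00</p>
</body>"""

body = ET.fromstring(xhtml)
table_text = [p.text for p in body.findall("./table/tr/td/p")]
footer_text = [p.text for p in body.findall("./p")]
print(table_text)   # the content Solr never indexed
print(footer_text)  # roughly what Solr did index
```

Both lists are non-empty, which is consistent with Alex's suspicion: the content is present in Tika's XHTML, and it is the mapping step that discards the table text.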
Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
OK Jack, I'll switch to MS Office... hahaha.

Many thanks for your interest and help... and the bug report in JIRA.

Best,

Sebastián Ramírez

On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky j...@basetechnology.com wrote:

I filed SOLR-4809 - "OpenOffice document body is not indexed by SolrCell", including some test files. https://issues.apache.org/jira/browse/SOLR-4809

Yeah, at this stage, switching to Microsoft Office seems like the best bet!

-- Jack Krupansky
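For anyone hitting this before SOLR-4809 is resolved, the extractOnly diagnostic used earlier in the thread is easy to script. A sketch in Python (stdlib only; the host, port, and handler path simply mirror the curl commands above, the function name is my own, and nothing is sent over the network until you call urlopen, so treat the commented line as an untested outline):

```python
import urllib.parse
import urllib.request

def build_extract_request(ods_path,
                          base="http://localhost:8983/solr/update/extract"):
    """Build (but do not send) the extractOnly request from the thread."""
    # extractFormat=text is the variant Jack used to confirm that the
    # body text is present in Tika's output.
    params = {"extractOnly": "true", "extractFormat": "text"}
    url = base + "?" + urllib.parse.urlencode(params)
    with open(ods_path, "rb") as f:
        data = f.read()
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type":
                 "application/vnd.oasis.opendocument.spreadsheet"},
    )

# To actually run the diagnostic against a live Solr:
# print(urllib.request.urlopen(build_extract_request("test_ods.ods")).read())
```

If the response shows your document text but a normal extract-and-commit run does not, the loss is happening in the Solr Cell mapping step, not in Tika.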