Re: [ANNOUNCE] Apache Solr 5.0.0 and Reference Guide for Solr 5.0 released

2015-02-24 Thread Sebastián Ramírez
Awesome news. Thanks.


*Sebastián Ramírez*
Algorithm Designer

 http://www.senseta.com

 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Linkedin: co.linkedin.com/in/tiangolo/
 Twitter: @tiangolo https://twitter.com/tiangolo
 Email: sebastian.rami...@senseta.com
 www.senseta.com

On Fri, Feb 20, 2015 at 3:55 PM, Anshum Gupta ans...@anshumgupta.net
wrote:

 20 February 2015, Apache Solr™ 5.0.0 and Reference Guide for Solr 5.0
 available

 The Lucene PMC is pleased to announce the release of Apache Solr 5.0.0

 Solr is the popular, blazing fast, open source NoSQL search platform
 from the Apache Lucene project. Its major features include powerful
 full-text search, hit highlighting, faceted search, dynamic
 clustering, database integration, rich document (e.g., Word, PDF)
 handling, and geospatial search.  Solr is highly scalable, providing
 fault tolerant distributed search and indexing, and powers the search
 and navigation features of many of the world's largest internet sites.

 Solr 5.0 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

 See the CHANGES.txt file included with the release for a full list of
 details.

 Solr 5.0 Release Highlights:

  * Usability improvements that include improved bin scripts and new and
 restructured examples.

  * Scripts to support installing and running Solr as a service on Linux.

  * Distributed IDF is now supported and can be enabled via the config.
 Currently, there are four supported implementations:
 * LocalStatsCache: Local document stats.
 * ExactStatsCache: One-time-use aggregation.
 * ExactSharedStatsCache: Stats shared across requests.
 * LRUStatsCache: Stats shared in an LRU cache across requests.
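
 As a minimal sketch of how one of these implementations might be selected
 (this is an illustration, not part of the announcement; the statsCache
 element and the org.apache.solr.search.stats package names follow the Solr
 reference guide, so verify them against your version):

```shell
# Illustrative sketch: select a distributed-IDF implementation by adding a
# statsCache element to solrconfig.xml. The element name and class package
# are assumptions to check against the Solr reference guide.
cat > statsCache-snippet.xml <<'EOF'
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
EOF
cat statsCache-snippet.xml
```

 Swapping ExactStatsCache for LRUStatsCache (or the other classes above)
 picks a different trade-off between stats accuracy and per-request
 aggregation cost.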

  * Solr will no longer ship as a war file; it will instead be a downloadable
 application.

  * SolrJ now has first-class support for the Collections API.

  * Implicit registration of replication, get, and admin handlers.

  * Config API that supports paramsets for easily configuring Solr
 parameters and configuring fields. This API also supports managing
 pre-existing request handlers and editing common solrconfig.xml via
 overlay.

  * API for managing blobs allows uploading request handler jars and
 registering them via config API.

  * BALANCESHARDUNIQUE Collection API that allows for even distribution of
 custom replica properties.

  * There's now an option to not shuffle the nodeSet provided during
 collection creation.

  * Option to configure the bandwidth used by the replication handler, to
 prevent it from saturating the network.

  * Splitting clusterstate into per-collection state improves scalability
 in SolrCloud. This is also the default format for new
 collections created going forward.

  * timeAllowed is now used to prematurely terminate requests during query
 expansion and SolrClient request retry.
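
 As a hedged illustration of how the parameter is passed (the host,
 collection, and 500 ms budget below are placeholders, not values from the
 announcement):

```shell
# Hypothetical example: cap the time Solr may spend on a query via the
# timeAllowed parameter (milliseconds). Host and collection are placeholders.
SOLR="http://localhost:8983/solr/collection1/select"
URL="${SOLR}?q=*:*&timeAllowed=500"
echo "$URL"
# Against a running instance this would be fetched with: curl "$URL"
```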

  * pivot.facet results can now include nested stats.field results
 constrained by those pivots.

  * stats.field can be used to generate stats over the results of arbitrary
 numeric functions.
 It also allows requesting statistics for pivot facets using tags.

  * A new DateRangeField has been added for indexing date ranges, especially
 multi-valued ones.

  * Spatial fields that used to require units=degrees now take
 distanceUnits=degrees/kilometers/miles instead.

  * The MoreLikeThis query parser allows requesting documents similar to an
 existing document and also works in SolrCloud mode.

  * Logging improvements:
 * Transaction log replay status is now logged
 * Optional logging of slow requests.

 Solr 5.0 also includes many other new features as well as numerous
 optimizations and bugfixes of the corresponding Apache Lucene release.

 Detailed change log:
 http://lucene.apache.org/solr/5_0_0/changes/Changes.html

 Also available is the *Solr Reference Guide for Solr 5.0*. This 535 page
 PDF serves as the definitive user's manual for Solr 5.0. It can be
 downloaded
 from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF

 Please report any feedback to the mailing lists
 (http://lucene.apache.org/solr/discussion.html)

 Note: The Apache Software Foundation uses an extensive mirroring network
 for distributing releases.  It is possible that the mirror you are using
 may not have replicated the release yet.  If that is the case, please
 try another mirror.  This also goes for Maven access.

 --
 Anshum Gupta
 http://about.me/anshumgupta


-- 
**
*This e-mail transmission, including any attachments, is intended only for 
the named recipient(s) and may contain information that is privileged, 
confidential and/or exempt from disclosure under applicable law. If you 
have received this transmission in error, or are not the named 
recipient(s), please notify Senseta immediately by return e-mail and 
permanently delete this transmission, including any attachments.*

Re: [ANN] Lucidworks Fusion 1.0.0

2014-09-24 Thread Sebastián Ramírez
It's good to know you'll talk about it at Lucene/Solr Revolution 2014 too.


*Sebastián Ramírez*
Algorithm Designer


On Wed, Sep 24, 2014 at 6:13 AM, Grant Ingersoll gsing...@apache.org
wrote:

 Hi Thomas,

 Thanks for the question, yes, I give a brief demo of it in action during
 my talk and we will have demos at our booth.  I will also give a demo
 during the Webinar, which will be recorded.  As others have said as well,
 you can simply download it and try yourself.

 Cheers,
 Grant

 On Sep 23, 2014, at 2:00 AM, Thomas Egense thomas.ege...@gmail.com
 wrote:

  Hi Grant.
  Will there be a Fusion demonstration/presentation at Lucene/Solr
 Revolution
  DC? (Not listed in the program yet).
 
 
  Thomas Egense
 
  On Mon, Sep 22, 2014 at 3:45 PM, Grant Ingersoll gsing...@apache.org
  wrote:
 
  Hi All,
 
  We at Lucidworks are pleased to announce the release of Lucidworks
 Fusion
  1.0.   Fusion is built to overlay on top of Solr (in fact, you can
 manage
  multiple Solr clusters -- think QA, staging and production -- all from
 our
  Admin). In other words, if you already have Solr, simply point
 Fusion at
  your instance and get all kinds of goodies like Banana (
  https://github.com/LucidWorks/Banana -- our port of Kibana to Solr + a
  number of extensions that Kibana doesn't have), collaborative filtering
  style recommendations (without the need for Hadoop or Mahout!), a modern
  signal capture framework, analytics, NLP integration, Boosting/Blocking
 and
  other relevance tools, flexible index and query time pipelines as well
 as a
  myriad of connectors ranging from Twitter to web crawling to Sharepoint.
  The best part of all this?  It all leverages the infrastructure that you
  know and love: Solr.  Want recommendations?  Deploy more Solr.  Want log
  analytics?  Deploy more Solr.  Want to track important system metrics?
  Deploy more Solr.
 
  Fusion represents our commitment as a company to continue to contribute
 a
  large quantity of enhancements to the core of Solr while complementing
 and
  extending those capabilities with value adds that integrate a number of
 3rd
  party (e.g., connectors) and home-grown capabilities like an all-new,
  responsive UI built in AngularJS.  Fusion is not a fork of Solr.  We do
 not
  hide Solr in any way.  In fact, our goal is that your existing
 applications
  will work out of the box with Fusion, allowing you to take advantage of
 new
  capabilities w/o overhauling your existing application.
 
  If you want to learn more, please feel free to join our technical
 webinar
  on October 2:
 http://lucidworks.com/blog/say-hello-to-lucidworks-fusion/.
  If you'd like to download: http://lucidworks.com/product/fusion/.
 
  Cheers,
  Grant Ingersoll
 
  
  Grant Ingersoll | CTO
  gr...@lucidworks.com | @gsingers
  http://www.lucidworks.com
 
 

 
 Grant Ingersoll | @gsingers
 http://www.lucidworks.com









Is it possible to cluster on search results but return only clusters?

2014-05-06 Thread Sebastián Ramírez
I have this query / URL

http://example.com:8983/solr/collection1/clustering?q=%28title:%22+Atlantis%22+~100+OR+content:%22+Atlantis%22+~100%29&rows=3001&carrot.snippet=content&carrot.title=title&wt=xml&indent=true&sort=date+DESC

With that, I get the results and also the clustering of those results. What
I want is just the clusters of the results, not the results, because
returning the results is consuming too much bandwidth.

I know I can write a proxy script that gets the response from Solr and
then filters out the results and returns the clusters, but I first want to
check if it's possible with just the parameters of Solr or Carrot.

Thanks in advance,


*Sebastián Ramírez*
Algorithm Designer




Re: Why do people want to deploy to Tomcat?

2013-11-12 Thread Sebastián Ramírez
I agree with Doug. When I started, I had to spend some time figuring out
what was just an example and what I would have to change in a
production environment... until I found that the whole example was ready
for production.

Of course, you commonly have to change the settings, parameters, fields,
etc. of your Solr system, but the example doesn't have anything that is
not for production.


Sebastián Ramírez


On Tue, Nov 12, 2013 at 8:18 AM, Amit Aggarwal amit.aggarwa...@gmail.com wrote:

 Agreed with Doug
 On 12-Nov-2013 6:46 PM, Doug Turnbull 
 dturnb...@opensourceconnections.com
 wrote:

  As an aside, I think one reason people feel compelled to deviate from the
  distributed jetty distribution is because the folder is named "example".
  I've had to explain to a few clients that this is a bit of a misnomer. The
  IT dept especially sees "example" and feels uncomfortable using that as a
  starting point for a jetty install. I wish it was called "default" or "bin"
  or something where it's more obviously the default jetty distribution of
  Solr.
 
 
  On Tue, Nov 12, 2013 at 7:06 AM, Roland Everaert reveatw...@gmail.com
  wrote:
 
   In my case, the first time I had to deploy and configure solr on tomcat
   (and jboss) it was a requirement to reuse as much as possible the
   application/web server already in place. The next deployment I also use
   tomcat, because I was used to deploy on tomcat and I don't know jetty
 at
   all.
  
   I could ask the same question with regard to jetty. Why use/bundle (if
   not recommend) jetty with solr over other webserver solutions?
  
   Regards,
  
  
   Roland Everaert.
  
  
  
   On Tue, Nov 12, 2013 at 12:33 PM, Alvaro Cabrerizo topor...@gmail.com
   wrote:
  
In my case, the selection of the servlet container has never been a
  hard
requirement. I mean, some customers provide us a virtual machine
   configured
with java/tomcat , others have a tomcat installed and want to share
 it
   with
solr, others prefer jetty because their sysadmins are used to
 configure
it...  At least in the projects I've been working in, the selection
 of
   the
servlet engine has not been a key factor in the project success.
   
Regards.
   
   
On Tue, Nov 12, 2013 at 12:11 PM, Andre Bois-Crettez
andre.b...@kelkoo.comwrote:
   
 We are using Solr running on Tomcat.

 I think the top reasons for us are :
  - we already have nagios monitoring plugins for tomcat that trace
 queries ok/error, http codes / response time etc in access logs,
  number
 of threads, jvm memory usage etc
  - start, stop, watchdogs, logs : we also use our standard tools
 for
   that
   - what about security filters? Is that possible with jetty?

 André


 On 11/12/2013 04:54 AM, Alexandre Rafalovitch wrote:

 Hello,

 I keep seeing here and on Stack Overflow people trying to deploy
  Solr
   to
 Tomcat. We don't usually ask why, just help when where we can.

 But the question happens often enough that I am curious. What is
 the
 actual
 business case. Is that because Tomcat is well known? Is it because
   other
 apps are running under Tomcat and it is ops' requirement? Is it
   because
 Tomcat gives something - to Solr - that Jetty does not?

 It might be useful to know. Especially, since Solr team is
  considering
 making the server part into a black box component. What use cases
  will
 that
 break?

 So, if somebody runs Solr under Tomcat (or needed to and gave up),
   let's
 use this thread to collect this knowledge.

 Regards,
 Alex.
 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening
 all
   at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via
 GTD
book)

 --
 André Bois-Crettez

 Software Architect
 Search Developer
 http://www.kelkoo.com/


 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de € 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris

  This message and any attachments are confidential and intended solely
  for their addressees. If you are not the intended recipient of this
  message, please destroy it and notify the sender.

   
  
 
 
 
  --
  Doug Turnbull
  Search  Big Data Architect
  OpenSource Connections http://o19s.com
 



Re: Replica shards not updating their index when update is sent to them

2013-05-29 Thread Sebastián Ramírez
I found how to solve the problem.

After sending a file to be indexed to a replica shard (node2):

curl 'http://node2:8983/solr/update?commit=true' -H 'Content-type:
text/xml' --data-binary '<add><doc><field name="id">asdf</field><field
name="content">big moth</field></doc></add>'

I can send a commit param to the same shard and then it gets updated:

curl 'http://node2:8983/solr/update?commit=true'


Another option is to send, from the beginning, a commitWithin param with
some milliseconds instead of an explicit commit. That way, the commit
happens at most the specified number of milliseconds later, and the changes get
reflected in all shards, including the replica shard that received the
update request:

curl 'http://node2:8983/solr/update?commitWithin=1'


As these emails get archived, I hope this may help someone in the future.

Sebastián Ramírez








Replica shards not updating their index when update is sent to them

2013-05-20 Thread Sebastián Ramírez
Hello,

I'm having a little problem with a test SolrCloud cluster.

I've set up 3 nodes (SolrCores) to use an external Zookeeper. I use 1 shard
and the other 2 SolrCores are being auto-assigned as replicas.

Let's say I have these 3 nodes: the leader shard A, the replica shard B,
and the (other) replica shard C.

I can send queries to any node (A, B or C) and I get the results.

I can send updates to the leader shard (A) and get correct (updated)
results in any of the 3 shards (A, B, or C).

* Here is the problem:
When I send an update to a non-leader (replica) shard (B), the updated
results are reflected in the leader shard (A) and in the other replica
shard (C), but not in the shard that received the update (B). I can do this
same process, send the update to the other non-leader shard (C), and the
same happens, I get the results in the leader (A) and in the other replica
shard (B), but not in the shard that received the update (C).

Any suggestion?

Thanks!

Sebastián Ramírez



Re: Replica shards not updating their index when update is sent to them

2013-05-20 Thread Sebastián Ramírez
Yes, it's happening with the latest version, 4.2.1.

Yes, it's easy to reproduce.
It happened using 3 Virtual Machines and also happened using 3 physical
nodes.


Here are the details:

I installed Hortonworks (a Hadoop distribution) in the 3 nodes. That
installs Zookeeper.

I used the example directory and copied it to the 3 nodes.

I start Zookeeper in the 3 nodes.

The first time, I run this command on each node, to start Solr:  java -jar
-Dbootstrap_conf=true -DzkHost='node1,node2,node3'  start.jar

As I understand, the -Dbootstrap_conf=true uploads the configuration to
Zookeeper, so I don't need to do that the following times that I start each
SolrCore.

So, the following times, I run this on each node: java -jar
-DzkHost='node0,node1,node2' start.jar

Because I ran that command on node0 first, that node became the leader
shard.

I send an update to the leader shard, (in this case node0):
I run curl 'http://node0:8983/solr/update?commit=true' -H 'Content-type:
text/xml' --data-binary '<add><doc><field name="id">asdf</field><field
name="content">buggy</field></doc></add>'

When I query any shard I get the correct result:
I run curl 'http://node0:8983/solr/select?q=id:asdf'
or curl 'http://node1:8983/solr/select?q=id:asdf'
or curl 'http://node2:8983/solr/select?q=id:asdf'
(i.e. I send the query to each node), and then I get the expected response ...
<doc><str name="id">asdf</str><arr name="content"><str>buggy</str></arr>
... </doc>...

But when I send an update to a replica shard (node2) it is updated only in
the leader shard (node0) and in the other replica (node1), not in the shard
that received the update (node2):
I send an update to the replica node2,
I run curl 'http://node2:8983/solr/update?commit=true' -H 'Content-type:
text/xml' --data-binary '<add><doc><field name="id">asdf</field><field
name="content">big moth</field></doc></add>'

Then I query each node and I receive the updated results only from the
leader shard (node0) and the other replica shard (node1).

I run (leader, node0):
curl 'http://node0:8983/solr/select?q=id:asdf'
And I get:
... <doc><str name="id">asdf</str><arr name="content"><str>big moth</str>
</arr> ... </doc> ...

I run (other replica, node1):
curl 'http://node1:8983/solr/select?q=id:asdf'
And I get:
... <doc><str name="id">asdf</str><arr name="content"><str>big moth</str>
</arr> ... </doc> ...

I run (first replica, the one that received the update, node2):
curl 'http://node2:8983/solr/select?q=id:asdf'
And I get (old result):
... <doc><str name="id">asdf</str><arr name="content"><str>buggy</str>
</arr> ... </doc> ...

Thanks for your interest,

Sebastián Ramírez


On Mon, May 20, 2013 at 3:30 PM, Yonik Seeley yo...@lucidworks.com wrote:

 On Mon, May 20, 2013 at 4:21 PM, Sebastián Ramírez
 sebastian.rami...@senseta.com wrote:
  When I send an update to a non-leader (replica) shard (B), the updated
  results are reflected in the leader shard (A) and in the other replica
  shard (C), but not in the shard that received the update (B).

 I've never seen that before.  The replica that received the update
 isn't treated as special in any way by the code, so it's not clear how
 this could happen.

 What version of Solr is this (and does it happen with the latest
 version)?  How easy is this to reproduce for you?

 -Yonik
 http://lucidworks.com




Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

2013-05-10 Thread Sebastián Ramírez
Hello everyone,

I'm having a problem indexing content from opendocument format files. The
files created with OpenOffice and LibreOffice (odt, ods...).

Tika is able to read the files, but Solr is not indexing the content.

It's not a problem of committing or something like that; after I post a file,
it is indexed and all the metadata is indexed/stored, but the content isn't
there.


   - I modified the solrconfig.xml file to catch everything:

<requestHandler name="/update/extract" ...>

<!-- here is the interesting part -->

<!-- <str name="uprefix">ignored_</str> -->
<str name="defaultField">all_txt</str>



   - Then I submitted the file to Solr:

curl '
http://localhost:8983/solr/update/extract?commit=trueliteral.id=newods' -H
'Content-type: application/vnd.oasis.opendocument.spreadsheet'
--data-binary @test_ods.ods



   - Now when I do a search in Solr I get this result, there is something
   in the content, but that's not the actual content of the original file:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">newods</str>
    <arr name="all_txt">
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>0</str>
      <str>P0D</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>0</str>
      <str>application/ods</str>
      <str>0</str>
      <str>7322</str>
      <str>LibreOffice/4.0.2.2$Windows_x86
LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
      <str>2013-05-03T10:02:50.54</str>
    </arr>
    <date name="last_modified">2013-05-03T10:02:50Z</date>
    <arr name="content_type">
      <str>application/vnd.oasis.opendocument.spreadsheet</str>
    </arr>
    <arr name="content">
      <str> ???  Page   ??? (???)  00/00/, 00:00:00  Page  /</str>
    </arr>
    <long name="_version_">1434658995848609792</long>
  </doc>
</result></response>


   - I ask Solr to show me the extracted content from Tika doing this:

curl 'http://localhost:8983/solr/update/extract?extractOnly=true' -H
'Content-type: application/vnd.oasis.opendocument.spreadsheet'
--data-binary @test_ods.ods



   - And I get the XHTML extracted from Tika, including the original file
   contents and that final part that Solr is indeed indexing. So Tika is
   able to read the file, but Solr is not indexing the real content; it
   only indexes the rest:

<body>
<table>
<tr>
<td>
<p>test</p>
</td>
</tr>
<tr>
<td>
<p>de</p>
</td>
</tr>
<tr>
<td>
<p>ods</p>
</td>
</tr>
</table>

<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p>
<p>??? (???)</p>
<p>00/00/, 00:00:00</p>
<p>Page / </p>
</body>

Do any of you know how to fix/workaround this problem?

Thanks!

Sebastián Ramírez



Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

2013-05-10 Thread Sebastián Ramírez
Thanks for your reply Jack!

First: LOL

Second: I'm using the latest version of LibreOffice, but with the
extractOnly param in the Solr request it shows the content of the file, so
Tika is able to read and extract the data, but Solr isn't indexing
that data.

Third: I already did that with no luck, I tried
application/vnd.oasis.opendocument.spreadsheet, application/ods and
application/octet-stream but always got the same result.

Following the documentation for ExtractingRequestHandler
(http://wiki.apache.org/solr/ExtractingRequestHandler#Concepts)
I see that Tika reads the file and feeds it to a SAX ContentHandler, and
Solr then reacts to Tika's SAX events and creates the fields to index. I
think that the problem might be somewhere in that process of feeding the
SAX ContentHandler or in Solr's reaction to those SAX events.

Do you (or anyone else) know how one could configure / debug that SAX
ContentHandler?


Thanks,

Sebastián Ramírez



On Fri, May 10, 2013 at 10:57 AM, Jack Krupansky j...@basetechnology.com wrote:

 Switching to Microsoft Office will probably solve your problem!

 Sorry, I couldn't resist.

 Are you using a really new or really old version of the ODT/ODS software?
 I mean, maybe Tika doesn't have support for that version.

 Check the mime type that Tika generates - maybe you just need to override
 it to force Tika to use the proper format.

 -- Jack Krupansky

 -Original Message- From: Sebastián Ramírez
 Sent: Friday, May 10, 2013 11:24 AM
 To: solr-user@lucene.apache.org
 Subject: Tika not extracting content from ODT / ODS (open document /
 libreoffice) in Solr 4.2.1


 Hello everyone,

 I'm having a problem indexing content from opendocument format files. The
 files created with OpenOffice and LibreOffice (odt, ods...).

 Tika is being able to read the files but Solr is not indexing the content.

 It's not a problem of commiting or something like that, after I post a file
 it is indexed and all the metadata is indexed/stored but the content isn't
 there.


   - I modified the solrconfig.xml file to catch everything:


 requestHandler name=/update/extract...

!-- here is the interesting part --

!-- str name=uprefixignored_/str --
str name=defaultFieldall_txt/**str



   - Then I submitted the file to Solr:


 curl '
 http://localhost:8983/solr/**update/extract?commit=true**
 literal.id=newodshttp://localhost:8983/solr/update/extract?commit=trueliteral.id=newods'
 -H
 'Content-type: application/vnd.oasis.**opendocument.spreadsheet'
 --data-binary @test_ods.ods



   - Now when I do a search in Solr I get this result, there is something

   in the content, but that's not the actual content of the original file:

 result name=response numFound=1 start=0
  doc
str name=idnewods/str
arr name=all_txt
  str1/str
  str2013-05-03T10:02:10.58/**str
  str2013-05-03T10:02:50.54/**str
  str2013-05-03T10:02:50.54/**str
  str1/str
  str2013-05-03T10:02:10.58/**str
  str1/str
  str2013-05-03T10:02:50.54/**str
  str2013-05-03T10:02:50.54/**str
  str0/str
  strP0D/str
  str2013-05-03T10:02:10.58/**str
  str1/str
  str0/str
  strapplication/ods/str
  str0/str
  str7322/str
  strLibreOffice/4.0.2.2$**Windows_x86
 LibreOffice_project/**4c82dcdd6efcd48b1d8bba66bfe198**9deee49c3/str
  str2013-05-03T10:02:50.54/**str
/arr
date name=last_modified2013-05-**03T10:02:50Z/date
arr name=content_type
  strapplication/vnd.oasis.**opendocument.spreadsheet/str
/arr
arr name=content
  str ???  Page   ??? (???)  00/00/, 00:00:00  Page  //str
/arr
long name=_version_**1434658995848609792/long/**
 doc/result/response


   - I ask Solr to show me the extracted content from Tika doing this:


 curl 
 'http://localhost:8983/solr/**update/extract?extractOnly=**truehttp://localhost:8983/solr/update/extract?extractOnly=true'
 -H
 'Content-type: application/vnd.oasis.**opendocument.spreadsheet'
 --data-binary @test_ods.ods



   - And I get the XHTML extracted by Tika, including the original file
   contents and that final part that Solr is indeed indexing. So Tika is
   able to read the file, but Solr is not indexing the real content; it
   only indexes the rest:

 <body>
   <table>
     <tr>
       <td>
         <p>test</p>
       </td>
     </tr>
     <tr>
       <td>
         <p>de</p>
       </td>
     </tr>
     <tr>
       <td>
         <p>ods</p>
       </td>
     </tr>
   </table>

   <p xmlns="http://www.w3.org/1999/xhtml">???</p>
   <p>Page</p>
   <p>??? (???)</p>
   <p>00/00/, 00:00:00</p>
   <p>Page / </p>
 </body>

 Do any of you know how to fix or work around this problem?

 Thanks!

 Sebastián Ramírez

 --
 *This e-mail transmission, including any attachments, is intended only for
 the named recipient(s) and may contain information that is privileged,
 confidential and/or exempt from disclosure under applicable law. If you
 have received this transmission in error, or are not the named
 recipient(s), please notify Senseta immediately by return e-mail and
 permanently delete this transmission, including any attachments.*
Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

2013-05-10 Thread Sebastián Ramírez
Thanks Walter and Alex,

You are right, Walter. In fact, if I'm not wrong, Tika doesn't use an
external parser for those formats as it does with MS Office files or PDFs;
it uses Java ZIP and XML libraries to parse those files directly. I guess
that would be my last resort, but I would certainly prefer to make Tika
process my files without the overhead of building a somewhat complicated
program to extract the contents of the files when, perhaps, Tika could do
that for me.
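That last-resort program can actually stay small: an ODF file (odt/ods/odp)
is a ZIP archive whose body lives in content.xml, so a minimal client-side
extractor — a sketch independent of Solr, with illustrative names — could
look like this:

```python
import zipfile
import xml.etree.ElementTree as ET


def odf_text(path):
    """Return the plain text of an OpenDocument file (odt/ods/odp).

    ODF files are ZIP archives; the document body lives in content.xml.
    Collecting every text node is close to what a dedicated parser would
    return for simple documents (structure and styles are ignored).
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    # itertext() walks the whole tree and yields each node's text
    return " ".join(t.strip() for t in root.itertext() if t.strip())
```

The extracted string could then be posted to Solr as a normal field value,
bypassing Solr Cell entirely.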

I think that could be closely related, Alex. I don't know exactly what the
mapper does, but what you describe seems quite similar. I'm able to
generate the XHTML from Tika with the original document content, but Solr
doesn't index that content from the XHTML.

So, maybe it's a bug in Solr Cell / ExtractingRequestHandler / Tika, right?

Thanks,

Sebastián Ramírez


On Fri, May 10, 2013 at 1:59 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 On Fri, May 10, 2013 at 11:24 AM, Sebastián Ramírez
 sebastian.rami...@senseta.com wrote:
  Hello everyone,
 
  I'm having a problem indexing content from OpenDocument-format files. The
  files were created with OpenOffice and LibreOffice (odt, ods...).


 I wonder if it is connected to
 https://issues.apache.org/jira/browse/SOLR-4530 where the default Tika
 mapper actually keeps very little of the XHTML it gets. I fixed it for
 DIH in 4.3, but haven't looked at the CELL yet.

 Regards,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)




Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

2013-05-10 Thread Sebastián Ramírez
Many thanks, Jack, for your attention and effort in solving the problem.

Best,

Sebastián Ramírez


On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky j...@basetechnology.com wrote:

 I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
 to index the proper content, both for .ODP and .ODT files.

 If I do extractOnly=true&extractFormat=text, I see the extracted text
 clearly in addition to the metadata.

 I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the
 problem. I just see spaces in both cases.

 But whether the problem is due to Solr or to Tika is not apparent.

 In any case, a Jira is warranted.


 -- Jack Krupansky
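 (Spelling out that extractOnly diagnostic — a sketch in which the host and
 port are the Solr example defaults and the helper name is illustrative:)

```python
from urllib.parse import urlencode


def extract_only_url(base="http://localhost:8983/solr", fmt="text"):
    """Build a /update/extract URL that makes Solr return Tika's output
    instead of indexing it; extractFormat=text yields plain text, while
    xml (the default) yields the XHTML shown earlier in the thread."""
    params = {"extractOnly": "true", "extractFormat": fmt}
    return f"{base}/update/extract?{urlencode(params)}"
```

 POSTing the file to that URL (e.g. with curl's --data-binary, as above)
 returns the extraction result without touching the index.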

 -Original Message- From: Sebastián Ramírez
 Sent: Friday, May 10, 2013 11:24 AM
 To: solr-user@lucene.apache.org
 Subject: Tika not extracting content from ODT / ODS (open document /
 libreoffice) in Solr 4.2.1





Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

2013-05-10 Thread Sebastián Ramírez
OK Jack, I'll switch to MS Office... hahaha

Many thanks for your interest and help... and the bug report in JIRA.

Best,

Sebastián Ramírez


On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky j...@basetechnology.com wrote:

 I filed SOLR-4809, "OpenOffice document body is not indexed by
 SolrCell", including some test files.

 https://issues.apache.org/jira/browse/SOLR-4809

 Yeah, at this stage, switching to Microsoft Office seems like the best bet!


 -- Jack Krupansky

 -Original Message- From: Sebastián Ramírez
 Sent: Friday, May 10, 2013 6:33 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Tika not extracting content from ODT / ODS (open document /
 libreoffice) in Solr 4.2.1

