Re: Solr Cell Question
Thanks Erick,

This is how I was doing it, but when I saw the Solr Cell stuff I figured I'd give it a go. What I ended up doing is the following:

    ModifiableSolrParams params = indexer.index(artifact);
    // fmap.<source>=<target>: send the extracted "content" field to my own field
    params.add("fmap.content", "my_custom_field");
    params.add("extractFormat", "text");
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.setParams(params);
    FileStream f = new FileStream(new File());
    up.addContentStream(f);

On Fri, Sep 6, 2013 at 9:54 AM, Erick Erickson erickerick...@gmail.com wrote:

> It's always frustrating when someone replies with "Why not do it a completely different way?". But I will anyway :).
>
> There's no requirement at all that you send things to Solr to make Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ anyway, why not just parse on the client? This has the advantage of allowing you to offload the Tika processing from Solr, which can be quite expensive. You can use the same Tika jars that come with Solr or download whatever version from the Tika project you want. That way, you can exercise much better control over what's done.
>
> Here's a skeletal program with indexing from a DB mixed in, but it shouldn't be hard at all to pull the DB parts out.
>
> http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>
> FWIW,
> Erick
>
> On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote:
>
> > Is it possible to configure Solr Cell to only extract and store the body of a document when indexing? I'm currently doing the following, which I thought would work:
> >
> >     ModifiableSolrParams params = new ModifiableSolrParams();
> >     params.set("defaultField", "content");
> >     params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
> >     ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
> >     up.setParams(params);
> >     FileStream f = new FileStream(new File(..));
> >     up.addContentStream(f);
> >     up.setAction(ACTION.COMMIT, true, true);
> >     solrServer.request(up);
> >
> > But the result of content is as follows:
> >
> >     <arr name="content_mvtxt">
> >       <str/>
> >       <str>null</str>
> >       <str>ISO-8859-1</str>
> >       <str>text/plain; charset=ISO-8859-1</str>
> >       <str>Just a little test</str>
> >     </arr>
> >
> > What I had hoped for was just:
> >
> >     <arr name="content_mvtxt">
> >       <str>Just a little test</str>
> >     </arr>
Re: Solr Cell Question
It's always frustrating when someone replies with "Why not do it a completely different way?". But I will anyway :).

There's no requirement at all that you send things to Solr to make Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ anyway, why not just parse on the client? This has the advantage of allowing you to offload the Tika processing from Solr, which can be quite expensive. You can use the same Tika jars that come with Solr or download whatever version from the Tika project you want. That way, you can exercise much better control over what's done.

Here's a skeletal program with indexing from a DB mixed in, but it shouldn't be hard at all to pull the DB parts out.

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

FWIW,
Erick

On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote:

> Is it possible to configure Solr Cell to only extract and store the body of a document when indexing? I'm currently doing the following, which I thought would work:
>
>     ModifiableSolrParams params = new ModifiableSolrParams();
>     params.set("defaultField", "content");
>     params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
>     ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
>     up.setParams(params);
>     FileStream f = new FileStream(new File(..));
>     up.addContentStream(f);
>     up.setAction(ACTION.COMMIT, true, true);
>     solrServer.request(up);
>
> But the result of content is as follows:
>
>     <arr name="content_mvtxt">
>       <str/>
>       <str>null</str>
>       <str>ISO-8859-1</str>
>       <str>text/plain; charset=ISO-8859-1</str>
>       <str>Just a little test</str>
>     </arr>
>
> What I had hoped for was just:
>
>     <arr name="content_mvtxt">
>       <str>Just a little test</str>
>     </arr>
Solr Cell Question
Is it possible to configure Solr Cell to only extract and store the body of a document when indexing? I'm currently doing the following, which I thought would work:

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("defaultField", "content");
    // Intended to keep only what sits under <body> in the XHTML that Tika produces
    params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.setParams(params);
    FileStream f = new FileStream(new File(..));
    up.addContentStream(f);
    // Commit as part of the request (waitFlush = true, waitSearcher = true)
    up.setAction(ACTION.COMMIT, true, true);
    solrServer.request(up);

But the result of content is as follows:

    <arr name="content_mvtxt">
      <str/>
      <str>null</str>
      <str>ISO-8859-1</str>
      <str>text/plain; charset=ISO-8859-1</str>
      <str>Just a little test</str>
    </arr>

What I had hoped for was just:

    <arr name="content_mvtxt">
      <str>Just a little test</str>
    </arr>
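
For anyone trying to picture what the xpath expression in the question is meant to select, here is a minimal sketch using only the JDK's own XML and XPath classes. It is not Solr Cell or Tika code: Solr Cell hands the expression to Tika, whose XPath support is more limited than the JDK's, so this only shows which nodes the expression targets in a namespace-aware XHTML document. The class name BodyTextXPath and the SAMPLE document are made up for illustration; the real XHTML Tika produced for the test file is not shown in this thread.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BodyTextXPath {
    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // The expression from the question, unchanged.
    public static final String EXPR = "/xhtml:html/xhtml:body/descendant::node()";

    // Hypothetical XHTML for a small plain-text file (an assumption, not real Tika output).
    public static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
      + "<head>"
      + "<meta name=\"Content-Encoding\" content=\"ISO-8859-1\"/>"
      + "<meta name=\"Content-Type\" content=\"text/plain; charset=ISO-8859-1\"/>"
      + "<title></title>"
      + "</head>"
      + "<body><p>Just a little test</p></body>"
      + "</html>";

    // Returns the concatenated text nodes found under <body>, ignoring <head> metadata.
    public static String bodyText(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for the xhtml: prefix to match
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "xhtml" prefix used in EXPR to the XHTML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            NodeList nodes = (NodeList) xpath.evaluate(EXPR, doc, XPathConstants.NODESET);
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                // descendant::node() matches elements and text; keep only the text nodes
                if (n.getNodeType() == Node.TEXT_NODE) {
                    text.append(n.getNodeValue());
                }
            }
            return text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bodyText(SAMPLE)); // prints "Just a little test"
    }
}
```

In this sketch the two meta values in the head never reach the result, which is the behaviour the question was hoping for. Whether Solr Cell behaves the same way depends on how it maps Tika's metadata and content into fields, which this example does not model.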