Re: Solr Cell Question

2013-09-09 Thread Jamie Johnson
Thanks Erick. This is how I was doing it, but when I saw the Solr Cell
stuff I figured I'd give it a go. What I ended up doing is the following:

 ModifiableSolrParams params = indexer.index(artifact);
 params.add("fmap.content", "my_custom_field");
 params.add("extractFormat", "text");

 ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
 up.setParams(params);

 FileStream f = new FileStream(new File());
 up.addContentStream(f);



Re: Solr Cell Question

2013-09-06 Thread Erick Erickson
It's always frustrating when someone replies with "Why not do it
a completely different way?" But I will anyway :).

There's no requirement at all that you send things to Solr to make
Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ
anyway, why not just parse on the client? This has the advantage
of letting you offload the Tika processing from Solr, which can
be quite expensive. You can use the same Tika jars that come
with Solr or download whichever version you want from the Tika
project. That way, you can exercise much better control over
what's done.

Here's a skeletal program with indexing from a DB mixed in, but
it shouldn't be hard at all to pull the DB parts out.

http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
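
As a rough illustration of the client-side idea (this is not the code from the linked article), the sketch below extracts the body text on the client and builds a field map that holds only an id and the body. It deliberately uses only the JDK so it runs as-is. The BodyExtractor hook, the PLAIN_TEXT stand-in, the "id" field, and the class name are all assumptions made for the example; only the content_mvtxt field name comes from this thread. With Tika on the classpath, the stand-in could presumably be swapped for a call such as new org.apache.tika.Tika().parseToString(file).

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

class ClientSideExtract {

    /** Hypothetical hook: turns a file into its body text. */
    interface BodyExtractor {
        String extract(File file) throws Exception;
    }

    /**
     * Stand-in extractor that only handles plain-text files (assumed UTF-8).
     * With Tika available, a lambda like
     *   file -> new org.apache.tika.Tika().parseToString(file)
     * could be used in its place.
     */
    static final BodyExtractor PLAIN_TEXT =
        file -> new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8).trim();

    /** Builds the fields for one document: an id plus the body text, no metadata. */
    static Map<String, Object> buildFields(String id, File file, BodyExtractor extractor)
            throws Exception {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", id);                                  // assumed unique key field
        fields.put("content_mvtxt", extractor.extract(file));  // body text only
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Write a small sample file so the sketch is runnable on its own.
        File sample = File.createTempFile("solr-cell-demo", ".txt");
        sample.deleteOnExit();
        Files.write(sample.toPath(), "Just a little test\n".getBytes(StandardCharsets.UTF_8));

        Map<String, Object> fields = buildFields("doc-1", sample, PLAIN_TEXT);
        System.out.println(fields);
    }
}
```

From here, each entry could be copied into a SolrInputDocument with addField(name, value) and sent through SolrJ, so Solr never sees the metadata values at all.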

FWIW,
Erick





Solr Cell Question

2013-09-05 Thread Jamie Johnson
Is it possible to configure Solr Cell to only extract and store the body of
a document when indexing? I'm currently doing the following, which I
thought would work:

 ModifiableSolrParams params = new ModifiableSolrParams();
 params.set("defaultField", "content");
 params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");

 ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
 up.setParams(params);

 FileStream f = new FileStream(new File(..));
 up.addContentStream(f);

 up.setAction(ACTION.COMMIT, true, true);
 solrServer.request(up);


But the resulting content is as follows:

 <arr name="content_mvtxt">
   <str/>
   <str>null</str>
   <str>ISO-8859-1</str>
   <str>text/plain; charset=ISO-8859-1</str>
   <str>Just a little test</str>
 </arr>


What I had hoped for was just:

 <arr name="content_mvtxt">
   <str>Just a little test</str>
 </arr>