Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread xavi jmlucjav
I guess there is no other way than to reindex:
- of course, not all fields are stored, that would have been too easy
- it might (??) work if, as Jan says, I build a custom Solr version with the
removed IntFields added back etc., but going down this rabbit hole sounds too
risky and like too much work; I'm not sure it would eventually work,
especially considering the last point:
- I did not get any response to this, but my understanding now is that you
cannot take a standalone Solr core's /data (without a _version_ field) and
put it into a SolrCloud setup, as _version_ is needed.

xavier

On Mon, Sep 26, 2016 at 9:21 PM, Jan Høydahl <j...@cominvent.com> wrote:

> If all the fields in your current schema have stored=“true”, you can try to
> export
> the full index to an XML file which can then be imported into 6.1.
> If some fields are not stored you will only be able to recover the
> inverted index
> representation of that data, which may not be enough to recreate the
> original
> data (or in some cases maybe it is enough).
>
> If you share a copy of your old schema.xml we may be able to help.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 26. sep. 2016 kl. 20.39 skrev Shawn Heisey <apa...@elyograg.org>:
> >
> > On 9/26/2016 6:28 AM, xavi jmlucjav wrote:
> >> Yes, I had to change some fields, basically to use TrieIntField etc
> >> instead
> >> of the old IntField. I was assuming by using the IndexUpgrader to
> upgrade
> >> the data to 6.1, the older IntField would work with the new
> TrieIntField.
> >> But I have tried loading the upgraded data into a standalone 6.1 and I
> am
> >> hitting the same issue, so this is not related to _version_ field (more
> on
> >> that below). Forget about solrcloud for now, having an old 3.6 index,
> >> should it be possible to use IndexUpgrader and load it on 6.1? How would
> >> one need to handle IntFields etc?
> >
> > The only option when you change the class on a field in your schema is
> > to wipe the index and rebuild it.  TrieIntField uses a completely
> > different on-disk data format than IntField did.  The two formats simply
> > aren't compatible.  This is not a bug, it's a fundamental fact of Lucene
> > indexes.
> >
> > Lucene doesn't use a schema -- that's a Solr concept.  IndexUpgrader is
> > a Lucene program that doesn't know what kind of data each field
> > contains, it just reaches down into the old index format, grabs the
> > internal data in each field, and copies it to a new index using the new
> > format.  The internal data must still be consistent with the Lucene
> > program for the index to work in a new version.  When you're running
> > Solr, it uses the schema to know how to read the index.
> >
> > In 5.x and 6.x, IntField does not exist, and attempting to read that
> > data using TrieIntField will not work.
> >
> > The luceneMatchVersion setting in solrconfig.xml can cause certain
> > components (tokenizers and filters mainly) to revert to old behavior in
> > the previous major version.  Version 6.x doesn't hold onto behavior from
> > 3.x and 4.x -- it can only revert behavior back to 5.x versions.
> >
> > The luceneMatchVersion setting cannot bring back removed classes like
> > IntField, and it does NOT affect the on-disk index format.
> >
> > Your particular situation will require a full reindex.  It is not
> > possible to upgrade an index using those old class types.
> >
> > Thanks,
> > Shawn
> >
>
>


Re: issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-26 Thread xavi jmlucjav
Hi Shawn/Jan,

On Sun, Sep 25, 2016 at 6:18 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 9/25/2016 4:24 AM, xavi jmlucjav wrote:
> > Everything went well, no errors when solr restarted, the collections
> shows
> > the right number of docs. But when I try to run a query, I get:
> >
> > null:java.lang.NullPointerException
>
> Did you change any of the fieldType class values as you adjusted the
> schema for the upgrade?  A number of classes that were valid and
> deprecated in 3.6 and 4.x were completely removed by 5.x, and 6.x
> probably removed a few more.
>

Yes, I had to change some fields, basically to use TrieIntField etc. instead
of the old IntField. I was assuming that, by using IndexUpgrader to upgrade
the data to 6.1, the old IntField data would work with the new TrieIntField.
But I have tried loading the upgraded data into a standalone 6.1 and I am
hitting the same issue, so this is not related to the _version_ field (more on
that below). Forget about solrcloud for now: having an old 3.6 index,
should it be possible to use IndexUpgrader and load it on 6.1? How would
one need to handle IntFields etc.?



>
> If you did make changes like this to your schema, then what's in the
> index will no longer match the schema, and the *only* option is a
> reindex.  Exceptions are likely if you don't reindex after schema
> changes to the class value(s) or the index analyzer(s).
>
> Regarding the _version_ field:  SolrCloud expects this field to be in
> your schema.  It might also expect that every document in the index
> will already contain a value in this field.  Adding _version_ to your
> schema should be treated similarly to the changes mentioned above -- a
> reindex is required for proper operation.
>
> Even if the schema didn't change in a way that *requires* a reindex ...
> the number of changes to the analysis components across three major
> version jumps is quite large.  Solr might not work as expected because
> of those changes unless you reindex, even if you don't see any
> exceptions.  Changes to your schema because of changes in analysis
> component behavior might  be required -- which is another situation that
> usually requires a reindex.
>
> Because of these potential problems, I always start a new Solr version
> with no index data and completely rebuild my indexes after an upgrade.
> That is the best way to ensure success.
>

I am totally aware of all the advantages of reindexing, sure. And that is
what I always do; this time, though, it seems the original data is not
available...


> You referenced a mailing list thread where somebody had success
> converting non-cloud to cloud... but that was on version 4.8.1, two
> major versions back from the version you're running.  They also did not
> upgrade major versions -- from some things they said at the beginning of
> the thread, I know that the source version was at least 4.4.  The thread
> didn't mention any schema changes, either.
>
> If the schema doesn't change at all, moving from non-cloud to cloud is
> very possible, but if the schema changes, the index data might not match
> the schema any more, and that situation will not work.
>
> Since you jumped three major versions, it's almost guaranteed that your
> schema *did* change, and the changes may have been more extensive than
> just adding the _version_ field.
>
> It's possible that there's a problem when converting a non-cloud install
> with no _version_ field to a cloud install where the only schema change
> is adding the _version_ field.  We can treat THAT situation as a bug,
> but if there are other schema changes besides adding _version_, the
> exception you encountered is most likely not a bug.
>


There are two orthogonal issues here:
A. moving to SolrCloud from standalone without reindexing, and without
having a _version_ field already indexed, of course. Is this even possible?
From the thread above, I understood it was possible, but you say that
SolrCloud expects _version_ to be there, with values, so that makes this
move impossible without reindexing. This should be made clear
somewhere in the docs. I understand it is not a frequent scenario, but it
will be a deal breaker when it happens. So far the only thing I found is the
aforementioned thread, which, if I am not misreading it, makes it sound as if
it will work ok.

B. upgrading from a very old 3.6 version to 6.1 without reindexing: it
seems I am hitting an issue with this first. Even if this were resolved, I
would not be able to achieve my goal due to A, but it would be good to know
how to get this done too, if possible.

Jan: I tried tweaking luceneMatchVersion too, no luck though.
xavier


>
> Thanks,
> Shawn
>
>


issue transplanting standalone core into solrcloud (plus upgrade)

2016-09-25 Thread xavi jmlucjav
Hi,

I have an existing 3.6 standalone installation. It has to be moved to
Solrcloud 6.1.0. Reindexing is not an option, so I did the following:

- Use IndexUpgrader to upgrade 3.6 -> 4.4 -> 5.5 (a sketch follows this
list). I did not upgrade to 6.x as 5.5 should be readable by 6.x
- Install solrcloud 6.1 cluster
- modify schema/solrconfig for cloud support (add _version_, tlog etc)
- follow the method mentioned here
http://lucene.472066.n3.nabble.com/Copy-existing-index-from-standalone-Solr-to-Solr-cloud-td4149920.html
I did not find any other doc on how to transplant a standalone core into
SolrCloud.
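
For reference, that upgrade chain boils down to something like the following
(jar names and the index path are illustrative and have to match the Lucene
versions actually downloaded):

  # 3.6 -> 4.4
  java -cp lucene-core-4.4.0.jar \
    org.apache.lucene.index.IndexUpgrader -delete-prior-commits /path/to/core/data/index

  # 4.4 -> 5.5 (the 4.x codecs live in the backward-codecs module in 5.x)
  java -cp lucene-core-5.5.0.jar:lucene-backward-codecs-5.5.0.jar \
    org.apache.lucene.index.IndexUpgrader -delete-prior-commits /path/to/core/data/index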

Everything went well: no errors when Solr restarted, and the collection shows
the right number of docs. But when I try to run a query, I get:

null:java.lang.NullPointerException
at
org.apache.lucene.util.LegacyNumericUtils.prefixCodedToLong(LegacyNumericUtils.java:189)
at org.apache.solr.schema.TrieField.toObject(TrieField.java:155)
at org.apache.solr.schema.TrieField.write(TrieField.java:324)
at
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:133)
at
org.apache.solr.response.JSONWriter.writeSolrDocument(JSONResponseWriter.java:345)
at
org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:249)
at
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:151)
at
org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
at
org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
at
org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
at
org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
at
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:65)
at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:731)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:473)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)

I was wondering how the non-existence of the _version_ field would be
handled, but that thread above said it would work.
Can anyone shed some light?

thanks


Re: cursorMark and CSVResponseWriter for mass reindex

2016-06-21 Thread xavi jmlucjav
Hi Erick,

Ah, yes, I guess you are correct that I could just avoid using cursorMark
this way... the only (smallish, I think) issue is that I would need to
extract the last id from the CSV output. Oh, and I am using Datastax
DSE, so the uniqueKey is a combination of two fields... but I think I can
manage to use a field I know is unique, even if it's not the uniqueKey.

thanks!
Xavier



On Tue, Jun 21, 2016 at 2:13 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> The CursorMark stuff has to deal with shards, what happens when more
> than one document on different shards has the same sort value, what
> if all the docs in the response packet have the same sort value, what
> happens when you want to return docs by score and the like.
>
> For your case you can use a sort criteria that avoids all these issues and
> be
> OK. You can think of it as a specialized CursorMark.
>
> You should be able to just sort by <uniqueKey>
> and send each query through with a range filter query,
> so the first query would look something like (assuming "id" is your
> <uniqueKey>)
>
> q=*:*&sort=id asc&start=0&rows=1000
> then the rest would be
> q=*:*&sort=id asc&fq={!cache=false}id:[last_id_returned_from_previous_query
> TO *]&start=0&rows=1000
>
> this avoids the "deep paging" problem that CursorMark solves more cheaply
> because the <uniqueKey> guarantees that there is one and only one doc with
> that value. Note that the start parameter is always 0.
>
> Or your second query could even be just
> q=id:[last_id_returned_from_previous_query TO *]&sort=id
> asc&start=0&rows=1000
>
> Best,
> Erick
>
> On Mon, Jun 20, 2016 at 12:37 PM, xavi jmlucjav <jmluc...@gmail.com>
> wrote:
> > Hi,
> >
> > I need to index into a new schema 800M docs, that exist in an older solr.
> > As all fields are stored, I thought I was very lucky as I could:
> >
> > - use wt=csv
> > - combined with cursorMark
> >
> > to easily script out something that would export/index in chunks of 1M
> docs
> > or something. CSV output being very efficient for this sort of thing, I
> > think.
> >
> > But, sadly I found that there is no way to get the nextcursorMark after
> the
> > first request, as the csvwriter just outputs plain csv info of the
> fields,
> > excluding all other info on the response!!!
> >
> > This is so unfortunate, as csv/cursorMark seem like the perfect fit to
> > reindex this huge index (it's a one time thing).
> >
> > Does anyone see some way to still be able to use this? I would prefer not
> > having to write some java code just to get the nextcursorMark.
> >
> > So far I thought of:
> > - use json, but I need to postprocess returned json to remove the
> response
> > info etc, before reindexing, a pain.
> > - send two calls for each chunk (sending the same cursormark both times),
> > one wt=csv to get the data, another wt=json to get cursormark (and ignore
> > the data, maybe using fl=id only to avoid getting much data). I did some
> > test and this seems should work.
> >
> > I guess I will go with the 2nd, but anyone has a better idea?
> > thanks
> > xavier
>
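
For reference, Erick's range-filter loop above could be scripted roughly like
this; it is only a sketch, assuming "id" is the uniqueKey, that ids contain no
commas, that field values contain no embedded newlines, and that the URL,
field list and chunk size are adjusted to the real setup:

  SOLR="http://localhost:8983/solr/oldcollection/select"
  ROWS=1000
  LAST=""
  > export.csv
  while true; do
    if [ -z "$LAST" ]; then
      FQ="*:*"
    else
      # exclusive lower bound so the previous chunk's last doc is not repeated
      FQ="{!cache=false}id:{$LAST TO *]"
    fi
    CHUNK=$(curl -s -G "$SOLR" \
              --data-urlencode "q=*:*" --data-urlencode "fq=$FQ" \
              --data-urlencode "sort=id asc" --data-urlencode "start=0" \
              --data-urlencode "rows=$ROWS" --data-urlencode "wt=csv" \
              --data-urlencode "csv.header=false" \
              --data-urlencode "fl=id,field1,field2")
    [ -z "$CHUNK" ] && break                          # nothing left
    echo "$CHUNK" >> export.csv
    LAST=$(echo "$CHUNK" | tail -n 1 | cut -d, -f1)   # id is the first column in fl
    [ $(echo "$CHUNK" | wc -l) -lt "$ROWS" ] && break # short chunk == last chunk
  done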


cursorMark and CSVResponseWriter for mass reindex

2016-06-20 Thread xavi jmlucjav
Hi,

I need to index into a new schema 800M docs, that exist in an older solr.
As all fields are stored, I thought I was very lucky as I could:

- use wt=csv
- combined with cursorMark

to easily script out something that would export/index in chunks of 1M docs
or something. CSV output is very efficient for this sort of thing, I
think.

But, sadly, I found that there is no way to get the nextCursorMark after the
first request, as the CSV writer just outputs plain CSV info for the fields,
excluding all other info in the response!!!

This is so unfortunate, as csv/cursorMark seem like the perfect fit to
reindex this huge index (it's a one time thing).

Does anyone see some way to still be able to use this? I would prefer not
having to write some java code just to get the nextcursorMark.

So far I thought of:
- use json, but I would need to postprocess the returned json to remove the
response info etc. before reindexing, a pain.
- send two calls for each chunk (sending the same cursorMark both times),
one wt=csv to get the data, another wt=json to get the cursorMark (and ignore
the data, maybe using fl=id only to avoid getting much data). I did some
tests and it seems this should work.

I guess I will go with the 2nd, but does anyone have a better idea?
thanks
xavier
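
A rough sketch of option 2 above (the two-call approach), assuming "id" is
the uniqueKey; the URL, field list and chunk size are illustrative, and jq is
used here only to pull nextCursorMark out of the JSON response:

  SOLR="http://localhost:8983/solr/oldcollection/select"
  CURSOR="*"
  > export.csv
  while true; do
    # call 1: wt=json with fl=id only, just to learn the next cursor mark
    NEXT=$(curl -s -G "$SOLR" --data-urlencode "q=*:*" --data-urlencode "sort=id asc" \
             --data-urlencode "rows=1000000" --data-urlencode "fl=id" \
             --data-urlencode "wt=json" --data-urlencode "cursorMark=$CURSOR" \
           | jq -r '.nextCursorMark')
    # call 2: same cursorMark with wt=csv to get the actual data
    curl -s -G "$SOLR" --data-urlencode "q=*:*" --data-urlencode "sort=id asc" \
         --data-urlencode "rows=1000000" --data-urlencode "fl=id,field1,field2" \
         --data-urlencode "wt=csv" --data-urlencode "csv.header=false" \
         --data-urlencode "cursorMark=$CURSOR" >> export.csv
    # the cursor stops changing once the result set is exhausted
    [ "$NEXT" = "$CURSOR" ] && break
    CURSOR="$NEXT"
  done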


issues using BlendedInfixLookupFactory in solr5.5

2016-03-31 Thread xavi jmlucjav
Hi,

I have been working with
AnalyzingInfixLookupFactory/BlendedInfixLookupFactory in 5.5.0, and I have
a number of questions/comments, hopefully I get some insight into this:

- Docs not complete/up-to-date:
  - the blenderType param does not accept the 'linear' value (it did in 5.3); I
    commented it out as it's the default.
  - it should be mentioned that contextField must be a stored field
- if the field used is whitespace tokenized, and you search for 'one t',
the suggestions are sorted by weight, not score. So if you give a constant
score to all docs, you might get this:
1. one four two
2. one two four
  Would taking the score into account (something not done yet but could be
done according to something I saw in code/jira) return 2,1 instead of 1,2?
My guess is it would, correct?
- what would we need in order to return the score too? Could it be done
easily, along with the payload or something?
- would it be possible to make BlendedInfixLookupFactory allow for some
fuzziness a la FuzzyLookupFactory?
- when building a big suggester, it can take a long time; you just send a
request with suggest.build=true and wait. Is there any possible way to
monitor the progress of this? I did not find one. (See the example build
request after this list.)
- for weightExpression, one typical use case would be to provide the users'
lat/lon to weight the suggestions by proximity, is this somehow feasible?
What would be needed?
- does SolrCloud fully support suggesters? If so, does each shard build
its own suggester, and does it work just like a normal distributed search?
- I filed SOLR-8928 (suggest.cfq does not work with
DocumentExpressionDictionaryFactory/weightExpression) as I found that combo
not working.
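
The kind of build request meant above would look something like this (the
handler path and dictionary name are assumptions):

  curl "http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true"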

regards
xavi


Re: Stopping Solr JVM on OOM

2016-03-19 Thread xavi jmlucjav
In order to force an OOM do this (sketched below):

- index a sizable amount of docs with a normal -Xmx; if you already have 350k
docs indexed, that should be enough
- now stop Solr, decrease the heap to something like -Xmx15m, start it, and
run a query with a facet on a field with very high cardinality, asking for
all facets. If that is not enough, add another facet field, etc. This is a
sure way to get an OOM
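
For reference, those steps look roughly like this (core name, facet field and
heap size are illustrative):

  bin/solr stop -all
  bin/solr start -m 15m     # tiny heap
  # facet on a very high cardinality field, asking for every bucket
  curl "http://localhost:8983/solr/mycore/select?q=*:*&rows=0&facet=true&facet.field=high_cardinality_field&facet.limit=-1"
  # then check the console/GC log for java.lang.OutOfMemoryError and whether
  # the configured oom killer script actually fired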

On Mon, Mar 14, 2016 at 9:42 AM, Binoy Dalal  wrote:

> I set the heap to 16 mb and tried to index about 350k records using a DIH.
> This did throw an OOM for that particular thread in the console, but the
> oom script wasn't called and solr was running properly.
> Moreover, solr also managed to index all 350k records.
>
> Is this the correct way to go about getting solr to throw an oom?
> If so where did I go wrong?
> If not, what other alternative is there?
>
> Thanks.
>
> PS. I tried to start solr with really low memory (abt. 2k) but that just
> threw an error saying too small a heap and the JVM didn't start at all.
>
> On Mon, 14 Mar 2016, 07:57 Shawn Heisey,  wrote:
>
> > On 3/13/2016 8:13 PM, Binoy Dalal wrote:
> > > I made the necessary changes to that oom script?
> > > How does it look now?
> > > Also can you suggest some way of testing it with solr?
> > > How do I make solr oom on purpose?
> >
> > Set the java heap really small.  Not entirely sure what value to use.
> > I'd probably start with 32m and work my way down.  With a small enough
> > heap, you could probably produce OOM without even trying to USE Solr.
> >
> > Thanks,
> > Shawn
> >
> > --
> Regards,
> Binoy Dalal
>


Re: How is Tika used with Solr

2016-02-12 Thread xavi jmlucjav
Of course, but that code is very tricky, so if the extraction library takes
care of all that, it's a huge gain. The Aperture library I used worked very
well in that regard, and even though it did not use processes as Timothy
says, it never got stuck if I remember correctly.

On Fri, Feb 12, 2016 at 1:46 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Well, I'd imagine you could spawn threads and monitor/kill them as
> necessary, although that doesn't deal with OOM errors
>
> FWIW,
> Erick
>
> On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav <jmluc...@gmail.com> wrote:
> > For sure, if I need heavy duty text extraction again, Tika would be the
> > obvious choice if it covers dealing with hangs. I never used tika-server
> > myself (not sure if it existed at the time) just used tika from my own
> jvm.
> >
> > On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. <talli...@mitre.org
> >
> > wrote:
> >
> >> x-post to Tika user's
> >>
> >> Y and n.  If you run tika app as:
> >>
> >> java -jar tika-app.jar <input_directory> <output_directory>
> >>
> >> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).
> This
> >> creates a parent and child process, if the child process notices a hung
> >> thread, it dies, and the parent restarts it.  Or if your OS gets upset
> with
> >> the child process and kills it out of self preservation, the parent
> >> restarts the child, or if there's an OOM...and you can configure how
> often
> >> the child shuts itself down (with parental restarting) to mitigate
> memory
> >> leaks.
> >>
> >> So, y, if your use case allows <input_directory> -> <output_directory>, then we
> >> now have that in Tika.
> >>
> >> I've been wanting to add a similar watchdog to tika-server ... any
> >> interest in that?
> >>
> >>
> >> -Original Message-
> >> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
> >> Sent: Thursday, February 11, 2016 2:16 PM
> >> To: solr-user <solr-user@lucene.apache.org>
> >> Subject: Re: How is Tika used with Solr
> >>
> >> I have found that when you deal with large amounts of all sort of files,
> >> in the end you find stuff (pdfs are typically nasty) that will hang
> tika.
> >> That is even worse that a crash or OOM.
> >> We used aperture instead of tika because at the time it provided a
> >> watchdog feature to kill what seemed like a hanged extracting thread.
> That
> >> feature is super important for a robust text extracting pipeline. Has
> Tika
> >> gained such feature already?
> >>
> >> xavier
> >>
> >> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson <
> erickerick...@gmail.com>
> >> wrote:
> >>
> >> > Timothy's points are absolutely spot-on. In production scenarios, if
> >> > you use the simple "run Tika in a SolrJ program" approach you _must_
> >> > abort the program on OOM errors and the like and  figure out what's
> >> > going on with the offending document(s). Or record the name somewhere
> >> > and skip it next time 'round. Or
> >> >
> >> > How much you have to build in here really depends on your use case.
> >> > For "small enough"
> >> > sets of documents or one-time indexing, you can get by with dealing
> >> > with errors one at a time.
> >> > For robust systems where you have to have indexing available at all
> >> > times and _especially_ where you don't control the document corpus,
> >> > you have to build something far more tolerant as per Tim's comments.
> >> >
> >> > FWIW,
> >> > Erick
> >> >
> >> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> >> > <talli...@mitre.org>
> >> > wrote:
> >> > > I completely agree on the impulse, and for the vast majority of the
> >> > > time
> >> > (regular catchable exceptions), that'll work.  And, by vast majority,
> >> > aside from oom on very large files, we aren't seeing these problems
> >> > any more in our 3 million doc corpus (y, I know, small by today's
> >> > standards) from
> >> > govdocs1 and Common Crawl over on our Rackspace vm.
> >> > >
> >> > > Given my focus on Tika, I'm overly sensitive to the worst case
> >> > scenarios.  I find it encouraging, Erick, that you haven't seen these
> >> > types of problems, that users aren't complaining too

Re: How is Tika used with Solr

2016-02-11 Thread xavi jmlucjav
I have found that when you deal with large amounts of all sorts of files, in
the end you find stuff (PDFs are typically nasty) that will hang Tika. That
is even worse than a crash or OOM.
We used Aperture instead of Tika because at the time it provided a watchdog
feature to kill what looked like a hung extracting thread. That feature
is super important for a robust text extraction pipeline. Has Tika gained
such a feature already?
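
(For what it's worth, one crude way to approximate such a watchdog from
outside the extraction library is to run each extraction as its own process
and kill it if it hangs; the 120s limit and the -t plain-text flag below are
just illustrative, not a recommendation.)

  timeout 120 java -jar tika-app.jar -t /path/to/doc.pdf > doc.txt \
    || echo "extraction failed or hung: doc.pdf" >> failed.log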

xavier

On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson 
wrote:

> Timothy's points are absolutely spot-on. In production scenarios, if
> you use the simple
> "run Tika in a SolrJ program" approach you _must_ abort the program on
> OOM errors
> and the like and  figure out what's going on with the offending
> document(s). Or record the
> name somewhere and skip it next time 'round. Or
>
> How much you have to build in here really depends on your use case.
> For "small enough"
> sets of documents or one-time indexing, you can get by with dealing
> with errors one at a time.
> For robust systems where you have to have indexing available at all
> times and _especially_
> where you don't control the document corpus, you have to build
> something far more
> tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B. 
> wrote:
> > I completely agree on the impulse, and for the vast majority of the time
> (regular catchable exceptions), that'll work.  And, by vast majority, aside
> from oom on very large files, we aren't seeing these problems any more in
> our 3 million doc corpus (y, I know, small by today's standards) from
> govdocs1 and Common Crawl over on our Rackspace vm.
> >
> > Given my focus on Tika, I'm overly sensitive to the worst case
> scenarios.  I find it encouraging, Erick, that you haven't seen these types
> of problems, that users aren't complaining too often about catastrophic
> failures of Tika within Solr Cell, and that this thread is not yet swamped
> with integrators agreeing with me. :)
> >
> > However, because oom can leave memory in a corrupted state (right?),
> because you can't actually kill a thread for a permanent hang and because
> Tika is a kitchen sink and we can't prevent memory leaks in our
> dependencies, one needs to be aware that bad things can happen...if only
> very, very rarely.  For a fellow traveler who has run into these issues on
> massive data sets, see also [0].
> >
> > Configuring Hadoop to work around these types of problems is not too
> difficult -- it has to be done with some thought, though.  On conventional
> single box setups, the ForkParser within Tika is one option, tika-batch is
> another.  Hand rolling your own parent/child process is non-trivial and is
> not necessary for the vast majority of use cases.
> >
> >
> > [0]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> >
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, February 09, 2016 10:05 PM
> > To: solr-user 
> > Subject: Re: How is Tika used with Solr
> >
> > My impulse would be to _not_ run Tika in its own JVM, just catch any
> exceptions in my code and "do the right thing". I'm not sure I see any real
> benefit in yet another JVM.
> >
> > FWIW,
> > Erick
> >
> > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. 
> wrote:
> >> I have one answer here [0], but I'd be interested to hear what Solr
> users/devs/integrators have experienced on this topic.
> >>
> >> [0]
> >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1P
> >> R09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlo
> >> ok.com%3E
> >>
> >> -Original Message-
> >> From: Steven White [mailto:swhite4...@gmail.com]
> >> Sent: Tuesday, February 09, 2016 6:33 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: How is Tika used with Solr
> >>
> >> Thank you Erick and Alex.
> >>
> >> My main question is with a long running process using Tika in the same
> JVM as my application.  I'm running my file-system-crawler in its own JVM
> (not Solr's).  On Tika mailing list, it is suggested to run Tika's code in
> it's own JVM and invoke it from my file-system-crawler using
> Runtime.getRuntime().exec().
> >>
> >> I fully understand from Alex suggestion and link provided by Erick to
> use Tika outside Solr.  But what about using Tika within the same JVM as my
> file-system-crawler application or should I be making a system call to
> invoke another JAR, that runs in its own JVM to extract the raw text?  Are
> there known issues with Tika when used in a long running process?
> >>
> >> Steve
> >>
> >>
>


Re: How is Tika used with Solr

2016-02-11 Thread xavi jmlucjav
For sure, if I need heavy-duty text extraction again, Tika would be the
obvious choice if it covers dealing with hangs. I never used tika-server
myself (not sure if it existed at the time), just used Tika from my own JVM.

On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> x-post to Tika user's
>
> Y and n.  If you run tika app as:
>
> java -jar tika-app.jar <input_directory> <output_directory>
>
> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).  This
> creates a parent and child process, if the child process notices a hung
> thread, it dies, and the parent restarts it.  Or if your OS gets upset with
> the child process and kills it out of self preservation, the parent
> restarts the child, or if there's an OOM...and you can configure how often
> the child shuts itself down (with parental restarting) to mitigate memory
> leaks.
>
> So, y, if your use case allows <input_directory> -> <output_directory>, then we now have
> that in Tika.
>
> I've been wanting to add a similar watchdog to tika-server ... any
> interest in that?
>
>
> -Original Message-
> From: xavi jmlucjav [mailto:jmluc...@gmail.com]
> Sent: Thursday, February 11, 2016 2:16 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: How is Tika used with Solr
>
> I have found that when you deal with large amounts of all sort of files,
> in the end you find stuff (pdfs are typically nasty) that will hang tika.
> That is even worse that a crash or OOM.
> We used aperture instead of tika because at the time it provided a
> watchdog feature to kill what seemed like a hanged extracting thread. That
> feature is super important for a robust text extracting pipeline. Has Tika
> gained such feature already?
>
> xavier
>
> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Timothy's points are absolutely spot-on. In production scenarios, if
> > you use the simple "run Tika in a SolrJ program" approach you _must_
> > abort the program on OOM errors and the like and  figure out what's
> > going on with the offending document(s). Or record the name somewhere
> > and skip it next time 'round. Or
> >
> > How much you have to build in here really depends on your use case.
> > For "small enough"
> > sets of documents or one-time indexing, you can get by with dealing
> > with errors one at a time.
> > For robust systems where you have to have indexing available at all
> > times and _especially_ where you don't control the document corpus,
> > you have to build something far more tolerant as per Tim's comments.
> >
> > FWIW,
> > Erick
> >
> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
> > <talli...@mitre.org>
> > wrote:
> > > I completely agree on the impulse, and for the vast majority of the
> > > time
> > (regular catchable exceptions), that'll work.  And, by vast majority,
> > aside from oom on very large files, we aren't seeing these problems
> > any more in our 3 million doc corpus (y, I know, small by today's
> > standards) from
> > govdocs1 and Common Crawl over on our Rackspace vm.
> > >
> > > Given my focus on Tika, I'm overly sensitive to the worst case
> > scenarios.  I find it encouraging, Erick, that you haven't seen these
> > types of problems, that users aren't complaining too often about
> > catastrophic failures of Tika within Solr Cell, and that this thread
> > is not yet swamped with integrators agreeing with me. :)
> > >
> > > However, because oom can leave memory in a corrupted state (right?),
> > because you can't actually kill a thread for a permanent hang and
> > because Tika is a kitchen sink and we can't prevent memory leaks in
> > our dependencies, one needs to be aware that bad things can
> > happen...if only very, very rarely.  For a fellow traveler who has run
> > into these issues on massive data sets, see also [0].
> > >
> > > Configuring Hadoop to work around these types of problems is not too
> > difficult -- it has to be done with some thought, though.  On
> > conventional single box setups, the ForkParser within Tika is one
> > option, tika-batch is another.  Hand rolling your own parent/child
> > process is non-trivial and is not necessary for the vast majority of use
> cases.
> > >
> > >
> > > [0]
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> > eb-content-nanite/
> > >
> > >
> > >
> > > -Original Message-
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Tue

Re: Json Facet api on nested doc

2015-11-24 Thread xavi jmlucjav
Mikahil, Yonik

thanks for having a look. This was my bad all along... I forgot I was on
5.2.1 instead of 5.3.1 on this setup!! It seems some things were not there
yet in 5.2.1; I just upgraded to 5.3.1 and my query works perfectly.

Although I do agree with Mikhail that the docs on this feature are a bit
light, it is understandable, as the feature is quite new.

thanks
xavi

On Mon, Nov 23, 2015 at 9:24 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Indeed! Now it works for me too. JSON Facets seems powerful, but not
> friendly to me.
> Yonik, thanks for example!
>
> Xavi,
>
> I took  json docs from http://yonik.com/solr-nested-objects/ and just
> doubled book2_c3
>
> Here is what I have with json.facet={catz: {type:terms,field:cat_s,
> facet:{ starz:{type:terms, field:stars_i,
> domain:{blockChildren:'type_s:book'}} }}}
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 2,
> "params": {
>   "q": "publisher_s:*",
>   "json.facet": "{catz: {type:terms,field:cat_s, facet:{ 
> starz:{type:terms, field:stars_i, domain:{blockChildren:'type_s:book'}} }}}",
>   "indent": "true",
>   "wt": "json",
>   "_": "1448309900982"
> }
>   },
>   "response": {
> "numFound": 2,
> "start": 0,
> "docs": [
>   {
> "id": "book1",
> "type_s": "book",
> "title_t": "The Way of Kings",
> "author_s": "Brandon Sanderson",
> "cat_s": "fantasy",
> "pubyear_i": 2010,
> "publisher_s": "Tor",
> "_version_": 1518570756086169600
>   },
>   {
> "id": "book2",
> "type_s": "book",
> "title_t": "Snow Crash",
> "author_s": "Neal Stephenson",
> "cat_s": "sci-fi",
> "pubyear_i": 1992,
> "publisher_s": "Bantam",
> "_version_": 1518570908026929200
>   }
> ]
>   },
>   "facets": {
> "count": 2,
> "catz": {
>   "buckets": [
> {
>   "val": "fantasy",
>   "count": 1,
>   "starz": {
> "buckets": [
>   {
> "val": 3,
> "count": 1
>   },
>   {
> "val": 5,
> "count": 1
>   }
> ]
>   }
> },
> {
>   "val": "sci-fi",
>   "count": 1,
>   "starz": {
> "buckets": [
>   {
> "val": 2,
> "count": 2
>   },
>   {
> "val": 4,
> "count": 1
>   },
>   {
> "val": 5,
> "count": 1
>   }
> ]
>   }
> }
>   ]
> }
>   }
> }
>
> It works well with *:* too.
>
>
> On Mon, Nov 23, 2015 at 12:56 AM, Yonik Seeley  wrote:
>
>> On Sun, Nov 22, 2015 at 3:10 PM, Mikhail Khludnev
>>  wrote:
>> > Hello,
>> >
>> > I also played with json.facet, but couldn't achieve the desired result
>> too.
>> >
>> > Yonik, Alessandro,
>> > Do you think it's a new feature or it can be achieved with the current
>> > implementation?
>>
>> Not sure if I'm misunderstanding the example, but it looks
>> straight-forward.
>>
>> terms facet on parent documents, with sub-facet on child documents.
>> I just committed a test for this, and it worked fine.  See
>> TestJsonFacets.testBlockJoin()
>>
>> Can we see an example of a parent document being indexed (i.e. along
>> with it's child documents)?
>>
>> -Yonik
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>


Json Facet api on nested doc

2015-11-19 Thread xavi jmlucjav
Hi,

I am trying to get some faceting with the json facet api on nested doc, but
I am having issues. Solr 5.3.1.

This query gets the bucket numbers ok:

curl http://shost:8983/solr/collection1/query -d 'q=*:*&rows=0&
 json.facet={
   yearly-salaries : {
     type: terms,
     field: salary,
     domain: { blockChildren : "parent:true" }
   }
 }
'
Salary is a field in child docs only. But if I add another facet outside
it, the inner one returns no data:

curl http://shost:8983/solr/collection1/query -d 'q=*:*&rows=0&
 json.facet={
   department:{
     type: terms,
     field: department,
     facet:{
       yearly-salaries : {
         type: terms,
         field: salary,
         domain: { blockChildren : "parent:true" }
       }
     }
   }
 }
'
Results in:

"facets":{

 "count":3144071,

"department":{

"buckets":[{

"val":"Development",

"count":85707,

"yearly-salaries":{

"buckets":[]}},


department is a field only in parent docs. Am I doing something wrong or
missing something?
thanks
xavi


Schemaless mode and DIH

2015-08-06 Thread xavi jmlucjav
hi,

While working with DIH, I tried schemaless mode and found out it does not
work if you are indexing with DIH. I could not find any issue or reference
to this on the mailing list, even though I find it a bit surprising that
nobody has tried that combination so far. Did anybody test this before?

I managed to fix it for my small use case, I opened a ticket for it with
the patch https://issues.apache.org/jira/browse/SOLR-7882

thanks


BlendedInfixLookupFactory does not respect suggest.count in 5.2?

2015-06-07 Thread xavi jmlucjav
Hi,

I have a setup with AnalyzingInfixLookupFactory where suggest.count works.
But if I just replace:
s/AnalyzingInfixLookupFactory/BlendedInfixLookupFactory
suggest.count is not respected anymore; all suggestions are returned,
making it virtually useless.
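
For reference, the kind of request this is about looks something like the
following (handler path and dictionary name are assumptions); with
AnalyzingInfixLookupFactory the count is honoured, with
BlendedInfixLookupFactory it is not:

  curl "http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.q=one%20t&suggest.count=5"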

I am using RC4, which I believe is also the one being released.

xavi


Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
Thanks Toke for the input.

I think the plan is to facet only on class_u1, class_u2 for queries from
user1, etc. So faceting would not happen on all fields on a single query.
But still.

I did not design the schema; I just found out about the number of fields and
advised against it when they asked for a second opinion. We did not get to
discuss a different schema, but if we get to that point I will take that
plan into consideration for sure.

xavi

On Sat, May 30, 2015 at 10:17 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 xavi jmlucjav jmluc...@gmail.com wrote:
  The reason for such a large number of fields:
  - users create dynamically 'classes' of documents, say one user creates
 10
  classes on average
  - for each 'class', the fields are created like this:
 unique_id_+fieldname
  - there are potentially hundreds of thousands of users.

 Switch to a scheme where you control the names of fields outside of Solr,
 but share the fields internally:

 User 1 has 10 custom classes: u1_a, u1_b, u1_c, ... u1_j
 Internally they are mapped to class1, class2, class3, ... class10

 User 2 uses 2 classes: u2_horses, u2_elephants
 Internally they are mapped to class1, class2

 When User 2 queries field u2_horses, you rewrite the query to use class1
 instead.

  There is faceting in each users' fields.
  So this will result in 1M fields, very sparsely populated.

 If you are faceting on all of them and if you are not using DocValues,
 this will explode your memory requirements with vanilla Solr: UnInverted
 faceting maintains a separate map from all documentIDs to field values
 (ordinals for Strings) for _all_ the facet fields. Even if you only had 10
 million documents and even if your 1 million facet fields all had just 1
 value, represented by 1 bit, it would still require 10M * 1M * 1 bits in
 memory, which is 10 terabits (over a terabyte) of RAM.

 - Toke Eskildsen



Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
The reason for such a large number of fields:
- users create dynamically 'classes' of documents, say one user creates 10
classes on average
- for each 'class', the fields are created like this: unique_id_+fieldname
- there are potentially hundreds of thousands of users.

There is faceting in each users' fields.

So this will result in 1M fields, very sparsely populated. I warned them
this did not sound like a good design to me, but apparently someone very
knowledgeable in Solr said it would work out fine. That is why I wanted to
double check...

On Sat, May 30, 2015 at 9:22 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Anything more than a few hundred seems very suspicious.

 Anything more than a few dozen or 50 or 75 seems suspicious as well.

 The point should not be how crazy can you get with Solr, but that craziness
 should be avoided altogether!

 Solr's design is optimal for a large number of relatively small documents,
 not large documents.


 -- Jack Krupansky

 On Sat, May 30, 2015 at 3:05 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Nothing's really changed in that area lately. Your co-worker is
  perhaps confusing the statement that Solr has no a-priori limit on
  the number of distinct fields that can be in a corpus with supporting
  an infinite number of fields. Not having a built-in limit is much
  different than supporting
 
  Whether Solr breaks with thousands and thousands of fields is pretty
  dependent on what you _do_ with those fields. Simply doing keyword
  searches isn't going to put the same memory pressure on as, say,
  faceting on them all (even if in different queries).
 
  I'd really ask why so many fields are necessary though.
 
  Best,
  Erick
 
  On Sat, May 30, 2015 at 6:18 AM, xavi jmlucjav jmluc...@gmail.com
 wrote:
   Hi guys,
  
   someone I work with has been advised that currently Solr can support
   'infinite' number of fields.
  
   I thought there was a practical limitation of say thousands of fields
  (for
   sure less than a million), or things can start to break (I think I
   remember seeing memory issues reported on the mailing list by several
   people).
  
  
   Was there any change I missed lately that makes having say 1M fields in
   Solr practical??
  
   thanks
 



Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
On Sat, May 30, 2015 at 11:15 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 xavi jmlucjav jmluc...@gmail.com wrote:
  I think the plan is to facet only on class_u1, class_u2 for queries from
  user1, etc. So faceting would not happen on all fields on a single query.

 I understand that, but most of the created structures stays in memory
 between calls (DocValues helps here). Your heap will slowly fill up as more
 and more users perform faceted queries on their content.

got it...priceless info, thanks!



 - Toke Eskildsen



any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
Hi guys,

someone I work with has been advised that currently Solr can support
'infinite' number of fields.

I thought there was a practical limitation of, say, thousands of fields (for
sure less than a million), or things can start to break (I think I
remember seeing memory issues reported on the mailing list by several
people).


Was there any recent change I missed that makes having, say, 1M fields in
Solr practical?

thanks