Re: Solr Basic Configuration - Highlight - Beginner

2015-12-17 Thread Teague James
Erik's comments notwithstanding, there are some gaps in my understanding
of your precise situation. Here are a few things that weren't necessarily
obvious to me when I took my first try with Solr.

Highlighting is the end result of a good hit. It is essentially formatting
applied to your hit. Under certain conditions it is possible to get a hit
without a highlight.

First, start by making sure you are indexing your target (a PDF file?)
correctly. Assuming you are indexing PDFs, are you extracting metadata
only, or are you parsing the document with Tika? If you want hits on the
contents of your PDF, then you have to parse it at index time and store
the content. That is why I suggested just running some queries through the
interface and the URL to see what Solr actually captured from your indexed
PDF before worrying about how it looks on the screen.
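
For example, a quick sanity check could look something like this (the
collection and field names here are placeholders for your own):

http://localhost:8983/solr/yourcollection/select?q=content:someword&fl=id,content&wt=json

If the stored content comes back empty or mangled, the problem is upstream
of highlighting.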

Next, you should look carefully at the Analyzer's output. Notice the
abbreviations to the left of the columns? Hover over those to see which
filter factory each one is. When words are split into multiple columns at
one of those points, it indicates that the filter factory broke apart the
word while analyzing it. Do a search for the filter factories that you
find and read up on them. In my case "1a" was being split into four tokens
by a word delimiter filter factory - "1a", "1", "a", "1a" - which caused
highlighting to fail while still getting a hit. It also caused erroneous
hits elsewhere. Adding some switches to the schema is all it took to
correct that for me. However, every case is different based on your needs.
That is why it is important to go through the analyzer and see if Solr's
indexing and querying are doing what you expect.
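
For the record, the "switches" are attributes on the word delimiter filter
in schema.xml. A hypothetical example (the right combination depends
entirely on your data):

<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
        generateWordParts="0" generateNumberParts="0" catenateAll="0"
        preserveOriginal="1"/>

With splitOnNumerics="0" and preserveOriginal="1", a token like "1a" is
kept intact instead of being broken apart.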

If that looks good and you've got solid hits all the way down, then it is
time to start looking at your highlighter implementation in the index and
query analyzers that you are using. My original issue of not being able to
highlight phrases with one set of tags necessitated switching to the
fast vector highlighter - which had its own requirements for certain
parameters to be set. Here again, going to the Solr docs and reading up on
the various highlighters will be helpful in most cases.
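
For reference, a sketch of what the fast vector highlighter required in my
case (field names are placeholders; check the docs for your version). The
highlighted field needs term vectors in schema.xml:

<field name="text" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

and the query needs to ask for it, e.g.
&hl=true&hl.fl=text&hl.useFastVectorHighlighter=true. Remember to reindex
after adding the term vector attributes.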

Solr has a very steep learning curve. I've been using it for several years
and I still consider myself a noob. It can be a deep dive, but don't be
discouraged. Keep at it. Cheers!

-Teague

On Wed, Dec 16, 2015 at 8:54 PM, Evert R.  wrote:

> Hi Erick and Teague,
>
>
> I found that when using the field 'text' it shows the pdf file result
> id:pdf1 in this case, like:
>
> http://localhost:8983/solr/techproducts/select?fq=id:pdf1&q=nietava
>
> but when highlighting using the text field... nothing comes up...
>
>
> http://localhost:8983/solr/techproducts/select?q=text:nietava&fq=id:pdf1&wt=json&indent=true&hl=true&hl.fl=text&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
>
> or even with the option
>
> f.text.hl.snippets=2 under the hl.fl field.
>
>
> I tried as well with the standard configuration, did it all over, reindexed
> a couple times... and still did not work.
>
> Also,
>
> Using the Analysis screen, it brings up the information below:
>
> ST
>   text: nietava | raw_bytes: [6e 69 65 74 61 76 61] | start: 0 | end: 7 | positionLength: 1 | position: 1
> SF
>   text: nietava | raw_bytes: [6e 69 65 74 61 76 61] | start: 0 | end: 7 | positionLength: 1 | position: 1
> LCF
>   text: nietava | raw_bytes: [6e 69 65 74 61 76 61] | start: 0 | end: 7 | positionLength: 1 | position: 1
>
> Alphanumeric, I think... so it's 'string', right? Would that be a problem?
> Should it be some other indication?
>
>
> Thanks again!
>
>
> *Evert*
>
> 2015-12-16 21:09 GMT-02:00 Erick Erickson :
>
> > I think you're still missing the critical bit. Highlighting is
> > completely separate from searching. In other words, you can search on
> > one field and highlight another. What field is searched is governed by
> > the "qf" parameter when using edismax and by the the "df" parameter
> > configured in your request handler in solrconfig.xml. These defaults
> > are overridden when you do a "fielded search" like
> >
> > q=content:nietava
> >
> > So this: q=content:nietava&hl=true&hl.fl=content
> > is searching the "content" field. The word you're looking for isn't in
> > the content field so naturally no docs are returned. And no
> > highlighting either.
> >
> > This: q=nietava&hl=true&hl.fl=content
> >
> > is searching somewhere else, thus getting the hit. We already know
> > that "nietava" is not in the content field because the first search
> > failed. You need to find out what field is being matched (probably
> > something like "text") and then try highlighting on _that_ field. Try
> > adding "debug=query" to the URL and look at the "parsed_query" section
> > of the return and you'll see what field(s) is/are actually being
> > searched against.
> >
> > NOTE: The field you highlight on _must_ have stored="true" in schema.xml.
> >
> > As to why "nietava" isn't being found in the content field, probably
> > you have some kind of analysis chain configured for that field that
> 

add text analyzer to solr

2015-12-17 Thread sara hajili
Hi,
I want to change a Solr analyzer, specifically normalization, because
Solr's default normalization for the Persian language doesn't satisfy me,
so I started reading about Solr plugins and tried to implement my own
PersianNormalization.
Now I have 2 classes, in this way:
class persianNormalizer extends TokenFilter
and another class:
class persianNormalizerFilterFactory extends TokenFilterFactory
Then I created a jar file from these 2 classes and added this jar file to
solr_home/dist and solr_home/contrib/extraction
(I selected these 2 folders because in solrConf.xml the lib tags name
these 2 directories as lib dirs),
and then in schema.xml I created one text field that uses
PersianAnlyzerFilterFactory, in this way:

But when Solr comes up, I get an error.
How do I solve my problem? Can anyone tell me how to use my normalizer in
Solr, or point me to an existing tutorial that explains how to write a
text analyzer and use it in Solr, step by step?
The error that I got in Solr when adding the core was the one below, and
the core didn't get added to Solr.
Please help me.
org.apache.solr.common.SolrException: Error CREATEing SolrCore 'post':
Unable to create core [post] Caused by:
solr.PersianCustomNormalizerFilterFactory
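
For anyone hitting the same kind of "Unable to create core ... Caused by:
<factory class>" error: below is a minimal, untested sketch of such a
filter/factory pair against the Lucene/Solr 5.x API (two source files shown
together; the package name and the sample character folding are made up).
The factory needs exactly this public Map-taking constructor, and the class
name the schema references must match the one in the jar:

package my.analysis; // hypothetical package

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public final class PersianCustomNormalizerFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PersianCustomNormalizerFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Sample normalization only: fold Arabic Yeh (U+064A) to Farsi Yeh (U+06CC).
    char[] buffer = termAtt.buffer();
    for (int i = 0; i < termAtt.length(); i++) {
      if (buffer[i] == '\u064A') {
        buffer[i] = '\u06CC';
      }
    }
    return true;
  }
}

// second file, same package:
public class PersianCustomNormalizerFilterFactory extends TokenFilterFactory {
  // Solr instantiates factories reflectively and hands over the attributes
  // from schema.xml as a Map, so this constructor signature is required.
  public PersianCustomNormalizerFilterFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new PersianCustomNormalizerFilter(input);
  }
}

The filter can then be referenced from the analyzer chain by its
fully-qualified factory name, e.g.
<filter class="my.analysis.PersianCustomNormalizerFilterFactory"/>, and the
jar has to sit in a directory named by a <lib> directive in solrconfig.xml.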


Re: add text analyzer to solr

2015-12-17 Thread Binoy Dalal
Why don't you post the entire stack trace from the logs? That might give us
a better idea of how to help you.

On Thu, 17 Dec 2015, 13:59 sara hajili  wrote:

> Hi,
> I want to change a Solr analyzer, specifically normalization, because
> Solr's default normalization for the Persian language doesn't satisfy me,
> so I started reading about Solr plugins and tried to implement my own
> PersianNormalization.
> Now I have 2 classes, in this way:
> class persianNormalizer extends TokenFilter
> and another class:
> class persianNormalizerFilterFactory extends TokenFilterFactory
> Then I created a jar file from these 2 classes and added this jar file to
> solr_home/dist and solr_home/contrib/extraction
> (I selected these 2 folders because in solrConf.xml the lib tags name
> these 2 directories as lib dirs),
> and then in schema.xml I created one text field that uses
> PersianAnlyzerFilterFactory, in this way:
>
> But when Solr comes up, I get an error.
> How do I solve my problem? Can anyone tell me how to use my normalizer in
> Solr, or point me to an existing tutorial that explains how to write a
> text analyzer and use it in Solr, step by step?
> The error that I got in Solr when adding the core was the one below, and
> the core didn't get added to Solr.
> Please help me.
> org.apache.solr.common.SolrException: Error CREATEing SolrCore 'post':
> Unable to create core [post] Caused by:
> solr.PersianCustomNormalizerFilterFactory
>
-- 
Regards,
Binoy Dalal


Re: pf2 pf3 and stopwords

2015-12-17 Thread Binoy Dalal
For this case of inversion in particular, a slop of 1 won't cause any issues,
since such a reverse match would require a slop of 2.

On Thu, 17 Dec 2015, 14:20 elisabeth benoit 
wrote:

> Inversion (paris charonne or charonne paris) cannot be scored the same.
>
> 2015-12-16 11:08 GMT+01:00 Binoy Dalal :
>
> > What is your exact use case?
> >
> > On Wed, 16 Dec 2015, 13:40 elisabeth benoit 
> > wrote:
> >
> > > Thanks for your answer.
> > >
> > > Actually, using a slop of 1 is something I can't do (because of other
> > > specifications)
> > >
> > > I guess I'll index differently.
> > >
> > > Best regards,
> > > Elisabeth
> > >
> > > 2015-12-14 16:24 GMT+01:00 Binoy Dalal :
> > >
> > > > Moreover, the stopword de will work on your queries and not on your
> > > > documents, meaning if you query 'Gare de Saint Lazare', the terms
> > > actually
> > > > searched for will be Gare Saint and Lazare, 'de' will be filtered
> out.
> > > >
> > > > On Mon, Dec 14, 2015 at 8:49 PM Binoy Dalal 
> > > > wrote:
> > > >
> > > > > This isn't a bug. During pf3 matching, since your query has only
> > three
> > > > > tokens, the entire query will be treated as a single phrase, and
> with
> > > > slop
> > > > > = 0, any word that comes in the middle of your query  - 'de' in
> this
> > > case
> > > > > will cause the phrase to not be matched. If you want to get around
> > > this,
> > > > > try setting your slop = 1 in which case it should match Gare Saint
> > > Lazare
> > > > > even with the de in it.
> > > > >
> > > > > On Mon, Dec 14, 2015 at 7:22 PM elisabeth benoit <
> > > > > elisaelisael...@gmail.com> wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> I am using solr 4.10.1. I have a field with stopwords
> > > > >>
> > > > >>
> > > > >> <filter class="solr.StopFilterFactory" words="stopwords.txt"
> > > > >> enablePositionIncrements="true"/>
> > > > >>
> > > > >> And I use pf2 pf3 on that field with a slop of 0.
> > > > >>
> > > > >> If the request is "Gare Saint Lazare", and I have a document "Gare
> > de
> > > > >> Saint
> > > > >> Lazare", "de" being a stopword, this document doesn't get the pf3
> > > boost,
> > > > >> because of "de".
> > > > >>
> > > > >> I was wondering, is this normal? is this a bug? is something wrong
> > > with
> > > > my
> > > > >> configuration?
> > > > >>
> > > > >> Best regards,
> > > > >> Elisabeth
> > > > >>
> > > > > --
> > > > > Regards,
> > > > > Binoy Dalal
> > > > >
> > > > --
> > > > Regards,
> > > > Binoy Dalal
> > > >
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>
-- 
Regards,
Binoy Dalal


Re: Issues when indexing PDF files

2015-12-17 Thread Zheng Lin Edwin Yeo
Hi Alexandre,

Thanks for your reply.

So the only way to solve this issue is to explore with PDF specific tools
and change the encoding of the file?
Is there any way to configure it in Solr?

Regards,
Edwin


On 17 December 2015 at 15:42, Alexandre Rafalovitch 
wrote:

> They could be using custom fonts and non-Unicode characters. That's
> probably something to explore with PDF specific tools.
> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
> wrote:
>
> > I've checked all the files which has problem with the content in the Solr
> > index using the Tika app. All of them shows the same issues as what I see
> > in the Solr index.
> >
> > So does the issues lies with the encoding of the file? Are we able to
> check
> > the encoding of the file?
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
> > wrote:
> >
> > > Hi Erik,
> > >
> > > I've shared the file on dropbox, which you can access via the link
> here:
> > >
> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> > >
> > > This is what I get from the Tika app after dropping the file in.
> > >
> > > Content-Length: 75092
> > > Content-Type: application/pdf
> > > Type: COSName{Info}
> > > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > > X-TIKA:digest:SHA256:
> > > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > > access_permission:assemble_document: true
> > > access_permission:can_modify: true
> > > access_permission:can_print: true
> > > access_permission:can_print_degraded: true
> > > access_permission:extract_content: true
> > > access_permission:extract_for_accessibility: true
> > > access_permission:fill_in_form: true
> > > access_permission:modify_annotations: true
> > > dc:format: application/pdf; version=1.3
> > > pdf:PDFVersion: 1.3
> > > pdf:encrypted: false
> > > producer: null
> > > resourceName: Desmophen+670+BAe.pdf
> > > xmpTPg:NPages: 3
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 17 December 2015 at 00:15, Erik Hatcher 
> > wrote:
> > >
> > >> Edwin - Can you share one of those PDF files?
> > >>
> > >> Also, drop the file into the Tika app and see what it sees directly -
> > get
> > >> the tika-app JAR and run that desktop application.
> > >>
> > >> Could be an encoding issue?
> > >>
> > >> Erik
> > >>
> > >> —
> > >> Erik Hatcher, Senior Solutions Architect
> > >> http://www.lucidworks.com 
> > >>
> > >>
> > >>
> > >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
> > edwinye...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I'm using Solr 5.3.0
> > >> >
> > >> > I'm indexing some PDF documents. However, for certain PDF files,
> there
> > >> are
> > >> > chinese text in the documents, but after indexing, what is indexed
> in
> > >> the
> > >> > content is either a series of "??" or an empty content.
> > >> >
> > >> > I'm using the post.jar that comes together with Solr.
> > >> >
> > >> > What could be the reason that causes this?
> > >> >
> > >> > Regards,
> > >> > Edwin
> > >>
> > >>
> > >
> >
>


Re: pf2 pf3 and stopwords

2015-12-17 Thread elisabeth benoit
Inversion (paris charonne or charonne paris) cannot be scored the same.

2015-12-16 11:08 GMT+01:00 Binoy Dalal :

> What is your exact use case?
>
> On Wed, 16 Dec 2015, 13:40 elisabeth benoit 
> wrote:
>
> > Thanks for your answer.
> >
> > Actually, using a slop of 1 is something I can't do (because of other
> > specifications)
> >
> > I guess I'll index differently.
> >
> > Best regards,
> > Elisabeth
> >
> > 2015-12-14 16:24 GMT+01:00 Binoy Dalal :
> >
> > > Moreover, the stopword de will work on your queries and not on your
> > > documents, meaning if you query 'Gare de Saint Lazare', the terms
> > actually
> > > searched for will be Gare Saint and Lazare, 'de' will be filtered out.
> > >
> > > On Mon, Dec 14, 2015 at 8:49 PM Binoy Dalal 
> > > wrote:
> > >
> > > > This isn't a bug. During pf3 matching, since your query has only
> three
> > > > tokens, the entire query will be treated as a single phrase, and with
> > > slop
> > > > = 0, any word that comes in the middle of your query  - 'de' in this
> > case
> > > > will cause the phrase to not be matched. If you want to get around
> > this,
> > > > try setting your slop = 1 in which case it should match Gare Saint
> > Lazare
> > > > even with the de in it.
> > > >
> > > > On Mon, Dec 14, 2015 at 7:22 PM elisabeth benoit <
> > > > elisaelisael...@gmail.com> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> I am using solr 4.10.1. I have a field with stopwords
> > > >>
> > > >>
> > > >> <filter class="solr.StopFilterFactory" words="stopwords.txt"
> > > >> enablePositionIncrements="true"/>
> > > >>
> > > >> And I use pf2 pf3 on that field with a slop of 0.
> > > >>
> > > >> If the request is "Gare Saint Lazare", and I have a document "Gare
> de
> > > >> Saint
> > > >> Lazare", "de" being a stopword, this document doesn't get the pf3
> > boost,
> > > >> because of "de".
> > > >>
> > > >> I was wondering, is this normal? is this a bug? is something wrong
> > with
> > > my
> > > >> configuration?
> > > >>
> > > >> Best regards,
> > > >> Elisabeth
> > > >>
> > > > --
> > > > Regards,
> > > > Binoy Dalal
> > > >
> > > --
> > > Regards,
> > > Binoy Dalal
> > >
> >
> --
> Regards,
> Binoy Dalal
>


Partial update through DIH

2015-12-17 Thread Midas A
Hi,
Can we do a partial update through the Data Import Handler?

Regards,
Abhishek


RE: Expected mime type application/octet-stream but got text/html

2015-12-17 Thread Markus Jelsma
Hi - looks like Solr did not start up correctly, got some errors and kept Jetty 
running. You should find information in that node's logs.
M.
 
 
-Original message-
> From:Andrej van der Zee 
> Sent: Thursday 17th December 2015 10:32
> To: solr-user@lucene.apache.org
> Subject: Expected mime type application/octet-stream but got text/html
> 
> Hi,
> 
> I am having trouble getting data from a particular shard, even though I
> follow the documentation:
> 
> https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
> 
> This is OK:
> 
>  curl "
> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true
> {
>// returns correct result set
> }
> 
> But this is NOT OK when I specify a particular shard:
> 
> curl "
> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&shards=54.93.121.54:8986/solr
> "
> 
> {
>   "responseHeader":{
> "status":404,
> "QTime":5,
> "params":{
>   "q":"*:*",
>   "shards":"54.93.121.54:8986/solr",
>   "indent":"true",
>   "rows":"1000",
>   "wt":"json"}},
>   "error":{
> "msg":"Error from server at http://54.93.121.54:8986/solr: Expected
> mime type application/octet-stream but got text/html. \n\n http-equiv=\"Content-Type\" content=\"text/html;
> charset=UTF-8\"/>\nError 404 Not
> Found\n\nHTTP ERROR 404\nProblem accessing
> /solr/select. Reason:\nNot FoundPowered by
> Jetty://\n\n\n\n",
> "code":404}}
> 
> Any idea?
> 
> Thanks,
> Andrej
> 


Re: warning while indexing

2015-12-17 Thread Mikhail Khludnev
On Thu, Dec 17, 2015 at 8:00 AM, Midas A  wrote:

>
> org.apache.solr.update.CommitTracker._scheduleCommitWithinIfNeeded(CommitTracker.java:118)
>

It seems like you specify commitWithin; that's legal but seems unusual and
doubtful with DIH.

> > rejected from java.util.concurrent.ScheduledThreadPoolExecutor@79f8b5f
> > > [*Terminated*,
> > > pool size = 0, active thre
>
It seems like the CommitTracker already closed its thread pool, which may happen
if the SolrCore is already closed or is reloading.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Partial update through DIH

2015-12-17 Thread Mikhail Khludnev
Hmm, it's interesting. According to the code, you can create a transformer
which does what is described at http://yonik.com/solr/atomic-updates/
under *Atomic Updates with SolrJ*.

It should/might work, but I've never tried it.
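
To make that concrete, here is an untested sketch of such a transformer
(package and class names are made up; it is grounded only in the DIH
Transformer API and the atomic-update syntax from the page above). It wraps
every non-key column in the {"set": value} map form that marks it as an
atomic update:

package my.dih; // hypothetical package

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class AtomicUpdateTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Map<String, Object> out = new HashMap<>();
    for (Map.Entry<String, Object> e : row.entrySet()) {
      if ("id".equals(e.getKey())) {
        // The unique key must stay a plain value so Solr can locate the doc.
        out.put(e.getKey(), e.getValue());
      } else {
        // {"set": value} asks Solr to atomically replace just this field.
        Map<String, Object> atomic = new HashMap<>();
        atomic.put("set", e.getValue());
        out.put(e.getKey(), atomic);
      }
    }
    return out;
  }
}

It would be wired in via the entity's transformer attribute, e.g.
<entity ... transformer="my.dih.AtomicUpdateTransformer">.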

On Thu, Dec 17, 2015 at 12:26 PM, Midas A  wrote:

> Hi,
> can be do partial update trough Data import handler .
>
> Regards,
> Abhishek
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Issues when indexing PDF files

2015-12-17 Thread Binoy Dalal
You can always write an update handler plugin to convert your PDFs to UTF-8
and then push them to Solr.

On Thu, 17 Dec 2015, 14:16 Zheng Lin Edwin Yeo  wrote:

> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific tools
> and change the encoding of the file?
> Is there any way to configure it in Solr?
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
> wrote:
>
> > They could be using custom fonts and non-Unicode characters. That's
> > probably something to explore with PDF specific tools.
> > On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
> > wrote:
> >
> > > I've checked all the files which has problem with the content in the
> Solr
> > > index using the Tika app. All of them shows the same issues as what I
> see
> > > in the Solr index.
> > >
> > > So does the issues lies with the encoding of the file? Are we able to
> > check
> > > the encoding of the file?
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> > > wrote:
> > >
> > > > Hi Erik,
> > > >
> > > > I've shared the file on dropbox, which you can access via the link
> > here:
> > > >
> > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> > > >
> > > > This is what I get from the Tika app after dropping the file in.
> > > >
> > > > Content-Length: 75092
> > > > Content-Type: application/pdf
> > > > Type: COSName{Info}
> > > > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > > > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > > > X-TIKA:digest:SHA256:
> > > > d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > > > access_permission:assemble_document: true
> > > > access_permission:can_modify: true
> > > > access_permission:can_print: true
> > > > access_permission:can_print_degraded: true
> > > > access_permission:extract_content: true
> > > > access_permission:extract_for_accessibility: true
> > > > access_permission:fill_in_form: true
> > > > access_permission:modify_annotations: true
> > > > dc:format: application/pdf; version=1.3
> > > > pdf:PDFVersion: 1.3
> > > > pdf:encrypted: false
> > > > producer: null
> > > > resourceName: Desmophen+670+BAe.pdf
> > > > xmpTPg:NPages: 3
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 17 December 2015 at 00:15, Erik Hatcher 
> > > wrote:
> > > >
> > > >> Edwin - Can you share one of those PDF files?
> > > >>
> > > >> Also, drop the file into the Tika app and see what it sees directly
> -
> > > get
> > > >> the tika-app JAR and run that desktop application.
> > > >>
> > > >> Could be an encoding issue?
> > > >>
> > > >> Erik
> > > >>
> > > >> —
> > > >> Erik Hatcher, Senior Solutions Architect
> > > >> http://www.lucidworks.com 
> > > >>
> > > >>
> > > >>
> > > >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
> > > edwinye...@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > I'm using Solr 5.3.0
> > > >> >
> > > >> > I'm indexing some PDF documents. However, for certain PDF files,
> > there
> > > >> are
> > > >> > chinese text in the documents, but after indexing, what is indexed
> > in
> > > >> the
> > > >> > content is either a series of "??" or an empty content.
> > > >> >
> > > >> > I'm using the post.jar that comes together with Solr.
> > > >> >
> > > >> > What could be the reason that causes this?
> > > >> >
> > > >> > Regards,
> > > >> > Edwin
> > > >>
> > > >>
> > > >
> > >
> >
>
-- 
Regards,
Binoy Dalal


Re: faceting is unusable slow since upgrade to 5.3.0

2015-12-17 Thread Mikhail Khludnev
This fix definitely helps facet.field over docValues fields on a
multi-segment index since 5.4.
I suppose it's irrelevant to JSON Facets, non-dv fields, and pre-5.4.
I cannot comment on comparing the performance of dv and non-dv fields,
because "it depends" (c): benchmarking and a profiler are the only advisers.

On Thu, Dec 17, 2015 at 9:22 AM, William Bell  wrote:

> Same question here
>
> Wondering if faceting performance is fixed and how to take advantage of it
> ?
>
> On Wed, Dec 16, 2015 at 2:57 AM, Vincenzo D'Amore 
> wrote:
>
> > Hi all,
> >
> > given that solr 5.4 is finally released, is this what's more stable and
> > efficient version of solrcloud ?
> >
> > I have a website which receives many search requests. It serve normally
> > about 2000 concurrent requests, but sometime there are peak from 4000 to
> > 1 requests in few seconds.
> >
> > On January I'll have a chance to upgrade my old SolrCloud 4.8.1 cluster
> to
> > a new brand version, but following this thread I read about the problems
> > that can occur upgrading to latest version.
> >
> > I have seen that issue SOLR-7730 "speed-up faceting on doc values fields"
> > is fixed in 5.4.
> >
> > I'm using standard faceting without docValues. Should I add docValues in
> > order to benefit of such fix?
> >
> > Best regards,
> > Vincenzo
> >
> >
> >
> > On Thu, Oct 8, 2015 at 2:22 PM, Mikhail Khludnev <
> > mkhlud...@griddynamics.com
> > > wrote:
> >
> > > Uwe, it's good to know! I mean that you've recovered. Take care!
> > >
> > > On Thu, Oct 8, 2015 at 1:24 PM, Uwe Reh 
> > > wrote:
> > >
> > > > Sorry for the delay. I had an ugly flu.
> > > >
> > > > SOLR-7730 seems to work fine. Using docValues with Solr
> > > > 5.4.0-2015-09-29_08-29-55 1705813 makes my faceted queries fast
> again.
> > > > (90ms vs. 2ms) :-)
> > > >
> > > > Thanks
> > > > Uwe
> > > >
> > > >
> > > >
> > > >
> > > > Am 27.09.2015 um 20:32 schrieb Mikhail Khludnev:
> > > >
> > > >> On Sun, Sep 27, 2015 at 2:00 PM, Uwe Reh <
> r...@hebis.uni-frankfurt.de>
> > > >> wrote:
> > > >>
> > > >> When 5.4 with SOLR-7730 will be released, I will start to use
> > docValues.
> > > >>> Going this way, seems more straight forward to me.
> > > >>>
> > > >>
> > > >>
> > > >> Sure. Giving your answers docValues facets has a really good chance
> to
> > > >> perform in your index after SOLR-7730. It's really interesting to
> see
> > > >> performance numbers on early 5.4 builds:
> > > >>
> > > >>
> > >
> >
> https://builds.apache.org/view/All/job/Solr-Artifacts-5.x/lastSuccessfulBuild/artifact/solr/package/
> > > >>
> > > >>
> > > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > 
> > > 
> > >
> >
> >
> >
> > --
> > Vincenzo D'Amore
> > email: v.dam...@gmail.com
> > skype: free.dev
> > mobile: +39 349 8513251
> >
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Expected mime type application/octet-stream but got text/html

2015-12-17 Thread Andrej van der Zee
Hi,

I am having trouble getting data from a particular shard, even though I
follow the documentation:

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests

This is OK:

 curl "
http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true
{
   // returns correct result set
}

But this is NOT OK when I specify a particular shard:

curl "
http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&shards=54.93.121.54:8986/solr
"

{
  "responseHeader":{
"status":404,
"QTime":5,
"params":{
  "q":"*:*",
  "shards":"54.93.121.54:8986/solr",
  "indent":"true",
  "rows":"1000",
  "wt":"json"}},
  "error":{
"msg":"Error from server at http://54.93.121.54:8986/solr: Expected
mime type application/octet-stream but got text/html. \n\n\nError 404 Not
Found\n\nHTTP ERROR 404\nProblem accessing
/solr/select. Reason:\nNot FoundPowered by
Jetty://\n\n\n\n",
"code":404}}

Any idea?

Thanks,
Andrej


A field _indexed_at_tdt added when I index documents.

2015-12-17 Thread Guillermo Ortiz
When I index documents in Solr with Spark, a field _indexed_at_tdt is
added that doesn't exist in my documents.

I have added this field to my schema, but why is this field being added? Any
solution?


Re: Expected mime type application/octet-stream but got text/html

2015-12-17 Thread Andrej van der Zee
It turns out that the documentation is not correct. If I specify the
collection name after shards=, it does work as expected. So this works:
curl "
http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&rows=1000&shards=54.93.121.54:8986/solr/connects
"

This does not work:
curl "
http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&rows=1000&shards=54.93.121.54:8986/solr
"

So I guess the documentation needs an update?

Cheers,
Andrej


On Thu, Dec 17, 2015 at 10:36 AM, Markus Jelsma 
wrote:

> Hi - looks like Solr did not start up correctly, got some errors and kept
> Jetty running. You should find information in that node's logs.
> M.
>
>
> -Original message-
> > From:Andrej van der Zee 
> > Sent: Thursday 17th December 2015 10:32
> > To: solr-user@lucene.apache.org
> > Subject: Expected mime type application/octet-stream but got text/html
> >
> > Hi,
> >
> > I am having trouble getting data from a particular shard, even though I
> > follow the documentation:
> >
> > https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
> >
> > This is OK:
> >
> >  curl "
> >
> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true
> > {
> >// returns correct result set
> > }
> >
> > But this is NOT OK when I specify a particular shard:
> >
> > curl "
> >
> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&shards=54.93.121.54:8986/solr
> > "
> >
> > {
> >   "responseHeader":{
> > "status":404,
> > "QTime":5,
> > "params":{
> >   "q":"*:*",
> >   "shards":"54.93.121.54:8986/solr",
> >   "indent":"true",
> >   "rows":"1000",
> >   "wt":"json"}},
> >   "error":{
> > "msg":"Error from server at http://54.93.121.54:8986/solr: Expected
> > mime type application/octet-stream but got text/html.
> \n\n > http-equiv=\"Content-Type\" content=\"text/html;
> > charset=UTF-8\"/>\nError 404 Not
> > Found\n\nHTTP ERROR 404\nProblem
> accessing
> > /solr/select. Reason:\nNot FoundPowered
> by
> > Jetty://\n\n\n\n",
> > "code":404}}
> >
> > Any idea?
> >
> > Thanks,
> > Andrej
> >
>


RE: Issues when indexing PDF files

2015-12-17 Thread Allison, Timothy B.
Generally, I'd recommend opening an issue on PDFBox's Jira with the file that 
you shared.  Tika uses PDFBox...if a fix can be made there, it will propagate 
back through Tika to Solr.

That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode 
mapping for CID+71 (71) in font 505Eddc6Arial

So, if the file has no Unicode mapping for the font, I doubt they'll be able to 
fix it.

pdftotext is also unable to extract anything useful from the file.

Sorry.

Best,

Tim
-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, December 17, 2015 5:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Issues when indexing PDF files

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific 
> tools and change the encoding of the file?
> Is there any way to configure it in Solr?

Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
in a way that Tika cannot easily extract the text, there's nothing you can do 
in Solr that will help.

Unfortunately PDF isn't a content format but a presentation format - so 
extracting plain text is fraught with difficulty. You may see a character on a 
PDF page, but exactly how that character is generated (using a specific 
encoding, font, or even by drawing a picture) is outside your control. There 
are various businesses built on this premise
- they charge for creating clean extracted text from PDFs - and even they have 
trouble with some PDFs.

HTH

Charlie

>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
> 
> wrote:
>
>> They could be using custom fonts and non-Unicode characters. That's 
>> probably something to explore with PDF specific tools.
>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
>> wrote:
>>
>>> I've checked all the files which has problem with the content in the 
>>> Solr index using the Tika app. All of them shows the same issues as 
>>> what I see in the Solr index.
>>>
>>> So does the issues lies with the encoding of the file? Are we able 
>>> to
>> check
>>> the encoding of the file?
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
>>> 
>>> wrote:
>>>
 Hi Erik,

 I've shared the file on dropbox, which you can access via the link
>> here:

>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?d
>> l=0

 This is what I get from the Tika app after dropping the file in.

 Content-Length: 75092
 Content-Type: application/pdf
 Type: COSName{Info}
 X-Parsed-By: org.apache.tika.parser.DefaultParser
 X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
 X-TIKA:digest:SHA256:
 d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
 access_permission:assemble_document: true
 access_permission:can_modify: true
 access_permission:can_print: true
 access_permission:can_print_degraded: true
 access_permission:extract_content: true
 access_permission:extract_for_accessibility: true
 access_permission:fill_in_form: true
 access_permission:modify_annotations: true
 dc:format: application/pdf; version=1.3
 pdf:PDFVersion: 1.3
 pdf:encrypted: false
 producer: null
 resourceName: Desmophen+670+BAe.pdf
 xmpTPg:NPages: 3


 Regards,
 Edwin


 On 17 December 2015 at 00:15, Erik Hatcher 
>>> wrote:

> Edwin - Can you share one of those PDF files?
>
> Also, drop the file into the Tika app and see what it sees 
> directly -
>>> get
> the tika-app JAR and run that desktop application.
>
> Could be an encoding issue?
>
>  Erik
>
> —
> Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com 
> 
>
>
>
>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
>>> edwinye...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I'm using Solr 5.3.0
>>
>> I'm indexing some PDF documents. However, for certain PDF files,
>> there
> are
>> chinese text in the documents, but after indexing, what is 
>> indexed
>> in
> the
>> content is either a series of "??" or an empty content.
>>
>> I'm using the post.jar that comes together with Solr.
>>
>> What could be the reason that causes this?
>>
>> Regards,
>> Edwin
>
>

>>>
>>
>


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr 6 Distributed Join

2015-12-17 Thread Akiel Ahmed
Hi again,

I got the join to work. A team mate pointed out that one of the search 
functions in the innerJoin query was missing a field in the join - adding 
the e1 field to the fl parameter of the second search function gave the 
result I expected:

http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id,e1", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")

I am still interested in whether we can specify a join, using an arbitrary 
number of searches.

Cheers

Akiel



From:   Akiel Ahmed/UK/IBM@IBMGB
To: solr-user@lucene.apache.org
Date:   16/12/2015 17:05
Subject:Re: Solr 6 Distributed Join



Hi Dennis,

Thank you for your help. I used your explanation to construct an innerJoin
query; I think I am getting further but didn't get the results I expected.
The following describes what I did – is there any chance you can tell
where I am going wrong:

Solr 6 Developer Builds: #2738 and #2743

1. Modified server/solr/configsets/basic_configs/conf/managed-schema so it
reads:



  id
  
  
  
  
  
  
  
  
  

  
  
  
  

  


2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
adding the following near the bottom of the file so it is the last request
handler

   
 
json 
false 
 
  

3. Used solr -e cloud to setup a solr cloud instance, picking all the 
defaults except I chose basic_configs

4. After solr is running I ingested the following data via the Solr Web UI
(/update handler, Document Type = CSV)
id,type,e1,e2,text
1,ABC,,,John Smith
2,ABC,,,Jane Smith
3,ABC,,,MiKe Smith
4,ABC,,,John Doe
5,ABC,,,Jane Doe
6,ABC,,,MiKe Doe
7,ABC,,,John Smith
8,DEF,,,Chicken Burger
9,DEF,,,Veggie Burger
10,DEF,,,Beef Burger
11,DEF,,,Chicken Donar
12,DEF,,,Chips
13,DEF,,,Drink
20,GHI,1,2,Friends
21,GHI,3,4,Friends
22,GHI,5,6,Friends
23,GHI,7,6,Friends
24,GHI,6,4,Friends
25,JKL,1,8,Order
26,JKL,2,9,Order
27,JKL,3,10,Order
28,JKL,4,11,Order
29,JKL,5,12,Order
30,JKL,6,13,Order

5. Navigating to the following URL in a browser returned an expected 
result:
http://localhost:8983/solr/gettingstarted/select?q={!join from=id 
to=e1}text:John="id"


...
  

  20
  1
  2
  ...


  28
  4
  11
  ...


  23
  7
  6
  ...

  


6. Navigating to the following URL in a browser does NOT return what I 
expected:
http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted, fl="id", q=text:John, sort="id asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted, fl="id", q=text:Friends, sort="id asc",zkHost="localhost:9983",qt="/export"), on="id=e1")

{"result-set":{"docs":[
{"EOF":true,"RESPONSE_TIME":124}]}}


I also have a join related question. Is there any chance I can specify a 
query and join for more than 2 things. For example:

innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1, 
  search(gettingstarted, fl="id", q=text:Chicken, ...) as s2
  search(gettingstarted, fl="id", q=text:Friends, ...) as s3)
  on="s1.id=s3.e1", 
  on="s2.id=s3.e2")
 
Sorry if the query does not make sense, but given the data above my 
intention is to find a single result made up of 3 documents: 
s1.id=1,s2.id=8,s3.id=25
Is that possible? If yes, will Solr 6 support an arbitrary number of 
queries and associated joins?

Cheers

Akiel



From:   Dennis Gove 
To: Akiel Ahmed/UK/IBM@IBMGB, solr-user@lucene.apache.org
Date:   11/12/2015 15:34
Subject:Re: Solr 6 Distributed Join



Akiel,

Without seeing your full url I assume that you're missing the
stream=innerJoin(...) part of it. A full sample url would look like this
http://localhost:8983/solr/careers/stream?stream=innerJoin(search(careers,
fl="personId,companyId,title", q=companyId:*, sort="companyId
asc",zkHost="localhost:2181",qt="/export"),search(companies,
fl="id,companyName", q=*:*, sort="id
asc",zkHost="localhost:2181",qt="/export"),on="companyId=id")

This example will return a join of career records with the company name 
for
all career records with a non-null companyId.

And the pieces have the following meaning:
http://localhost:8983/solr/careers/stream?  - you have a collection called
careers available on localhost:8983 and you're hitting its stream handler
?stream=  - you are passing the stream parameter to the stream handler
zkHost="localhost:2181"  - there is a zk instance running on 
localhost:2181
where solr can get clusterstate information. Note, that since you're
sending the request to the careers collection this param is not required 
in
the search(careers) part but is required in the search(companies)
part. For simplicity I usually just provide it for all.
qt="/export"  - tells solr to use the export 

Problem with Solr indexing "non-searchable" pdf files

2015-12-17 Thread RICARDO EITO BRUN
Hi,
I am using Solr as part of the DSpace 5.4 application.
I have a problem when running the DSpace indexing command
(index-discovery). Most of the files are not being added to the index, and
an exception is raised.

It seems that Solr does not process the PDF files that are the result of
scanning without OCR (non-searchable PDF files).

Is there any way to tell Solr that the document metadata should be
processed even if the PDF file itself cannot be indexed?

Any suggestion on how to make the PDF files "searchable" using some kind of
batch process/tool?

Thanks in advance,
Ricardo

-- 
RICARDO EITO BRUN
Universidad Carlos III de Madrid


Re: propagate Query.rewrite call to super.rewrite after 5.4 upgrade

2015-12-17 Thread Adrien Grand
Hi Markus,

This is indeed related to LUCENE-6590: query boosts are now applied with
BoostQuery and if Query.setBoost is called on a query, its rewrite
implementation needs to rewrite to a BoostQuery. You can do that by
prepending the following to your rewrite(IndexReader) implementation:

if (getBoost() != 1f) { return super.rewrite(reader); }
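
In context, the override inside a class like GrSpanFirstQuery might look
like this sketch (the custom rewrite logic is elided; only the delegation
described above is shown):

@Override
public Query rewrite(IndexReader reader) throws IOException {
  if (getBoost() != 1f) {
    // Query.rewrite now wraps boosted queries in a BoostQuery (LUCENE-6590),
    // so delegate to the superclass before any custom rewriting.
    return super.rewrite(reader);
  }
  // ... custom rewrite logic for the unboosted case ...
  return this;
}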


Le jeu. 17 déc. 2015 à 13:23, Markus Jelsma  a
écrit :

> Hi,
>
> Apologies for the cross post. We have a class overriding
> SpanPositionRangeQuery. It is similar to a SpanFirst query but it is
> capable of adjusting the boost value with regard to distance. With the 5.4
> upgrade the unit tests suddenly threw the following exception:
>
> Query class org.GrSpanFirstQuery does not propagate Query.rewrite call to
> super.rewrite
> at
> __randomizedtesting.SeedInfo.seed([CA3D7CF96D5E8E7:88BE883E6CA09E3F]:0)
> at junit.framework.Assert.fail(Assert.java:57)
> at junit.framework.Assert.assertTrue(Assert.java:22)
> at org.apache.lucene.search.QueryUtils.check(QueryUtils.java:73)
> at
> org.apache.lucene.search.AssertingIndexSearcher.rewrite(AssertingIndexSearcher.java:83)
> at
> org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:886)
> at
> org.apache.lucene.search.AssertingIndexSearcher.createNormalizedWeight(AssertingIndexSearcher.java:58)
> at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
> at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:744)
> at
> org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:460)
> at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:489)
>
> I tracked it down to LUCENE-6590 - Explore different ways to apply boosts,
> but the solution did not really pop in my head right away. Implementing
> rewrite does not seem to change anything. Everything fails in the unit test
> at the point i want to retrieve docs and assert their positions in the
> result set: ScoreDoc[] docs = searcher.search(spanfirstquery, 10).scoreDocs;
>
> I am probably missing something but any ideas to share?
>
> Many thanks!
> Markus
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: faceting is unusable slow since upgrade to 5.3.0

2015-12-17 Thread Yonik Seeley
On Wed, Dec 16, 2015 at 4:57 AM, Vincenzo D'Amore  wrote:
> Hi all,
>
> given that solr 5.4 is finally released, is this what's more stable and
> efficient version of solrcloud ?
>
> I have a website which receives many search requests. It serve normally
> about 2000 concurrent requests, but sometime there are peak from 4000 to
> 1 requests in few seconds.
>
> On January I'll have a chance to upgrade my old SolrCloud 4.8.1 cluster to
> a new brand version, but following this thread I read about the problems
> that can occur upgrading to latest version.
>
> I have seen that issue SOLR-7730 "speed-up faceting on doc values fields"
> is fixed in 5.4.
>
> I'm using standard faceting without docValues. Should I add docValues in
> order to benefit of such fix?

You'll have to try it I think...
DocValues have a lot of advantages (much less heap consumption, and
much smaller overhead when opening a new searcher), but they can often
be slower as well.

Comparing 4x to 5x non-docvalues, top-level field caches were removed
by lucene, and while that benefits certain things like NRT (opening a
new searcher very often), it will hurt performance for other
configurations.

The JSON Facet API currently allows you to pick your strategy via the
"method" param for multi-valued string fields without docvalues:
"uif" (UninvertedField) gets you the top-level strategy from Solr 4,
while "dv" (DocValues built on-the-fly) gets you the NRT-friendly
"per-segment" strategy.

-Yonik


SolR 5.3.1 deletes index files

2015-12-17 Thread Moll, Dr. Andreas
Hi,

we have been using Solr for some years now and are currently switching from Solr 3.6
to 5.3.1.
Solr 5.3.1 deletes all index files when it shuts down if there were external
changes to the index files
(in our case from a second Solr server which produces the index).

Is this behaviour intentional?
Can Solr be configured to handle the index files in a read-only mode?

Thanks and best regards

Andreas Moll




Re: Solr 6 Distributed Join

2015-12-17 Thread Joel Bernstein
The innerJoin joins two streams sorted by the same join keys (merge join).
If the third stream has the same join keys you can nest innerJoins. But all
three streams need to be sorted by the same join keys to nest innerJoins
(merge joins).

innerJoin(innerJoin(...),
search(...),
on...)

If the third stream is joined on a different key you can nest inside a
hashJoin which doesn't require streams to be sorted on the join key. For
example:

hashJoin(innerJoin(...),
hashed=search(...),
on..)


Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed  wrote:

> Hi again,
>
> I got the join to work. A team mate pointed out that one of the search
> functions in the innerJoin query was missing a field in the join - adding
> the e1 field to the fl parameter of the second search function gave the
> result I expected:
>
>
> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted
>
> , fl="id", q=text:John, sort="id
> asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,
> fl="id,e1", q=text:Friends, sort="id
> asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
>
> I am still interested in whether we can specify a join, using an arbitrary
> number of searches.
>
> Cheers
>
> Akiel
>
>
>
> From:   Akiel Ahmed/UK/IBM@IBMGB
> To: solr-user@lucene.apache.org
> Date:   16/12/2015 17:05
> Subject:Re: Solr 6 Distributed Join
>
>
>
> Hi Dennis,
>
> Thank you for your help. I used your explanation to construct an innerJoin
>
> query; I think I am getting further but didn't get the results I expected.
>
> The following describes what I did – is there any chance you can tell
> where I am going wrong:
>
> Solr 6 Developer Builds: #2738 and #2743
>
> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so it
>
> reads:
>
> 
> 
>   id
>multiValued="false" docValues="true"/>
>   
> required="false" multiValued="false" docValues="true"/>
>required="false" multiValued="false" docValues="true"/>
>   
> multiValued="false" docValues="true"/>
>   
> multiValued="false" docValues="true"/>
>required="false" multiValued="false"/>
>   
>precisionStep="0" positionIncrementGap="0"/>
>positionIncrementGap="100">
> 
>   
>   
>generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>words="lang/stopwords_en.txt"/>
> 
>   
> 
>
> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
> adding the following near the bottom of the file so it is the last request
>
> handler
>
>   
> 
> json
> false
> 
>   
>
> 3. Used solr -e cloud to setup a solr cloud instance, picking all the
> defaults except I chose basic_configs
>
> 4. After solr is running I ingested the following data via the Solr Web UI
>
> (/update handler, Document Type = CSV)
> id,type,e1,e2,text
> 1,ABC,,,John Smith
> 2,ABC,,,Jane Smith
> 3,ABC,,,MiKe Smith
> 4,ABC,,,John Doe
> 5,ABC,,,Jane Doe
> 6,ABC,,,MiKe Doe
> 7,ABC,,,John Smith
> 8,DEF,,,Chicken Burger
> 9,DEF,,,Veggie Burger
> 10,DEF,,,Beef Burger
> 11,DEF,,,Chicken Donar
> 12,DEF,,,Chips
> 13,DEF,,,Drink
> 20,GHI,1,2,Friends
> 21,GHI,3,4,Friends
> 22,GHI,5,6,Friends
> 23,GHI,7,6,Friends
> 24,GHI,6,4,Friends
> 25,JKL,1,8,Order
> 26,JKL,2,9,Order
> 27,JKL,3,10,Order
> 28,JKL,4,11,Order
> 29,JKL,5,12,Order
> 30,JKL,6,13,Order
>
> 5. Navigating to the following URL in a browser returned an expected
> result:
> http://localhost:8983/solr/gettingstarted/select?q={!join from=id
> to=e1}text:John="id"
>
> 
> ...
>   
> 
>   20
>   1
>   2
>   ...
> 
> 
>   28
>   4
>   11
>   ...
> 
> 
>   23
>   7
>   6
>   ...
> 
>   
> 
>
> 6. Navigating to the following URL in a browser does NOT return what I
> expected:
>
> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted
>
> , fl="id", q=text:John, sort="id
> asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,
> fl="id", q=text:Friends, sort="id
> asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
>
> {"result-set":{"docs":[
> {"EOF":true,"RESPONSE_TIME":124}]}}
>
>
> I also have a join related question. Is there any chance I can specify a
> query and join for more than 2 things. For example:
>
> innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1,
>   search(gettingstarted, fl="id", q=text:Chicken, ...) as s2
>   search(gettingstarted, fl="id", q=text:Friends, ...) as s3)
>   on="s1.id=s3.e1",
>   on="s2.id=s3.e2")
>
> Sorry if the query does not make sense, but given the data above my
> intention is to find a single result made up of 3 documents:
> s1.id=1,s2.id=8,s3.id=25
> Is that possible? If yes, will Solr 6 support an arbitrary number of
> queries 

Re: SolR 5.3.1 deletes index files

2015-12-17 Thread Shawn Heisey
On 12/17/2015 8:00 AM, Moll, Dr. Andreas wrote:
> we have been using Solr for some years now and are currently switching from Solr
> 3.6 to 5.3.1.
> Solr 5.3.1 deletes all index files when it shuts down if there were external
> changes to the index files
> (in our case from a second Solr server which produces the index).

I have *never* seen Solr delete the index files without outside
influence.  Either there's a misconfiguration, or something in your
environment is doing the delete.

If the DirectoryFactory were changed to use RAMDirectoryFactory, then
all the data would be in memory, and that would be purged on shutdown,
because it doesn't exist anywhere else.

Assuming you're using a standard directory implementation that puts
files on the disk, there is only one feature that I'm aware of that can
automatically delete information -- it's possible to index documents
with an expiration date, so they're automatically deleted once the
expiration date is reached.  I would not expect this to delete
everything on shutdown, though.

To figure out what's going on, we will need information about your
server, exactly how you installed Solr, how it is started, how it is
stopped, etc.

Thanks,
Shawn



Re: Issues when indexing PDF files

2015-12-17 Thread Walter Underwood
PDF isn’t really text. For example, it doesn’t have spaces, it just moves the 
next letter over farther. Letters might not be in reading order — two column 
text could be printed as horizontal scans. Custom fonts might not use an 
encoding that matches Unicode, which makes them encrypted (badly). And so on.

As one of my coworkers said, trying to turn a PDF into structured text is like 
trying to turn hamburger back into a cow.

PDF is where text goes to die.

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 17, 2015, at 2:48 AM, Charlie Hull  wrote:
> 
> On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
>> Hi Alexandre,
>> 
>> Thanks for your reply.
>> 
>> So the only way to solve this issue is to explore with PDF specific tools
>> and change the encoding of the file?
>> Is there any way to configure it in Solr?
> 
> Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
> in a way that Tika cannot easily extract the text, there's nothing you can do 
> in Solr that will help.
> 
> Unfortunately PDF isn't a content format but a presentation format - so 
> extracting plain text is fraught with difficulty. You may see a character on 
> a PDF page, but exactly how that character is generated (using a specific 
> encoding, font, or even by drawing a picture) is outside your control. There 
> are various businesses built on this premise - they charge for creating clean 
> extracted text from PDFs - and even they have trouble with some PDFs.
> 
> HTH
> 
> Charlie
> 
>> 
>> Regards,
>> Edwin
>> 
>> 
>> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
>> wrote:
>> 
>>> They could be using custom fonts and non-Unicode characters. That's
>>> probably something to explore with PDF specific tools.
>>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
>>> wrote:
>>> 
 I've checked all the files which has problem with the content in the Solr
 index using the Tika app. All of them shows the same issues as what I see
 in the Solr index.
 
 So does the issues lies with the encoding of the file? Are we able to
>>> check
 the encoding of the file?
 
 
 Regards,
 Edwin
 
 
 On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
 wrote:
 
> Hi Erik,
> 
> I've shared the file on dropbox, which you can access via the link
>>> here:
> 
>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> 
> This is what I get from the Tika app after dropping the file in.
> 
> Content-Length: 75092
> Content-Type: application/pdf
> Type: COSName{Info}
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> X-TIKA:digest:SHA256:
> d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> access_permission:assemble_document: true
> access_permission:can_modify: true
> access_permission:can_print: true
> access_permission:can_print_degraded: true
> access_permission:extract_content: true
> access_permission:extract_for_accessibility: true
> access_permission:fill_in_form: true
> access_permission:modify_annotations: true
> dc:format: application/pdf; version=1.3
> pdf:PDFVersion: 1.3
> pdf:encrypted: false
> producer: null
> resourceName: Desmophen+670+BAe.pdf
> xmpTPg:NPages: 3
> 
> 
> Regards,
> Edwin
> 
> 
> On 17 December 2015 at 00:15, Erik Hatcher 
 wrote:
> 
>> Edwin - Can you share one of those PDF files?
>> 
>> Also, drop the file into the Tika app and see what it sees directly -
 get
>> the tika-app JAR and run that desktop application.
>> 
>> Could be an encoding issue?
>> 
>> Erik
>> 
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com 
>> 
>> 
>> 
>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
 edwinye...@gmail.com>
>> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm using Solr 5.3.0
>>> 
>>> I'm indexing some PDF documents. However, for certain PDF files,
>>> there
>> are
>>> chinese text in the documents, but after indexing, what is indexed
>>> in
>> the
>>> content is either a series of "??" or an empty content.
>>> 
>>> I'm using the post.jar that comes together with Solr.
>>> 
>>> What could be the reason that causes this?
>>> 
>>> Regards,
>>> Edwin
>> 
>> 
> 
 
>>> 
>> 
> 
> 
> -- 
> Charlie Hull
> Flax - Open Source Enterprise Search
> 
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk



Re: Solr 6 Distributed Join

2015-12-17 Thread Joel Bernstein
Below is an example of nested joins where the innerJoin is done in parallel
using the parallel function. The partitionKeys parameter needs to be added
to the searches when the parallel function is used to partition the results
across worker nodes.

hashJoin(
  parallel(workerCollection,
    innerJoin(
      search(users, q="*:*", fl="userId, full_name, hometown",
             sort="userId asc", zkHost="zk2:2345", qt="/export",
             partitionKeys="userId"),
      search(reviews, q="*:*", fl="userId, review, score",
             sort="userId asc", zkHost="zk1:2345", qt="/export",
             partitionKeys="userId"),
      on="userId"
    ),
    workers="20",
    zkHost="zk1:2345",
    sort="userId asc"
  ),
  hashed=search(restaurants, q="city:nyc", fl="restaurantId, restaurantName",
                sort="restaurantId asc", zkHost="zk1:2345", qt="/export"),
  on="restaurantId"
)


Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein  wrote:

> The innerJoin joins two streams sorted by the same join keys (merge join).
> If the third stream has the same join keys you can nest innerJoins. But all
> three streams need to be sorted by the same join keys to nest innerJoins
> (merge joins).
>
> innerJoin(innerJoin(...),
> search(...),
> on...)
>
> If the third stream is joined on a different key you can nest inside a
> hashJoin which doesn't require streams to be sorted on the join key. For
> example:
>
> hashJoin(innerJoin(...),
> hashed=search(...),
> on..)
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed  wrote:
>
>> Hi again,
>>
>> I got the join to work. A team mate pointed out that one of the search
>> functions in the innerJoin query was missing a field in the join - adding
>> the e1 field to the fl parameter of the second search function gave the
>> result I expected:
>>
>>
>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted
>>
>> , fl="id", q=text:John, sort="id
>> asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,
>> fl="id,e1", q=text:Friends, sort="id
>> asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
>>
>> I am still interested in whether we can specify a join, using an arbitrary
>> number of searches.
>>
>> Cheers
>>
>> Akiel
>>
>>
>>
>> From:   Akiel Ahmed/UK/IBM@IBMGB
>> To: solr-user@lucene.apache.org
>> Date:   16/12/2015 17:05
>> Subject:Re: Solr 6 Distributed Join
>>
>>
>>
>> Hi Dennis,
>>
>> Thank you for your help. I used your explanation to construct an innerJoin
>>
>> query; I think I am getting further but didn't get the results I expected.
>>
>> The following describes what I did – is there any chance you can tell me
>> where I am going wrong?
>>
>> Solr 6 Developer Builds: #2738 and #2743
>>
>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so it
>>
>> reads:
>>
>> 
>> 
>>   id
>>   > multiValued="false" docValues="true"/>
>>   >
>> required="false" multiValued="false" docValues="true"/>
>>   > required="false" multiValued="false" docValues="true"/>
>>   >
>> multiValued="false" docValues="true"/>
>>   >
>> multiValued="false" docValues="true"/>
>>   > required="false" multiValued="false"/>
>>   
>>   > precisionStep="0" positionIncrementGap="0"/>
>>   > positionIncrementGap="100">
>> 
>>   
>>   
>>   > generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>>   > words="lang/stopwords_en.txt"/>
>> 
>>   
>> 
>>
>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
>> adding the following near the bottom of the file so it is the last request
>>
>> handler
>>
>>   <requestHandler name="/stream" class="solr.StreamHandler">
>> <lst name="invariants">
>>   <str name="wt">json</str>
>>   <str name="distrib">false</str>
>> </lst>
>>   </requestHandler>
>>
>> 3. Used solr -e cloud to setup a solr cloud instance, picking all the
>> defaults except I chose basic_configs
>>
>> 4. After solr is running I ingested the following data via the Solr Web UI
>>
>> (/update handler, Document Type = CSV)
>> id,type,e1,e2,text
>> 1,ABC,,,John Smith
>> 2,ABC,,,Jane Smith
>> 3,ABC,,,MiKe Smith
>> 4,ABC,,,John Doe
>> 5,ABC,,,Jane Doe
>> 6,ABC,,,MiKe Doe
>> 7,ABC,,,John Smith
>> 8,DEF,,,Chicken Burger
>> 9,DEF,,,Veggie Burger
>> 10,DEF,,,Beef Burger
>> 11,DEF,,,Chicken Donar
>> 12,DEF,,,Chips
>> 13,DEF,,,Drink
>> 20,GHI,1,2,Friends
>> 21,GHI,3,4,Friends
>> 22,GHI,5,6,Friends
>> 23,GHI,7,6,Friends
>> 24,GHI,6,4,Friends
>> 25,JKL,1,8,Order
>> 26,JKL,2,9,Order
>> 27,JKL,3,10,Order
>> 28,JKL,4,11,Order
>> 29,JKL,5,12,Order
>> 30,JKL,6,13,Order
>>
>> 5. Navigating to the 

Re: Solr 6 Distributed Join

2015-12-17 Thread Joel Bernstein
One thing to note about the hashJoin is that it requires the search results
from the hashed query to fit entirely in memory.

The innerJoin does not have this requirement as it performs a streaming
merge join.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 17, 2015 at 10:33 AM, Joel Bernstein  wrote:

> Below is an example of nested joins where the innerJoin is done in
> parallel using the parallel function. The partitionKeys parameter needs to
> be added to the searches when the parallel function is used to partition
> the results across worker nodes.
>
> hashJoin(
>   parallel(workerCollection,
>     innerJoin(
>       search(users, q="*:*",
>         fl="userId, full_name, hometown", sort="userId asc",
>         zkHost="zk2:2345", qt="/export", partitionKeys="userId"),
>       search(reviews, q="*:*",
>         fl="userId, review, score", sort="userId asc",
>         zkHost="zk1:2345", qt="/export", partitionKeys="userId"),
>       on="userId"
>     ),
>     workers="20",
>     zkHost="zk1:2345",
>     sort="userId asc"
>   ),
>   hashed=search(restaurants, q="city:nyc",
>     fl="restaurantId, restaurantName",
>     sort="restaurantId asc", zkHost="zk1:2345", qt="/export"),
>   on="restaurantId"
> )
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein 
> wrote:
>
>> The innerJoin joins two streams sorted by the same join keys (merge
>> join). If a third stream has the same join keys, you can nest innerJoins. But
>> all three tables need to be sorted by the same join keys to nest innerJoins
>> (merge joins).
>>
>> innerJoin(innerJoin(...),
>> search(...),
>> on...)
>>
>> If the third stream is joined on a different key you can nest inside a
>> hashJoin which doesn't require streams to be sorted on the join key. For
>> example:
>>
>> hashJoin(innerJoin(...),
>> hashed=search(...),
>> on..)
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed  wrote:
>>
>>> Hi again,
>>>
>>> I got the join to work. A team mate pointed out that one of the search
>>> functions in the innerJoin query was missing a field in the join - adding
>>> the e1 field to the fl parameter of the second search function gave the
>>> result I expected:
>>>
>>>
>>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted
>>>
>>> , fl="id", q=text:John, sort="id
>>> asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,
>>> fl="id,e1", q=text:Friends, sort="id
>>> asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
>>>
>>> I am still interested in whether we can specify a join, using an
>>> arbitrary
>>> number of searches.
>>>
>>> Cheers
>>>
>>> Akiel
>>>
>>>
>>>
>>> From:   Akiel Ahmed/UK/IBM@IBMGB
>>> To: solr-user@lucene.apache.org
>>> Date:   16/12/2015 17:05
>>> Subject:Re: Solr 6 Distributed Join
>>>
>>>
>>>
>>> Hi Dennis,
>>>
>>> Thank you for your help. I used your explanation to construct an
>>> innerJoin
>>>
>>> query; I think I am getting further but didn't get the results I
>>> expected.
>>>
>>> The following describes what I did – is there any chance you can tell me
>>> where I am going wrong?
>>>
>>> Solr 6 Developer Builds: #2738 and #2743
>>>
>>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so
>>> it
>>>
>>> reads:
>>>
>>> 
>>> 
>>>   id
>>>   >> multiValued="false" docValues="true"/>
>>>   >> stored="true"
>>>
>>> required="false" multiValued="false" docValues="true"/>
>>>   >> required="false" multiValued="false" docValues="true"/>
>>>   >> required="false"
>>>
>>> multiValued="false" docValues="true"/>
>>>   >> required="false"
>>>
>>> multiValued="false" docValues="true"/>
>>>   >> required="false" multiValued="false"/>
>>>   
>>>   >> precisionStep="0" positionIncrementGap="0"/>
>>>   >> positionIncrementGap="100">
>>> 
>>>   
>>>   
>>>   >> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>> catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
>>>   >> words="lang/stopwords_en.txt"/>
>>> 
>>>   
>>> 
>>>
>>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
>>> adding the following near the bottom of the file so it is the last
>>> request
>>>
>>> handler
>>>
>>>   <requestHandler name="/stream" class="solr.StreamHandler">
>>> <lst name="invariants">
>>>   <str name="wt">json</str>
>>>   <str name="distrib">false</str>
>>> </lst>
>>>   </requestHandler>
>>>
>>> 3. Used solr -e cloud to setup a solr cloud instance, picking all the
>>> defaults except I chose basic_configs
>>>
>>> 4. After solr is running I ingested the following data via the Solr Web
>>> UI
>>>
>>> 

Re: API accessible without authentication even though Basic Auth Plugin is enabled

2015-12-17 Thread tine-2
Noble Paul നോബിള്‍  नोब्ळ् wrote
> It works as designed.
> 
> Protect the read path [...]

Works as described in 5.4.0; it didn't work in 5.3.1, see
https://issues.apache.org/jira/browse/SOLR-8408



--
View this message in context: 
http://lucene.472066.n3.nabble.com/API-accessible-without-authentication-even-though-Basic-Auth-Plugin-is-enabled-tp4244940p4246099.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow query response.

2015-12-17 Thread Jack Krupansky
A single query with tens of thousands of terms is very clearly a misuse of
Solr. If it happens to work at all, consider yourself lucky. Are you using
a standard Solr query parser, or the terms query parser that lets you write
a raw list of terms to OR?
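
For reference, a minimal sketch of the terms query parser (collection name and
values are placeholders; the field name f is taken from the question below):

curl "http://localhost:8983/solr/collection1/select?rows=0&q=*:*&fq={!terms f=f}val1,val2,val3"

The comma-separated list is turned into a single set-membership filter instead
of tens of thousands of parsed boolean clauses, which is usually far cheaper.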

Are your nodes CPU-bound or I/O-bound during those 50-second intervals? My
bet is that your index does not fit fully in memory, causing lots of I/O to
repeatedly page in portions of the index and probably additional CPU usage
as well.

How many rows are you returning on each query? Are you using all these
terms just to filter a smaller query or to return a large bulk of documents?


-- Jack Krupansky

On Thu, Dec 17, 2015 at 7:01 AM, Modassar Ather 
wrote:

> Hi,
>
> I have a field f which is defined as follows.
>  omitNorms="true"/>
>
> Solr-5.2.1 is used. The index is spread across 12 shards (no replica) and
> the index size on each node is around 100 GB.
>
> When I search for 50 thousand values (ORed) in the field f it takes
> around 45 to 55 seconds.
> Per my understanding it is too slow. Kindly share your thoughts on this
> behavior and provide your suggestions.
>
> Thanks,
> Modassar
>


RE: Use multiple istance simultaneously

2015-12-17 Thread Gian Maria Ricci - aka Alkampfer
Hi,

I have a quick question on ZooKeeper: how can I run ZooKeeper as a service on
Linux so it autostarts if the instance is rebooted? The only information I've
found on the internet is this link
http://positivealex.github.io/blog/posts/how-to-install-zookeeper-as-service-on-centos
and it seems to be slightly out of date.
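
For what it's worth, on a systemd-based distribution (RHEL/CentOS 7 and newer)
a small unit file is usually all it takes. A sketch, assuming ZooKeeper is
installed under /opt/zookeeper and runs as a dedicated zookeeper user:

# /etc/systemd/system/zookeeper.service
[Unit]
Description=Apache ZooKeeper
After=network.target

[Service]
Type=simple
User=zookeeper
ExecStart=/opt/zookeeper/bin/zkServer.sh start-foreground
Restart=on-failure

[Install]
WantedBy=multi-user.target

After "systemctl daemon-reload" and "systemctl enable zookeeper", the service
comes back automatically on reboot.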

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: outlook_288fbf38c031d...@outlook.com 
[mailto:outlook_288fbf38c031d...@outlook.com] On Behalf Of Gian Maria Ricci - 
aka Alkampfer
Sent: Saturday, 12 December 2015 11:39
To: solr-user@lucene.apache.org
Subject: RE: Use multiple istance simultaneously

Thanks a lot for all the clarifications.

Actually resources are not a big problem; I think the customer can afford 4 GB RAM
Red Hat Linux machines for ZooKeeper. The Solr machines will have 64 or 96 GB of
RAM in production, depending on the size of the index.

My primary concern is maintenance of the structure. With single independent 
machines, the situation is trivial, we can stop solr on one of the machine 
during the night, and issue a full backup of the indexes. With a full backup of 
the indexes, rebuilding a machine from scratch in case of disaster is simple, 
just spin off a new Virtual machine, restore the backup, restart solr and 
everything is ok.

If for any reason the SolrCloud cluster stops working, restoring everything is
somewhat more complicated. Are there any best practices for backing up
everything in SolrCloud so we can restore the entire cluster if anything goes wrong?

Thanks a lot for the interesting discussion and for the really useful 
information you gave me.

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Friday, 11 December 2015 17:11
To: solr-user@lucene.apache.org
Subject: Re: Use multiple istance simultaneously

On 12/11/2015 8:19 AM, Gian Maria Ricci - aka Alkampfer wrote:
> Thanks for all of your clarification. I know that solrcloud is a 
> really better configuration than any other, but actually it has a 
> complexity that is really higher. I just want to give you the pain 
> point I've noticed while I was gathering all the info I can got on SolrCloud.
> 
> 1) zookeeper documentation says that to have the best experience you
> should have a dedicated filesystem for the persistence and it should
> never swap to disk. I've not found any guidelines on how I should
> size the zookeeper machine: how much RAM, how much disk? Can I install
> zookeeper on the same machines where Solr resides? (I suspect not,
> because the Solr machines are under stress, and if zookeeper starts
> swapping it can lead to problems.)

Standalone zookeeper doesn't require much in the way of resources.
Unless the SolrCloud installation is enormous, a machine with 1-2GB of RAM is 
probably plenty, if the only thing it is doing is zookeeper and it's not 
running Windows.  If the SolrCloud install has a lot of collections, shards, 
and/or servers, then you might need more, because the zookeeper database will 
be larger.

> 2) What about updates? If I need to update my solrcloud instance
> and the new version requires a new version of zookeeper, which is the
> path to take? Do I need to update zookeeper first, or upgrade solr on the
> existing machines first?
> Maybe I did not search well, but I did not find a comprehensive
> guideline that tells me how to upgrade my SolrCloud installation in various
> situations.

If you're following recommendations and using standalone zookeeper, then 
upgrading it is entirely separate from upgrading Solr.  It's probably a good 
idea to upgrade your three (or more) zookeeper servers first.

Here's a FAQ entry from zookeeper about upgrades:

https://wiki.apache.org/hadoop/ZooKeeper/FAQ#A6

> 3) What are the best practices to run DIH in solrcloud? I think I can
> round-robin DIH imports across the different servers composing the
> cloud infrastructure, or is there a better way to go? (I probably need
> to trigger a DIH run every 5/10 minutes, but the number of new records is
> really small.)

When checking the status of an import, you must send the status request to the 
same machine where you sent the command to start the import.

If you're only ever going to run one DIH at a time, then I don't see any reason 
to involve multiple servers.  If you want to run more than one simultaneously, 
then you might want to run each one on a different machine.

> 4) Since I believe that it is not best practice to install zookeeper 
> on same SolrMachine (as separated process, not the built in 
> zookeeper), I need at least three more machine to maintain / monitor / 
> upgrade and I need also to monitor zookeeper, a new appliance that 
> need to be mastered by IT Infrastructure.

The only real reason to avoid zookeeper and Solr on the same machine is 
performance under high load, and mostly that comes down to I/O performance, so 
if you can put zookeeper on a separate set of disks, you're 

Re: Problem with Solr indexing "non-searchable" pdf files

2015-12-17 Thread Erick Erickson
Not sure how much help I can be, I have no clue what DSpace is
doing with Solr.

If you're willing to try to index straight to Solr, you can always use
SolrJ to parse the files, it's actually not very hard. Here's an example:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

some database stuff is mixed in there, but that can be removed.
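
A trimmed-down sketch of that approach, assuming Solr 5.x SolrJ plus the Tika
parsers are on the classpath; the URL and field names are placeholders for
your own core and schema:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class IndexPdfs {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");
    AutoDetectParser parser = new AutoDetectParser();
    for (String path : args) {
      try {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no size limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(path)) {
          parser.parse(in, handler, metadata); // Tika runs in your JVM, not Solr's
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path);
        doc.addField("content", handler.toString()); // extracted body text
        solr.add(doc);
      } catch (Exception e) {
        System.err.println("Skipping " + path + ": " + e); // a bad PDF doesn't kill the run
      }
    }
    solr.commit();
    solr.close();
  }
}

Because the extraction happens client-side, a PDF that Tika cannot parse only
skips that one document instead of failing the whole indexing job.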

Otherwise, perhaps the DSpace folks have more guidance on
what/how they expect to do with PDFs.

Best,
Erick

On Thu, Dec 17, 2015 at 6:54 AM, RICARDO EITO BRUN  wrote:
> Hi,
> I am using SOLR as part of the dspace 5.4 SW application.
> I have a problem when running the dspace indexing command
> (index-discovery). Most of the files are not being added to the index, and
> an exception is raised.
>
> It seems that Solr does not process the PDF files that are the result of
> scanning without OCR (non-searchable PDF files).
>
> Is there any way to tell Solr that the document metadata should be
> processed even if the PDF file itself cannot be indexed?
>
> Any suggestion on how to make the pdf files "searchable" using some kind of
> batch process/tool?
>
> Thanks in advance,
> Ricardo
>
> --
> RICARDO EITO BRUN
> Universidad Carlos III de Madrid


Re: Strange debug output for a slow query

2015-12-17 Thread Shawn Heisey
On 12/16/2015 9:08 PM, Erick Erickson wrote:
> Hmmm, take a look at the individual queries on a shard, i.e. peek at
> the Solr logs and see if the fq clause comes through cleanly when you
> see =false. I suspect this is just a glitch in assembling the
> debug response. If it is, it probably deserves a JIRA. In fact it
> deserves a JIRA in either case I think.
> 
> I don't see anything obvious, but your statement "when the caches are
> cold" points to autowarming as your culprit. What to you have set up
> for autowarming in your caches? And do you have any newSearcher or
> firstSearcher events defined?

Both the main query and the shard queries in the log look fine -- only
one copy of the filters is present.  It does look like a problem with
the debug info gathering.

There are no firstSearcher or newSearcher events.  Some time ago, I did
have the same query defined for each (*:* with a sort), but those have
been commented.

These are my cache definitions:

  <filterCache class="solr.FastLRUCache"
size="64"
initialSize="64"
autowarmCount="4"
cleanupThread="true"
showItems="true"
  />

  <queryResultCache class="solr.FastLRUCache"
size="512"
initialSize="512"
autowarmCount="4"
cleanupThread="true"
  />

  <documentCache class="solr.FastLRUCache"
size="16384"
initialSize="4096"
cleanupThread="true"
  />

On the dev server running a 5.3.2 snapshot, the last time a searcher was
opened on one of my large shards, filterCache took 4626ms to warm and
queryResultCache took 768ms.  Total warmup time for the searcher was 5394ms.

In production (4.9.1), the warmup times are worse.  On one of the
shards, total searcher warmup is 20943ms, filterCache is 6740ms, and
queryResultCache is 14202ms.  One difference in config -- autowarmCount on
queryResultCache is 8.

Because the autowarmCount values are low and still resulting in high
autowarm times, it's looking like a general performance issue.  I think
I may be having the problem I'm always telling other people about -- not
enough memory in the server for the OS disk cache.  There is about 150GB
of index data on each production server, 64GB of total RAM, and Solr has
an 8GB heap.  The index is always growing, so I think I may have hit one
of those thresholds where performance drops dramatically.

The production servers are maxed on memory and are each handling half
the large shards, so I think this simply means that I need more hardware
so that there is more total memory.  If I add another server to each
production copy, then each one will only need to handle a third of the
total index instead of half.  Each one will have about 100GB of index
data instead of 150GB.

I am also hoping that upgrading from 4.9.1 to the 5.3.2 snapshot will
increase performance.

Something I will try right now is bumping the heap to 9GB to see if
maybe there's heap starvation.  Based on the GC logs, I do not think
this is the problem.

Any other thoughts?

Thanks,
Shawn



Re: Strange debug output for a slow query

2015-12-17 Thread Erick Erickson
Yeah, if your warmup times are that long, then you're likely
having lots of disk I/O contention or something similar. That said, you've
mentioned that after a while the queries are fine.

That indicates to me that you aren't autowarming _enough_ and
that your slow queries are not pre-loading parts of your index into
memory.

Of course you may be bottlenecking on memory or something else,
your idea of upping the memory would be the first thing I'd test. Although
you should be able to generate GC logs and if it is the case that you are
spending all your time GCing it should be clear from those logs.
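
For the record, a typical set of HotSpot GC-logging flags for the Java 7/8 era
looks like the following (the log path is an assumption):

-verbose:gc -Xloggc:/var/solr/logs/solr_gc.log \
-XX:+PrintGCDetails -XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime

Long entries reported as "Total time for which application threads were
stopped" are the telltale sign that GC, rather than disk I/O, is the
bottleneck.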

And certainly you may simply have outgrown your hardware.

Best,
Erick

On Thu, Dec 17, 2015 at 8:49 AM, Shawn Heisey  wrote:
> On 12/16/2015 9:08 PM, Erick Erickson wrote:
>> Hmmm, take a look at the individual queries on a shard, i.e. peek at
>> the Solr logs and see if the fq clause comes through cleanly when you
>> see =false. I suspect this is just a glitch in assembling the
>> debug response. If it is, it probably deserves a JIRA. In fact it
>> deserves a JIRA in either case I think.
>>
>> I don't see anything obvious, but your statement "when the caches are
>> cold" points to autowarming as your culprit. What to you have set up
>> for autowarming in your caches? And do you have any newSearcher or
>> firstSearcher events defined?
>
> Both the main query and the shard queries in the log look fine -- only
> one copy of the filters is present.  It does look like a problem with
> the debug info gathering.
>
> There are no firstSearcher or newSearcher events.  Some time ago, I did
> have the same query defined for each (*:* with a sort), but those have
> been commented.
>
> These are my cache definitions:
>
>   <filterCache class="solr.FastLRUCache"
> size="64"
> initialSize="64"
> autowarmCount="4"
> cleanupThread="true"
> showItems="true"
>   />
>
>   <queryResultCache class="solr.FastLRUCache"
> size="512"
> initialSize="512"
> autowarmCount="4"
> cleanupThread="true"
>   />
>
>   <documentCache class="solr.FastLRUCache"
> size="16384"
> initialSize="4096"
> cleanupThread="true"
>   />
>
> On the dev server running a 5.3.2 snapshot, the last time a searcher was
> opened on one of my large shards, filterCache took 4626ms to warm and
> queryResultCache took 768ms.  Total warmup time for the searcher was 5394ms.
>
> In production (4.9.1), the warmup times are worse.  On one of the
> shards, total searcher warmup is 20943ms, filterCache is 6740ms, and
> queryResultCache is 14202ms.  One difference in config -- autowarmCount on
> queryResultCache is 8.
>
> Because the autowarmCount values are low and still resulting in high
> autowarm times, it's looking like a general performance issue.  I think
> I may be having the problem I'm always telling other people about -- not
> enough memory in the server for the OS disk cache.  There is about 150GB
> of index data on each production server, 64GB of total RAM, and Solr has
> an 8GB heap.  The index is always growing, so I think I may have hit one
> of those thresholds where performance drops dramatically.
>
> The production servers are maxed on memory and are each handling half
> the large shards, so I think this simply means that I need more hardware
> so that there is more total memory.  If I add another server to each
> production copy, then each one will only need to handle a third of the
> total index instead of half.  Each one will have about 100GB of index
> data instead of 150GB.
>
> I am also hoping that upgrading from 4.9.1 to the 5.3.2 snapshot will
> increase performance.
>
> Something I will try right now is bumping the heap to 9GB to see if
> maybe there's heap starvation.  Based on the GC logs, I do not think
> this is the problem.
>
> Any other thoughts?
>
> Thanks,
> Shawn
>


Re: Solr Basic Configuration - Highlight - Begginer

2015-12-17 Thread Erick Erickson
I just tried it (admittedly using just a simple text input, obviously not a
PDF file) and it works perfectly, as I'd expect.

So a couple of things:
1> what happens if you highlight the content field? The text field
should be fine.
2> Did you completely blow away your index whenever you changed the
schema file? As in "rm -rf data" where the "data" directory is the
parent of "index"?
3> I'd consider backing off a bit and start with the standard
"techproducts" example and get highlighting to work _there_ first. My
guess is that there's something you're doing that I don't know to ask
about specifically with the PDF conversions.

er...@baffled.com

On Thu, Dec 17, 2015 at 3:00 AM, Evert R.  wrote:
> Hello Teague,
>
> Thanks for your reply and tip! I think Solr will give me a better result
> than just using Tika to read up my files and send to a Fulltext Index in my
> MySQL, which has the precise point of not highlighting the text snippets...
>
> So, I will keep on trying to fix Solr to my needs, and sure it works... I
> am missing something.
>
> Thanks again and I will keep on track.
>
> When I find the solution I will post all files and configs here for future
> references.
>
> Best regards,
>
> *Evert*
>
> 2015-12-17 6:11 GMT-02:00 Teague James :
>
>> Erik's comments not withstanding, there are some gaps in my understanding
>> of your precise situation. Here's a few things that weren't necessarily
>> obvious to me when I took my first try with Solr.
>>
>> Highlighting is the end result of a good hit. It is essentially formatting
>> applied to your hit. It is possible to get a hit without a highlight if
>> certain conditions exist.
>>
>> First, start by making sure you are indexing your target (a PDF file?)
>> correctly. Assuming you are indexing PDFs, are you extracting meta data
>> only or are you parsing the document with Tika? If you want hits on the
>> contents of your PDF, then you have to parse it at index time and store
>> that.That was why I suggested just running some queries through the
>> interface and the URL to see what Solr actually captured from your indexed
>> PDF before worrying about how it looks on the screen.
>>
>> Next, you should look carefully at the Analyzer's output. Notice the
>> abbreviations to the left of the columns? Hover over those to see what
>> filter factory it is. When words are split into multiple columns at one of
>> those points, it indicates that the filter factory broke apart the word
>> while analyzing it. Do a search for the filter filter factories that you
>> find and read up on them. In my case "1a" was being split into 4 by a word
>> delimiter filter factory - "1a", "1", "a", "1a" which caused highlighting
>> to fail in my case while still getting a hit. It also caused erroneous hits
>> elsewhere. Adding some switches to the schema is all it took to correct
>> that for me. However, every case is different based on your needs. That is
>> why it is important to go through the analyzer and see if Solr's indexing
>> and querying are doing what you expect.
>>
>> If that looks good and you've got solid hits all the way down, then it is
>> time to start looking at your highlighter implementation in the index and
>> query analyzers that you are using. My original issue of not being able to
>> highlight phrases with one set of tags necessitated me switching to the
>> fast vector highlighter - which had its own requirements for certain
>> parameters to be set. Here again - going to the Solr docs and reading up on
>> the various highlighters will be helpful in most cases.
>>
>> Solr has a very steep learning curve. I've been using it for several years
>> and I still consider myself a noob. It can be a deep dive, but don't be
>> discouraged. Keep at it. Cheers!
>>
>> -Teague
>>
>> On Wed, Dec 16, 2015 at 8:54 PM, Evert R.  wrote:
>>
>> > Hi Erick and Teague,
>> >
>> >
>> > I found that when using the field 'text' it shows the pdf file result
>> > id:pdf1 in this case, like:
>> >
>> > http://localhost:8983/solr/techproducts/select?fq=id:pdf1=nietava
>> >
>> > but when highlight, using the text field...nothing comes up...
>> >
>> >
>> >
>> http://localhost:8983/solr/techproducts/select?q=text:nietava=id:pdf1=json=true=true=text=%3Cem%3E=%3C%2Fem%3E
>> >
>> > of even with the option
>> >
>> > f.text.hl.snippets=2 under the hl.fl field.
>> >
>> >
>> > I tried as well with the standard configuration, did it all over,
>> reindexed
>> > a couple times... and still did not work.
>> >
>> > Also,
>> >
>> > Using the Analysis, it brings below information:
>> >
>> > ST
>> > textraw_bytesstartendpositionLengthtypeposition
>> > nietava[6e 69 65 74 61 76 61]0711
>> > SF
>> > textraw_bytesstartendpositionLengthtypeposition
>> > nietava[6e 69 65 74 61 76 61]0711
>> > LCF
>> > textraw_bytesstartendpositionLengthtypeposition
>> > nietava[6e 69 65 74 61 76 61]0711
>> >
>> >
>> > Alphanumeric I think... so, it´s 'string', right? 

Re: Expected mime type application/octet-stream but got text/html

2015-12-17 Thread Erick Erickson
Andrej:

Indeed, it's a doc problem. A long time ago in a Solr far away, there
was a bunch of effort to use the "default" collection (collection1).
When that was changed, this documentation didn't get updated.

We'll update it in a few, thanks for reporting!

Erick

On Thu, Dec 17, 2015 at 1:39 AM, Andrej van der Zee
 wrote:
> It turns out that the documentation is not correct. If I specify the
> collection name after shards=, it does work as expected. So this works:
> curl "
> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&rows=1000&shards=54.93.121.54:8986/solr/connects
> "
>
> This does not work:
> curl "
> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&rows=1000&shards=54.93.121.54:8986/solr
> "
>
> So I guess the documentation needs an update?
>
> Cheers,
> Andrej
>
>
> On Thu, Dec 17, 2015 at 10:36 AM, Markus Jelsma 
> wrote:
>
>> Hi - looks like Solr did not start up correctly, got some errors and kept
>> Jetty running. You should find information in that node's logs.
>> M.
>>
>>
>> -Original message-
>> > From:Andrej van der Zee 
>> > Sent: Thursday 17th December 2015 10:32
>> > To: solr-user@lucene.apache.org
>> > Subject: Expected mime type application/octet-stream but got text/html
>> >
>> > Hi,
>> >
>> > I am having troubles getting data from a particular shard, even though I
>> > follow the documentation:
>> >
>> > https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
>> >
>> > This is OK:
>> >
>> >  curl "
>> >
>> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true"
>> > {
>> >// returns correct result set
>> > }
>> >
>> > But this is NOT OK when I specify a particular shard:
>> >
>> > curl "
>> >
>> http://54.93.121.54:8986/solr/connects/select?q=*%3A*&wt=json&indent=true&rows=1000&shards=54.93.121.54:8986/solr
>> > "
>> >
>> > {
>> >   "responseHeader":{
>> > "status":404,
>> > "QTime":5,
>> > "params":{
>> >   "q":"*:*",
>> >   "shards":"54.93.121.54:8986/solr",
>> >   "indent":"true",
>> >   "rows":"1000",
>> >   "wt":"json"}},
>> >   "error":{
>> > "msg":"Error from server at http://54.93.121.54:8986/solr: Expected
>> > mime type application/octet-stream but got text/html.
>> > <html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html;
>> > charset=UTF-8\"/>\n<title>Error 404 Not
>> > Found</title>\n</head>\n<body><h2>HTTP ERROR 404</h2>\n<p>Problem
>> > accessing /solr/select. Reason:\n<pre>    Not Found</pre></p>\n<hr /><i><small>Powered
>> > by Jetty://</small></i>\n</body>\n</html>\n",
>> > "code":404}}
>> >
>> > Any idea?
>> >
>> > Thanks,
>> > Andrej
>> >
>>


Re: SolrCloud 4.8.1 - commit wait

2015-12-17 Thread Erick Erickson
Glad to hear it's solved! The suggester stuff is
way cool, but can surprise you!
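
For anyone hitting the same thing, the usual fix is to turn off build-on-commit
in the suggester definition in solrconfig.xml. A sketch, with the component,
dictionary, and field names as assumptions:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="field">title</str>
    <str name="buildOnCommit">false</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>

The structures can then be rebuilt on demand by sending suggest.build=true to
the suggest handler instead of paying the cost on every commit.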

Erick

On Thu, Dec 17, 2015 at 2:54 AM, Vincenzo D'Amore  wrote:
> Great!!! Great Erick! It was a buildOnCommit.
>
> Many thanks for your help.
>
>
>
> On Wed, Dec 16, 2015 at 6:30 PM, Erick Erickson 
> wrote:
>
>> Quick scan, but probably this:
>>  INFO
>>  o.a.solr.spelling.suggest.Suggester - build()
>>
>> The suggester build process can easily take many minutes, there's some
>> explanation here:
>> https://lucidworks.com/blog/2015/03/04/solr-suggester/
>>
>> the short form is that depending on how it's defined, it may have to
>> read _all_ the
>> documents in your entire corpus to build the suggester structures. And
>> you apparently
>> have buildOnCommit set to true.
>>
>> Note particularly the caveats there about the Solr version required so that
>> buildOnStartup=false is honored.
>>
>> Best,
>> Erick
>>
>> On Wed, Dec 16, 2015 at 2:34 AM, Vincenzo D'Amore 
>> wrote:
>> > Hi,
>> >
>> > an update. Hope you can help me.
>> >
>> > I have stopped all the other working collections, in order to have a
>> clean
>> > log file.
>> >
>> > at 11:01:16 an hard commit has been issued
>> >
>> > 2015-12-16 11:01:49,839 [http-bio-8080-exec-824] INFO
>> >  org.apache.solr.update.UpdateHandler - start
>> >
>> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>> >
>> > at 11:11:31,344 the commit has been completed.
>> >
>> > The commit was ended logging this line, I suppose 615021 is the wait time
>> > (roughly 10 minutes) :
>> >
>> > 2015-12-16 11:11:31,343 [http-bio-8080-exec-991] INFO
>> >  o.a.s.u.processor.LogUpdateProcessor - [catalogo_shard2_replica3]
>> > webapp=/solr path=/update
>> >
>> params={waitSearcher=true=true=false=javabin=2}
>> > {commit=} 0 615021
>> >
>> > During these 10 minutes, the server logged "only" these lines; looking at
>> > them I don't see anything strange:
>> >
>> > 2015-12-16 11:01:50,705 [http-bio-8080-exec-824] INFO
>> >  o.a.solr.search.SolrIndexSearcher - Opening
>> > Searcher@6d5c31e2[catalogo_shard1_replica2]
>> > main
>> > 2015-12-16 11:01:50,724 [http-bio-8080-exec-824] INFO
>> >  org.apache.solr.update.UpdateHandler - end_commit_flush
>> > 2015-12-16 11:02:20,722 [searcherExecutor-108-thread-1] INFO
>> >  o.a.solr.spelling.suggest.Suggester - build()
>> > 2015-12-16 11:02:21,846 [http-bio-8080-exec-824] INFO
>> >  o.a.s.u.processor.LogUpdateProcessor - [catalogo_shard1_replica2]
>> > webapp=/solr path=/update
>> >
>> params={update.distrib=FROMLEADER=true=true=true=false=
>> >
>> http://192.168.101.118:8080/solr/catalogo_shard2_replica3/_end_point=true=javabin=2=false
>> }
>> > {commit=} 0 32007
>> > 2015-12-16 11:05:47,162 [http-bio-8080-exec-1037] INFO
>> >  org.apache.solr.update.UpdateHandler - start
>> >
>> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>> > 2015-12-16 11:05:47,970 [http-bio-8080-exec-1037] INFO
>> >  o.a.solr.search.SolrIndexSearcher - Opening
>> > Searcher@4ede7ac5[catalogo_shard2_replica3]
>> > main
>> > 2015-12-16 11:05:47,989 [http-bio-8080-exec-1037] INFO
>> >  org.apache.solr.update.UpdateHandler - end_commit_flush
>> > 2015-12-16 11:06:03,063 [commitScheduler-115-thread-1] INFO
>> >  org.apache.solr.update.UpdateHandler - start
>> >
>> commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>> > 2015-12-16 11:06:03,896 [commitScheduler-115-thread-1] INFO
>> >  o.a.solr.search.SolrIndexSearcher - Opening
>> > Searcher@2bf4fd3a[catalogo_shard3_replica1]
>> > realtime
>> > 2015-12-16 11:06:03,913 [commitScheduler-115-thread-1] INFO
>> >  org.apache.solr.update.UpdateHandler - end_commit_flush
>> > 2015-12-16 11:06:19,435 [searcherExecutor-111-thread-1] INFO
>> >  o.a.solr.spelling.suggest.Suggester - build()
>> > 2015-12-16 11:06:20,589 [http-bio-8080-exec-1037] INFO
>> >  o.a.s.u.processor.LogUpdateProcessor - [catalogo_shard2_replica3]
>> > webapp=/solr path=/update
>> >
>> params={update.distrib=FROMLEADER=true=true=true=false=
>> >
>> http://192.168.101.118:8080/solr/catalogo_shard2_replica3/_end_point=true=javabin=2=false
>> }
>> > {commit=} 0 33427
>> > 2015-12-16 11:08:07,076 [http-bio-8080-exec-1037] INFO
>> >  org.apache.solr.update.UpdateHandler - start
>> >
>> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>> > 2015-12-16 11:08:07,076 [http-bio-8080-exec-1037] INFO
>> >  org.apache.solr.update.UpdateHandler - No uncommitted changes. Skipping
>> > IW.commit.
>> > 2015-12-16 11:08:07,076 [http-bio-8080-exec-1037] INFO
>> >  o.a.solr.search.SolrIndexSearcher - Opening
>> > Searcher@75b2727f[catalogo_shard3_replica1]
>> > main
>> > 2015-12-16 11:08:07,084 [http-bio-8080-exec-1037] INFO
>> >  

Re: Expected mime type application/octet-stream but got text/html

2015-12-17 Thread Chris Hostetter
: 
: Indeed, it's a doc problem. A long time ago in a Solr far away, there
: was a bunch of effort to use the "default" collection (collection1).
: When that was changed, this documentation didn't get updated.
: 
: We'll update it in a few, thanks for reporting!

Fixed on erick's behalf because he had to run to a meeting...

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests

...I also went ahead and shifted the examples to put more emphasis on using
shard IDs, since that's probably safer/cleaner for most people.



-Hoss
http://www.lucidworks.com/


Re: solr cloud invalid shard/collection configuration

2015-12-17 Thread Shawn Heisey
On 12/14/2015 10:47 PM, ig01 wrote:
> We installed solr with the solr.cmd -e cloud utility that comes with the
> installation.
> The names of the shards are odd because, in this case, after the installation
> we migrated an old index from our other environment (which is solr single
> node) and split it with the Collection API SPLITSHARD command.
> The splitting completed successfully, documents were spread almost equally
> between two shards, and I was able to retrieve our old documents. After that
> I deleted the old shard that was split (with the Collection API delete
> command).
>
> Anyway this behavior is the same also for a regular solr cloud installation
> with solr.cmd -e cloud, without any index migration...

The solr.cmd script did not exist in Solr 4.4, which your initial
message on this thread indicated you were running.

If we ignore that and assume you're running a 5.x version, then we reach
another snag.  The "-e cloud" option is not for installation -- that's
for setting up a multi-node SolrCloud *example* with all nodes running
on the same host and one zookeeper node embedded in the first Solr
node.  A production installation of SolrCloud should have multiple
servers.  Each Solr server will most likely be running one SolrCloud
node, and three or five of your servers will be running a standalone
zookeeper, part of a zookeeper ensemble.

> We are indexing our documents by using the
> url="http://10.1.20.31:8983/solr/collection1/".
> After the installation we indexed 4 documents and they were all indexed on
> the same shard.

This doesn't give any information about how you are indexing.  This is a
variable assignment, telling your program where to find Solr.  Your
indexing program could be using any of dozens of Solr libraries, or a
program that uses pure HTTP, constructing URLs internally.  It could
even be a shell script using curl.

Thanks,
Shawn



Re: Trying to index document in Solr with solr-spark library

2015-12-17 Thread Erick Erickson
Looks like your Spark job is not connecting to the same Zookeeper
as your Solr nodes.

Or, I suppose, the Solr nodes aren't started.
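
A quick way to verify the zkHost from plain SolrJ (4.10), outside Spark; the
ZooKeeper hosts and chroot below are placeholders for whatever your Solr nodes
were started with:

import org.apache.solr.client.solrj.impl.CloudSolrServer;

CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
server.setDefaultCollection("collection1");
server.connect(); // fails with NoNode for /live_nodes if the zkHost or chroot is wrong

If the cluster registers under a chroot (for example /solr, which is common on
Cloudera), leaving it off the connect string produces exactly this NoNode error.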

You might get more information on the Cloudera help boards

Best,
Erick

On Wed, Dec 16, 2015 at 11:58 PM, Guillermo Ortiz  wrote:
> I'm getting some errors when I try to use the solr-spark library, namely
> the error *KeeperErrorCode = NoNode for /live_nodes*.
>
> I download the library and compile with the branch_4.x since I'm using
> Cloudera 5.5.1 and Solr 4.10.3.
>
> I checked the logs of Solr and Zookeeper and I didn't find any error and
> navigate inside Zookeeper and the collection is created. These errors
> happen in the executors of Spark.
>
>
> 2015-12-16 16:31:43,923 [Executor task launch worker-1] INFO
> org.apache.zookeeper.ZooKeeper - Session: 0x1519126c7d55b23 closed
>
> 2015-12-16 16:31:43,924 [Executor task launch worker-1] ERROR org.apache.
> spark.executor.Executor - Exception in task 5.2 in stage 12.0 (TID 218)
> org.apache.solr.common.cloud.ZooKeeperException:
> at org.apache.solr
> .client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:252)
> at com.lucidworks.spark.SolrSupport.getSolrServer(SolrSupport.java:67)
> at com.lucidworks.spark.SolrSupport$4.call(SolrSupport.java:162)
> at com.lucidworks.spark.SolrSupport$4.call(SolrSupport.java:160)
> at org.apache.spark
> .api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
> at org.apache.spark
> .api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
> at org.apache.spark
> .rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
> at org.apache.spark
> .rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
> at org.apache.spark
> .SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at org.apache.spark
> .SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> *Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for /live_nodes*
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
> at org.apache.solr
> .common.cloud.SolrZkClient$7.execute(SolrZkClient.java:290)
> at org.apache.solr
> .common.cloud.SolrZkClient$7.execute(SolrZkClient.java:287)
> at org.apache.solr
> .common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
> at org.apache.solr
> .common.cloud.SolrZkClient.getChildren(SolrZkClient.java:287)
> at org.apache.solr
> .common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:334)
> at org.apache.solr
> .client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:243)


Re: A field _indexed_at_tdt added when I index documents.

2015-12-17 Thread Pushkar Raste
You must have this field in your schema with some default value assigned to
it (most probably the default value is NOW). This field is usually used to
record the timestamp at which the document was last indexed.

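For reference, the classic definition of such a timestamp field in schema.xml
looks like this (a sketch; the spark-solr indexing pipeline is what typically
supplies the field):

<field name="_indexed_at_tdt" type="tdate" indexed="true" stored="true" default="NOW"/>

With default="NOW", Solr fills the value in at index time, so documents sent
without the field still get a timestamp.
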
On 17 December 2015 at 04:51, Guillermo Ortiz  wrote:

> I'm indexing documents in solr with Spark and it's missing the a field
>  _indexed_at_tdt who is doesn't exist in my documents.
>
> I have added this field in my schema, why is this field being added? any
> solution?
>


propagate Query.rewrite call to super.rewrite after 5.4 upgrade

2015-12-17 Thread Markus Jelsma
Hi,

Apologies for the cross post. We have a class overriding
SpanPositionRangeQuery. It is similar to a SpanFirst query, but it is capable of
adjusting the boost value with regard to distance. With the 5.4 upgrade the
unit tests suddenly threw the following exception:

Query class org.GrSpanFirstQuery does not propagate Query.rewrite call to 
super.rewrite
at 
__randomizedtesting.SeedInfo.seed([CA3D7CF96D5E8E7:88BE883E6CA09E3F]:0)
at junit.framework.Assert.fail(Assert.java:57)
at junit.framework.Assert.assertTrue(Assert.java:22)
at org.apache.lucene.search.QueryUtils.check(QueryUtils.java:73)
at 
org.apache.lucene.search.AssertingIndexSearcher.rewrite(AssertingIndexSearcher.java:83)
at 
org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:886)
at 
org.apache.lucene.search.AssertingIndexSearcher.createNormalizedWeight(AssertingIndexSearcher.java:58)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:744)
at 
org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:460)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:489)

I tracked it down to LUCENE-6590 - Explore different ways to apply boosts - but
the solution did not really pop into my head right away. Implementing rewrite
does not seem to change anything. Everything fails in the unit test at the
point where I want to retrieve docs and assert their positions in the result set:
ScoreDoc[] docs = searcher.search(spanfirstquery, 10).scoreDocs;
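
For what it's worth, the pattern that the QueryUtils check expects is to give
super.rewrite the first chance, since in 5.4 the base class is what pulls a
non-default boost out into a BoostQuery. A minimal sketch (imports:
java.io.IOException, org.apache.lucene.index.IndexReader,
org.apache.lucene.search.Query); the custom part is illustrative, not the
actual GrSpanFirstQuery code:

@Override
public Query rewrite(IndexReader reader) throws IOException {
  // Since LUCENE-6590, Query.rewrite() extracts a non-default boost into
  // a BoostQuery wrapper, so it must run before any custom logic.
  Query rewritten = super.rewrite(reader);
  if (rewritten != this) {
    return rewritten; // boost was peeled off; Lucene will rewrite us again
  }
  // ...custom rewriting here, returning 'this' only when nothing changed...
  return this;
}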

I am probably missing something but any ideas to share?

Many thanks!
Markus


Re: Solr Basic Configuration - Highlight - Begginer

2015-12-17 Thread Evert R.
Hello Erick,

Sorry for my mistakes. Here is everything I have so far:

1. It brings the result perfectly, but the highlighting field comes back empty, as below:
{

  "responseHeader":{
"status":0,
"QTime":15,
"params":{
  "q":"text:nietava",
  "debug":"query",
  "hl":"true",
  "hl.simple.post":"",
  "indent":"true",
  "fq":"id:pdf1",
  "hl.fl":"text",
  "wt":"json",
  "hl.simple.pre":""}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"pdf1",
"last_modified":"2011-07-28T20:39:26Z",
"title":["Microsoft Word - André Luiz - Sexo e Destino _Chico
e Waldo_.doc"],
"content_type":["application/pdf"],
"author":"Wander",
"author_s":"Wander",
"content":["André Luiz - Sexo e Destino _Chico e Waldo_.doc
***the whole content*** nietava"],

"_version_":1520765393269948416}]
  },
  *"highlighting":{
"pdf1":{***I THINK THE SNIPPETS OF TEXT SHOULD BE IN HERE, RIGHT?***}},*
  "debug":{
"rawquerystring":"text:nietava",
"querystring":"text:nietava",
"parsedquery":"text:nietava",
"parsedquery_toString":"text:nietava",
"QParser":"LuceneQParser",
"filter_queries":["id:pdf1"],

"parsed_filter_queries":["id:pdf1"]}}


2. Here are my settings:

In schema.xml:





  



  
  




  


In solrconfig.xml:

  explicit 10 false 

I have tried:

schema.xml:   

schema.xml:   

schema.xml:

solrconfig.xml:

text
on
text
true
100



The debug is in the reply I have received.


I am still using the standard techproducts.


I hope this is complete enough.


Thanks again!



*Evert*

2015-12-17 2:01 GMT-02:00 Erick Erickson :

> bq: but when highlight, using the text field...nothing comes up...
>
>
> http://localhost:8983/solr/techproducts/select?q=text:nietava=id:pdf1=json=true=true=text=%3Cem%3E=%3C%2Fem%3E
>
> It's unclear what this means. No results showed up (i.e. numFound==0)
> or no highlighting showed up? Assuming that
> 1> the "text" field has stored=true and
> 2> you find documents when searching on the "text" field
> the above should show something in the highlights section.
>
> Please take the time to provide complete details. Guessing what you're
> doing is wasting time, mine and yours. Once more:
> 1> what is the schema definition for the "text" field. Include the
> fieldType definition
> 2> What is the result of adding =query to the field when you
> don't get highlights
>
> You might review: http://wiki.apache.org/solr/UsingMailingLists
> because it's becoming quite frustrating that you give us little bits
> of information that leave us guessing what you're _really_ doing.
> Highlighting is working for lots of people in lots of sites, it's not
> likely that this functionality is completely broken so the answer will
> be in the docs.
>
> Best,
> ERick
>
> On Wed, Dec 16, 2015 at 5:54 PM, Evert R.  wrote:
> > Hi Erick and Teague,
> >
> >
> > I found that when using the field 'text' it shows the pdf file result
> > id:pdf1 in this case, like:
> >
> > http://localhost:8983/solr/techproducts/select?fq=id:pdf1=nietava
> >
> > but when highlight, using the text field...nothing comes up...
> >
> >
> http://localhost:8983/solr/techproducts/select?q=text:nietava=id:pdf1=json=true=true=text=%3Cem%3E=%3C%2Fem%3E
> >
> > of even with the option
> >
> > f.text.hl.snippets=2 under the hl.fl field.
> >
> >
> > I tried as well with the standard configuration, did it all over,
> reindexed
> > a couple times... and still did not work.
> >
> > Also,
> >
> > Using the Analysis, it brings below information:
> >
> > ST
> > textraw_bytesstartendpositionLengthtypeposition
> > nietava[6e 69 65 74 61 76 61]0711
> > SF
> > textraw_bytesstartendpositionLengthtypeposition
> > nietava[6e 69 65 74 61 76 61]0711
> > LCF
> > textraw_bytesstartendpositionLengthtypeposition
> > nietava[6e 69 65 74 61 76 61]0711
> >
> >
> > Alphanumeric I think... so, it´s 'string', right? would that be a
> problem?
> > Should be some other indication?
> >
> >
> > Thanks again!
> >
> >
> > *Evert*
> >
> > 2015-12-16 21:09 GMT-02:00 Erick Erickson :
> >
> >> I think you're still missing the critical bit. Highlighting is
> >> completely separate from searching. In other words, you can search on
> >> one field and highlight another. What field is searched is governed by
> >> the "qf" parameter when using edismax and by the the "df" parameter
> >> configured in your request handler in solrconfig.xml. These defaults
> >> are overridden when you do a "fielded search" like
> >>
> >> q=content:nietava
> >>
> >> So this: q=content:nietava=true=content
> >> is searching the "content" field. The word you're looking for isn't in

Re: SolrCloud 4.8.1 - commit wait

2015-12-17 Thread Vincenzo D'Amore
Great!!! Great Erick! It was a buildOnCommit.

Many thanks for your help.



On Wed, Dec 16, 2015 at 6:30 PM, Erick Erickson 
wrote:

> Quick scan, but probably this:
>  INFO
>  o.a.solr.spelling.suggest.Suggester - build()
>
> The suggester build process can easily take many minutes, there's some
> explanation here:
> https://lucidworks.com/blog/2015/03/04/solr-suggester/
>
> the short form is that depending on how it's defined, it may have to
> read _all_ the
> documents in your entire corpus to build the suggester structures. And
> you apparently
> have buildOnCommit set to true.
>
> Note particularly the caveats there about the Solr version required so that
> buildOnStartup=false is honored.
>
> Best,
> Erick
>
> On Wed, Dec 16, 2015 at 2:34 AM, Vincenzo D'Amore 
> wrote:
> > Hi,
> >
> > an update. Hope you can help me.
> >
> > I have stopped all the other working collections, in order to have a
> clean
> > log file.
> >
> > at 11:01:16 an hard commit has been issued
> >
> > 2015-12-16 11:01:49,839 [http-bio-8080-exec-824] INFO
> >  org.apache.solr.update.UpdateHandler - start
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> >
> > at 11:11:31,344 the commit has been completed.
> >
> > The commit was ended logging this line, I suppose 615021 is the wait time
> > (roughly 10 minutes) :
> >
> > 2015-12-16 11:11:31,343 [http-bio-8080-exec-991] INFO
> >  o.a.s.u.processor.LogUpdateProcessor - [catalogo_shard2_replica3]
> > webapp=/solr path=/update
> >
> params={waitSearcher=true=true=false=javabin=2}
> > {commit=} 0 615021
> >
> > During these 10 minutes, the server logged "only" these lines; looking at
> > them I don't see anything strange:
> >
> > 2015-12-16 11:01:50,705 [http-bio-8080-exec-824] INFO
> >  o.a.solr.search.SolrIndexSearcher - Opening
> > Searcher@6d5c31e2[catalogo_shard1_replica2]
> > main
> > 2015-12-16 11:01:50,724 [http-bio-8080-exec-824] INFO
> >  org.apache.solr.update.UpdateHandler - end_commit_flush
> > 2015-12-16 11:02:20,722 [searcherExecutor-108-thread-1] INFO
> >  o.a.solr.spelling.suggest.Suggester - build()
> > 2015-12-16 11:02:21,846 [http-bio-8080-exec-824] INFO
> >  o.a.s.u.processor.LogUpdateProcessor - [catalogo_shard1_replica2]
> > webapp=/solr path=/update
> >
> params={update.distrib=FROMLEADER=true=true=true=false=
> >
> http://192.168.101.118:8080/solr/catalogo_shard2_replica3/_end_point=true=javabin=2=false
> }
> > {commit=} 0 32007
> > 2015-12-16 11:05:47,162 [http-bio-8080-exec-1037] INFO
> >  org.apache.solr.update.UpdateHandler - start
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > 2015-12-16 11:05:47,970 [http-bio-8080-exec-1037] INFO
> >  o.a.solr.search.SolrIndexSearcher - Opening
> > Searcher@4ede7ac5[catalogo_shard2_replica3]
> > main
> > 2015-12-16 11:05:47,989 [http-bio-8080-exec-1037] INFO
> >  org.apache.solr.update.UpdateHandler - end_commit_flush
> > 2015-12-16 11:06:03,063 [commitScheduler-115-thread-1] INFO
> >  org.apache.solr.update.UpdateHandler - start
> >
> commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > 2015-12-16 11:06:03,896 [commitScheduler-115-thread-1] INFO
> >  o.a.solr.search.SolrIndexSearcher - Opening
> > Searcher@2bf4fd3a[catalogo_shard3_replica1]
> > realtime
> > 2015-12-16 11:06:03,913 [commitScheduler-115-thread-1] INFO
> >  org.apache.solr.update.UpdateHandler - end_commit_flush
> > 2015-12-16 11:06:19,435 [searcherExecutor-111-thread-1] INFO
> >  o.a.solr.spelling.suggest.Suggester - build()
> > 2015-12-16 11:06:20,589 [http-bio-8080-exec-1037] INFO
> >  o.a.s.u.processor.LogUpdateProcessor - [catalogo_shard2_replica3]
> > webapp=/solr path=/update
> >
> params={update.distrib=FROMLEADER=true=true=true=false=
> >
> http://192.168.101.118:8080/solr/catalogo_shard2_replica3/_end_point=true=javabin=2=false
> }
> > {commit=} 0 33427
> > 2015-12-16 11:08:07,076 [http-bio-8080-exec-1037] INFO
> >  org.apache.solr.update.UpdateHandler - start
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > 2015-12-16 11:08:07,076 [http-bio-8080-exec-1037] INFO
> >  org.apache.solr.update.UpdateHandler - No uncommitted changes. Skipping
> > IW.commit.
> > 2015-12-16 11:08:07,076 [http-bio-8080-exec-1037] INFO
> >  o.a.solr.search.SolrIndexSearcher - Opening
> > Searcher@75b2727f[catalogo_shard3_replica1]
> > main
> > 2015-12-16 11:08:07,084 [http-bio-8080-exec-1037] INFO
> >  org.apache.solr.update.UpdateHandler - end_commit_flush
> > 2015-12-16 11:08:39,040 [searcherExecutor-114-thread-1] INFO
> >  o.a.solr.spelling.suggest.Suggester - build()
> > 2015-12-16 11:08:40,286 [http-bio-8080-exec-1037] INFO
> >  o.a.s.u.processor.LogUpdateProcessor - [catalogo_shard3_replica1]
> > webapp=/solr 

Re: Issues when indexing PDF files

2015-12-17 Thread Charlie Hull

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:

Hi Alexandre,

Thanks for your reply.

So the only way to solve this issue is to explore with PDF specific tools
and change the encoding of the file?
Is there any way to configure it in Solr?


Solr uses Tika to extract plain text from PDFs. If the PDFs have been 
created in a way that Tika cannot easily extract the text, there's 
nothing you can do in Solr that will help.


Unfortunately PDF isn't a content format but a presentation format - so 
extracting plain text is fraught with difficulty. You may see a 
character on a PDF page, but exactly how that character is generated 
(using a specific encoding, font, or even by drawing a picture) is 
outside your control. There are various businesses built on this premise 
- they charge for creating clean extracted text from PDFs - and even 
they have trouble with some PDFs.


HTH

Charlie



Regards,
Edwin


On 17 December 2015 at 15:42, Alexandre Rafalovitch 
wrote:


They could be using custom fonts and non-Unicode characters. That's
probably something to explore with PDF specific tools.
On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
wrote:


I've checked all the files which has problem with the content in the Solr
index using the Tika app. All of them shows the same issues as what I see
in the Solr index.

So does the issues lies with the encoding of the file? Are we able to

check

the encoding of the file?


Regards,
Edwin


On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
wrote:


Hi Erik,

I've shared the file on dropbox, which you can access via the link

here:



https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


This is what I get from the Tika app after dropping the file in.

Content-Length: 75092
Content-Type: application/pdf
Type: COSName{Info}
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
X-TIKA:digest:SHA256:
d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.3
pdf:PDFVersion: 1.3
pdf:encrypted: false
producer: null
resourceName: Desmophen+670+BAe.pdf
xmpTPg:NPages: 3


Regards,
Edwin


On 17 December 2015 at 00:15, Erik Hatcher 

wrote:



Edwin - Can you share one of those PDF files?

Also, drop the file into the Tika app and see what it sees directly -

get

the tika-app JAR and run that desktop application.

Could be an encoding issue?

 Erik

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 




On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <

edwinye...@gmail.com>

wrote:


Hi,

I'm using Solr 5.3.0

I'm indexing some PDF documents. However, for certain PDF files,

there

are

chinese text in the documents, but after indexing, what is indexed

in

the

content is either a series of "??" or an empty content.

I'm using the post.jar that comes together with Solr.

What could be the reason that causes this?

Regards,
Edwin


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Solr Basic Configuration - Highlight - Begginer

2015-12-17 Thread Evert R.
Hello Teague,

Thanks for your reply and tip! I think Solr will give me a better result
than just using Tika to read up my files and send to a Fulltext Index in my
MySQL, which has the precise point of not highlighting the text snippets...

So, I will keep on trying to fix Solr to my needs, and sure it works... I
am missing something.

Thanks again and I will keep on track.

When I find the solution I will post all files and configs here for future
references.

Best regards,

*Evert*

2015-12-17 6:11 GMT-02:00 Teague James :

> Erik's comments not withstanding, there are some gaps in my understanding
> of your precise situation. Here's a few things that weren't necessarily
> obvious to me when I took my first try with Solr.
>
> Highlighting is the end result of a good hit. It is essentially formatting
> applied to your hit. It is possible to get a hit without a highlight if
> certain conditions exist.
>
> First, start by making sure you are indexing your target (a PDF file?)
> correctly. Assuming you are indexing PDFs, are you extracting meta data
> only or are you parsing the document with Tika? If you want hits on the
> contents of your PDF, then you have to parse it at index time and store
> that.That was why I suggested just running some queries through the
> interface and the URL to see what Solr actually captured from your indexed
> PDF before worrying about how it looks on the screen.
>
> Next, you should look carefully at the Analyzer's output. Notice the
> abbreviations to the left of the columns? Hover over those to see what
> filter factory it is. When words are split into multiple columns at one of
> those points, it indicates that the filter factory broke apart the word
> while analyzing it. Do a search for the filter filter factories that you
> find and read up on them. In my case "1a" was being split into 4 by a word
> delimiter filter factory - "1a", "1", "a", "1a" which caused highlighting
> to fail in my case while still getting a hit. It also caused erroneous hits
> elsewhere. Adding some switches to the schema is all it took to correct
> that for me. However, every case is different based on your needs. That is
> why it is important to go through the analyzer and see if Solr's indexing
> and querying are doing what you expect.
>
> If that looks good and you've got solid hits all the way down, then it is
> time to start looking at your highlighter implementation in the index and
> query analyzers that you are using. My original issue of not being able to
> highlight phrases with one set of tags necessitated me switching to the
> fast vector highlighter - which had its own requirements for certain
> parameters to be set. Here again - going to the Solr docs and reading up on
> the various highlighters will be helpful in most cases.
>
> Solr has a very steep learning curve. I've been using it for several years
> and I still consider myself a noob. It can be a deep dive, but don't be
> discouraged. Keep at it. Cheers!
>
> -Teague
>
> On Wed, Dec 16, 2015 at 8:54 PM, Evert R.  wrote:
>
> > Hi Erick and Teague,
> >
> >
> > I found that when using the field 'text' it shows the pdf file result
> > id:pdf1 in this case, like:
> >
> > http://localhost:8983/solr/techproducts/select?fq=id:pdf1=nietava
> >
> > but when highlight, using the text field...nothing comes up...
> >
> >
> >
> http://localhost:8983/solr/techproducts/select?q=text:nietava=id:pdf1=json=true=true=text=%3Cem%3E=%3C%2Fem%3E
> >
> > ​of even with the option
> >
> > f.text.hl.snippets=2 under the hl.fl field.
> >
> >
> > I tried as well with the standard configuration, did it all over,
> reindexed
> > a couple times... and still did not work.
> >
> > Also,
> >
> > Using the Analysis, it brings below information:
> >
> > ST
> > textraw_bytesstartendpositionLengthtypeposition
> > nietava[6e 69 65 74 61 76 61]0711
> > SF
> > textraw_bytesstartendpositionLengthtypeposition
> > nietava[6e 69 65 74 61 76 61]0711
> > LCF
> > textraw_bytesstartendpositionLengthtypeposition
> > nietava[6e 69 65 74 61 76 61]0711
> > ​
> >
> > Alphanumeric I think... so, it´s 'string', right? would that be a
> problem?
> > Should be some other indication?
> >
> >
> > Thanks again!
> >
> >
> > *Evert*
> >
> > 2015-12-16 21:09 GMT-02:00 Erick Erickson :
> >
> > > I think you're still missing the critical bit. Highlighting is
> > > completely separate from searching. In other words, you can search on
> > > one field and highlight another. What field is searched is governed by
> > > the "qf" parameter when using edismax and by the the "df" parameter
> > > configured in your request handler in solrconfig.xml. These defaults
> > > are overridden when you do a "fielded search" like
> > >
> > > q=content:nietava
> > >
> > > So this: q=content:nietava&hl=true&hl.fl=content
> > > is searching the "content" field. The word you're looking for isn't in
> > > the content field so naturally no docs are found.
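
(To make Erick's search-vs-highlight distinction concrete, a minimal SolrJ
sketch against the techproducts example from this thread - the URL and
field names are the ones used above, and the single-argument HttpSolrClient
constructor is the Solr 5.x one:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class HighlightCheck {
      public static void main(String[] args) throws Exception {
          HttpSolrClient client =
              new HttpSolrClient("http://localhost:8983/solr/techproducts");
          // A fielded search overrides the df/qf defaults: this searches
          // the "text" field regardless of what the request handler says.
          SolrQuery q = new SolrQuery("text:nietava");
          q.setHighlight(true);
          // Highlighting is per-field: if hl.fl names a field the term does
          // not occur in, or a field that is not stored, you get a hit with
          // no snippet.
          q.addHighlightField("text");
          QueryResponse rsp = client.query(q);
          System.out.println(rsp.getHighlighting());
          client.close();
      }
  }

Note that the catch-all "text" field in the default techproducts schema is
indexed but not stored, which by itself is enough to make highlighting on
it come back empty even when the query hits.)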

Re: Security Problems

2015-12-17 Thread Jan Høydahl
Nobody can just run "INSERT foo INTO bar" on a random MySQL server in the
data room, so why should Solr be less secure once auth is enabled?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 16 Dec 2015, at 17:02, Noble Paul wrote:
> 
> I don't think this behavior is intuitive. It is very easy to misunderstand.
> 
> I would rather just add a flag to "authentication" plugin section
> which says "blockUnauthenticated" : true
> 
> which means all unauthenticated requests must be blocked.
> 
> 
> 
> 
> On Tue, Dec 15, 2015 at 7:09 PM, Jan Høydahl  wrote:
>> Yes, that’s why I believe it should be:
>> 1) if only authentication is enabled, all users must authenticate and all 
>> authenticated users can do anything.
>> 2) if authz is enabled, then all users must still authenticate, and can by 
>> default do nothing at all, unless assigned proper roles
>> 3) if a user is assigned the default “read” rule, and a collection adds a 
>> custom “/myselect” handler, that one is unavailable until the user gets it 
>> assigned
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> On 14 Dec 2015, at 14:15, Noble Paul wrote:
>>> 
>>> ". If all paths were closed by default, forgetting to configure a path
>>> would not result in a security breach like today."
>>> 
>>> But it will still mean that unauthorized users are able to access things,
>>> like a guest being able to post to "/update". Just authenticating is not
>>> enough without proper authorization.
>>> 
>>> On Mon, Dec 14, 2015 at 3:59 PM, Jan Høydahl  wrote:
> 1) "read" should cover all the paths
 
 This is very fragile. If all paths were closed by default, forgetting to 
 configure a path would not result in a security breach like today.
 
 /Jan
>>> 
>>> 
>>> 
>>> --
>>> -
>>> Noble Paul
>> 
> 
> 
> 
> -- 
> -
> Noble Paul
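
(For reference, a sketch of where such a flag would sit in security.json -
the flag name below is the one proposed in this thread and hypothetical at
this point, while the surrounding structure follows the documented Basic
Auth and rule-based authorization format, with placeholder credentials:

  {
    "authentication": {
      "class": "solr.BasicAuthPlugin",
      "blockUnauthenticated": true,
      "credentials": { "solr": "<base64 sha256 hash> <base64 salt>" }
    },
    "authorization": {
      "class": "solr.RuleBasedAuthorizationPlugin",
      "permissions": [ { "name": "read", "role": "reader" } ],
      "user-role": { "solr": "reader" }
    }
  }

With the flag set, any request that fails authentication would be rejected
outright instead of falling through to whatever paths no rule covers.)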



Slow query response.

2015-12-17 Thread Modassar Ather
Hi,

I have a field f which is defined as follows.


Solr-5.2.1 is used. The index is spread across 12 shards (no replica) and
the index size on each node is around 100 GB.

When I search for 50 thousand values (ORed together) in the field f it takes
around 45 to 55 seconds.
To my understanding that is too slow. Kindly share your thoughts on this
behavior and provide your suggestions.
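
(The query presumably has the shape of one very large disjunction - a
hypothetical sketch with made-up values:

  q=f:(v1 OR v2 OR v3 ... OR v50000)

Every one of those 50,000 clauses is a separate term lookup, and every
matching document across all 12 shards has to be scored, so the cost grows
with the clause count.)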

Thanks,
Modassar