Re: TIKA OCR not working

2015-04-24 Thread trung.ht
Hi everyone,

Does anyone have an answer to this problem? :)


I saw the documentation of Tika. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
> but it looks like it does not work. Does anyone know whether Tika OCR works
> automatically with Solr, or do I have to change some settings?
>
>>
Trung.


> It's not clear if OCR would happen automatically in Solr Cell, or if
>> changes to Solr would be needed.
>>
>> For Tika OCR info, see:
>>
>> https://issues.apache.org/jira/browse/TIKA-93
>> https://wiki.apache.org/tika/TikaOCR
>>
>>
>>
>> -- Jack Krupansky
>>
>> On Thu, Apr 23, 2015 at 9:14 AM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>>
>> > I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen
>> it
>> > in use yet.
>> >
>> > Regards,
>> > Alex
>> > On 23 Apr 2015 10:24 pm, "Ahmet Arslan" 
>> wrote:
>> >
>> > > Hi Trung,
>> > >
>> > > I didn't know about OCR capabilities of tika.
>> > > Someone who is familiar with solr-cell can inform us whether this
>> > > functionality is added to solr or not.
>> > >
>> > > Ahmet
>> > >
>> > >
>> > >
>> > > On Thursday, April 23, 2015 2:06 PM, trung.ht 
>> wrote:
>> > > Hi Ahmet,
>> > >
>> > > I used a png file, not a pdf file. From the documentation, I understand
>> > > that Solr will post the file to Tika, and since Tika 1.7, OCR is
>> > > included. Is there something I misunderstood?
>> > >
>> > > Trung.
>> > >
>> > >
>> > > On Thu, Apr 23, 2015 at 5:59 PM, Ahmet Arslan
>> > > >
>> > > wrote:
>> > >
>> > > > Hi Trung,
>> > > >
>> > > > solr-cell (tika) does not do OCR. It cannot extract text from
>> > > > image-based pdfs.
>> > > >
>> > > > Ahmet
>> > > >
>> > > >
>> > > >
>> > > > On Thursday, April 23, 2015 7:33 AM, trung.ht 
>> > wrote:
>> > > >
>> > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > I want to use Solr to index some scanned documents. After setting up a
>> > > > Solr document with two fields, "content" and "filename", I tried to
>> > > > upload the attached file, but it seems that the content of the file is
>> > > > only "\n \n \n".
>> > > > But if I use tesseract from the command line I get the result correctly.
>> > > >
>> > > > The log when Solr receives my request:
>> > > > ---
>> > > > INFO  - 2015-04-23 03:49:25.941;
>> > > > org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
>> > > > webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
>> > > >
>> > > > 
>> > > >
>> > > > The document when I check on solr admin page:
>> > > > -
>> > > > { "groupid": 2, "id": "4", "historyid": 4, "userid": 3,
>> "createddate":
>> > > > "2015-04-22T15:00:00Z", "filename":
>> > "trunght\\test\\tesseract_3.png",
>> > > > "autocomplete_text": [ "trunght\\test\\tesseract_3.png" ],
>> > > "content": "
>> > > > \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> \n
>> > \n
>> > > > \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n
>> \n
>> > ",
>> > > > "_version_": 1499213034586898400 }
>> > > >
>> > > > ---
>> > > >
>> > > > Since I am a Solr newbie I do not know where to look. Can anyone give
>> > > > me advice on where to look for errors or settings to make it work?
>> > > > Thanks in advance.
>> > > >
>> > > > Trung.
>> > > >
>> > >
>> >
>>
>
>


Using SolrJ to access schema.xml

2015-04-24 Thread Steven White
Hi Everyone,

Per this link
https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-ListFieldTypes
Solr supports a REST Schema API to modify the schema.  I looked at
http://lucene.apache.org/solr/4_2_1/solr-solrj/index.html?overview-summary.html
in the hope that SolrJ has a Java API to allow schema modification, but I
couldn't find one.  Is this the case or did I not look hard enough?

My need is to manage Solr's schema.xml file using a remote API.  The REST
Schema API gets me there, but I have to write code to work with the request /
response payloads, which I would much rather avoid if it is already out there.
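
For reference, driving the Schema API over plain HTTP from Java would look
roughly like the sketch below -- it assumes a managed schema and a collection
named "collection1"; the field name and values are made up:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AddFieldExample {
  public static void main(String[] args) throws Exception {
    // Schema API "add-field" command, sent as JSON
    String body = "{\"add-field\":{\"name\":\"myNewField\","
        + "\"type\":\"string\",\"stored\":true}}";
    URL url = new URL("http://localhost:8983/solr/collection1/schema");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes("UTF-8"));
    }
    System.out.println("Schema update returned HTTP " + conn.getResponseCode());
  }
}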

Thanks

Steve


Re: payload similarity

2015-04-24 Thread Erick Erickson
I put up a complete example not too long ago that may help, see:

http://lucidworks.com/blog/end-to-end-payload-example-in-solr/

Best,
Erick

On Fri, Apr 24, 2015 at 6:33 AM, Dmitry Kan  wrote:
> Ahmet, exactly. As I have just illustrated with code, simultaneously with
> your reply. Thanks!
>
> On Fri, Apr 24, 2015 at 4:30 PM, Ahmet Arslan 
> wrote:
>
>> Hi Dmitry,
>>
>> I think, it is activated by PayloadTermQuery.
>>
>> Ahmet
>>
>>
>>
>> On Friday, April 24, 2015 2:51 PM, Dmitry Kan 
>> wrote:
>> Hi,
>>
>>
>> Using the approach here
>> http://lucidworks.com/blog/getting-started-with-payloads/ I have
>> implemented my own PayloadSimilarity class. When debugging the code I have
>> noticed, that the scorePayload method is never called. What could be wrong?
>>
>>
>> [code]
>>
>> class PayloadSimilarity extends DefaultSimilarity {
>> @Override
>> public float scorePayload(int doc, int start, int end, BytesRef
>> payload) {
>> float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
>> System.out.println("payloadValue = " + payloadValue);
>> return payloadValue;
>> }
>> }
>>
>> [/code]
>>
>>
>> Here is how the similarity is injected during indexing:
>>
>> [code]
>>
>> PayloadEncoder encoder = new FloatEncoder();
>> IndexWriterConfig indexWriterConfig = new
>> IndexWriterConfig(Version.LUCENE_4_10_4, new
>> PayloadAnalyzer(encoder));
>> payloadSimilarity = new PayloadSimilarity();
>> indexWriterConfig.setSimilarity(payloadSimilarity);
>> IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
>>
>> [/code]
>>
>>
>> and during searching:
>>
>> [code]
>>
>> IndexReader indexReader = DirectoryReader.open(dir);
>> IndexSearcher searcher = new IndexSearcher(indexReader);
>> searcher.setSimilarity(payloadSimilarity);
>>
>> TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
>> termQuery.setBoost(1.1f);
>> TopDocs topDocs = searcher.search(termQuery, 10);
>> printResults(searcher, termQuery, topDocs);
>>
>>
>> [/code]
>>
>> --
>> Dmitry Kan
>> Luke Toolbox: http://github.com/DmitryKey/luke
>> Blog: http://dmitrykan.blogspot.com
>> Twitter: http://twitter.com/dmitrykan
>> SemanticAnalyzer: www.semanticanalyzer.info
>>
>
>
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info


Re: require diversity in results?

2015-04-24 Thread Erick Erickson
Often, for small numbers of distinct types people use grouping and
have the app layer mingle them or whatever is pleasing. I think this
is different from the post-processing you mention. Grouping (aka "field
collapsing") can be expensive if there are a large number of groups
but for small numbers it's not bad.
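
For what it's worth, a grouped request from SolrJ (5.x style) looks something
like the sketch below -- "doctype" is a made-up field name holding the kind of
document; tune group.limit and rows to taste:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedSearch {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("some user query");
    q.set("group", true);             // turn on field collapsing
    q.set("group.field", "doctype");  // made-up field with the resource type
    q.set("group.limit", 5);          // top N docs kept per type
    q.set("rows", 10);                // number of groups to return
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getGroupResponse().getValues());
    client.close();
  }
}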

The general problem is rather "interesting" in that just relaxing the
score doesn't really handle massively different kinds of documents,
think of "songs", "albums", and "reviews in rolling stone". You'd
probably see a bazillion song entries before the first review
entry

Best,
Erick

On Fri, Apr 24, 2015 at 2:36 AM, Paul Libbrecht  wrote:
> Hello list,
>
> I'm wondering if there could be extra parameters or query operators with
> which I could impose that sorting by relevance be relaxed so that
> there's a minimum diversity in some fields on the first page of results.
>
> For example, I'd like the search results to contain at least three
> possible types of resources on the first page, fetching things from below
> if needed.
> I know that could be done as a search result post-processor but I think
> that this is generally a bad idea for performance.
>
> Any other idea?
>
> thanks
>
> Paul
>


Re: Odp.: solr issue with pdf forms

2015-04-24 Thread Erick Erickson
Steve:

Right, it's not exactly obvious. Bring up the admin UI, something like
http://localhost:8983/solr. From there you have to select a core in
the 'core selector' drop-down on the left side. If you're using
SolrCloud, this will have a rather strange name, but it should be easy
to identify what collection it belongs to.

At that point you'll see a bunch of new options, among them "schema
browser". From there, select your field from the drop-down that will
appear, then a button should pop up "load term info".

NOTE: you can get the same information from the TermsComponent, see:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
This is a little more flexible because you can, among other things,
specify the place to start. In your case you might specify
terms.prefix=mein, which will show you the terms that are actually
being _searched_, as opposed to what is stored. The stored values are what you
see in the browser when you search for docs, and they are sometimes
misleading, as you're (probably) seeing.
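
From SolrJ (5.x style) the same peek at the indexed terms would look roughly
like this -- a sketch only, assuming the stock /terms handler and a field
called "content"; substitute your own names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsPeek {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/terms");  // TermsComponent is wired into /terms
    q.setTerms(true);
    q.addTermsField("content");     // field whose indexed terms we want to see
    q.setTermsPrefix("mein");       // start listing at this prefix
    q.setTermsLimit(20);
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getTermsResponse().getTerms("content"));
    client.close();
  }
}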

Best,
Erick

On Fri, Apr 24, 2015 at 1:58 AM,   wrote:
> Hey Erick,
>
> thanks a lot for your answer. I went to the admin schema browser, but what
> should I see there? Sorry, I'm not familiar with the admin schema browser. :-(
>
> Best
> Steve
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, 23 April 2015 18:00
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> When you say "they're not indexed correctly", what's your evidence?
> You cannot rely
> on the display in the browser; that's the raw input just as it was sent to
> Solr, _not_ the actual tokens in the index. What do you see when you go to
> the admin schema browser page and load the actual tokens?
>
> Or use the TermsComponent
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> to see the actual terms in the index as opposed to the stored data you see in 
> the browser when you look at search results.
>
> If the actual terms don't seem right _in the index_ we need to see your 
> analysis chain, i.e. your fieldType definition.
>
>> I'm 90% sure you're seeing the stored data and your terms are indexed just 
> fine, but I've certainly been wrong before, more times than I want to 
> remember.
>
> Best,
> Erick
>
> On Thu, Apr 23, 2015 at 1:18 AM,   wrote:
>> Hey Erick,
>>
>> thanks for your answer. They are not indexed correctly. Also, through the
>> solr admin interface I see these typical question marks within a rhombus
>> where a blank space should be.
>> I now figured out the following (not sure if it is relevant at all):
>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>> indexed correctly, no issues
>> - PDF documents (with editable form fields) created with "Adobe
>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>>
>> Best
>> Steve
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Wednesday, 22 April 2015 17:11
>> To: solr-user@lucene.apache.org
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> Are they not _indexed_ correctly or not being displayed correctly?
>> Take a look at admin UI>>schema browser>> your field and press the "load 
>> terms" button. That'll show you what is _in_ the index as opposed to what 
>> the raw data looked like.
>>
>> When you return the field in a Solr search, you get a verbatim, un-analyzed 
>> copy of your original input. My guess is that your browser isn't using the 
>> compatible character encoding for display.
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 22, 2015 at 7:08 AM,   wrote:
>>> Thanks for your answer. Maybe my English is not good enough, what are you 
>>> trying to say? Sorry I didn't get the point.
>>> :-(
>>>
>>>
>>> -Original Message-
>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>>> Sent: Wednesday, 22 April 2015 14:01
>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>>> Subject: Odp.: solr issue with pdf forms
>>>
>>> Off the top of my head, I'd look into how the writable PDFs are created and encoded.
>>>
>>> @LAFK_PL
>>>   Original message
>>> From: steve.sch...@t-systems.com
>>> Sent: Wednesday, 22 April 2015 12:41
>>> To: solr-user@lucene.apache.org
>>> Reply-To: solr-user@lucene.apache.org
>>> Subject: solr issue with pdf forms
>>>
>>> Hi guys,
>>>
>>> hopefully you can help me with my issue. We are using a solr setup and have 
>>> the following issue:
>>> - usual pdf files are indexed just fine
>>> - pdf files with writable form-fields look like this:
>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und
>>> v ollständig sind
>>>
>>> Somehow the blank space character is not indexed correctly.
>>>
>>> Is this a known issue? Does anybody have an idea?
>>>
>>> Thanks a lot
>>> Best
>>> Steve


Re: AW: o.a.s.c.SolrException: missing content stream

2015-04-24 Thread Chris Hostetter

: Another question I have though (which fits the subject even better):
: In the log I see many
: org.apache.solr.common.SolrException: missing content stream
...
: What are possible reasons for this?

The possible and likely reasons are that you sent an "update" request w/o
any ContentStream (ie: no data to update).



-Hoss
http://www.lucidworks.com/


Re: and stopword in user query is being change to q.op=AND

2015-04-24 Thread Chris Hostetter

: I was under the impression that stopwords are filtered even before being
: parsed by the search handler. I do have the filter in the collection schema to
: filter stopwords, and the analysis shows that this stopword is filtered

Generally speaking, your understanding of the order of operations for 
query parsing (regardless of the parser) and analysis (regardless of the 
fields/analyzers/filters/etc...) is backwards.


the query parser gets, as its input, the query string (as a *single* 
string) and the request params.  it inspects/parses the string according 
to its rules & options & syntax, and based on what it finds in that string 
(and in other request params) it passes some/all of that string to the 
analyzer for one or more fields, and uses the results of those analyzers 
as the terms for building up a query structure.

ask yourself: if the raw user query input was first passed to an analyzer 
(for stop word filtering as you suggest) before being passed to the 
query parser -- how would solr know what analyzer to use?  in many parsers 
(like lucene and edismax) the fields to use can be specified *inside* the 
query string itself

likewise: how would you ensure that syntactically significant string 
sequences (like "(" and ":" and "AND" etc..) that an analyzer might 
normally strip out based on the tokenizer/tokenfilters would be preserved 
so that the query parser could have them and use them to drive the 
resulting query structure?



-Hoss
http://www.lucidworks.com/


Re: and stopword in user query is being change to q.op=AND

2015-04-24 Thread Shawn Heisey
On 4/24/2015 10:55 AM, Rajesh Hazari wrote:
> I was under the impression that stopwords are filtered even before
> being parsed by the search handler. I do have the filter in the collection
> schema to filter stopwords, and the analysis shows that this stopword
> is filtered
>
> Analysis response :  attached is the solr analysis json response.

There is a combination of things happening here.

The "lowercaseOperators" parameter for edismax defaults to true ...
which means that the "and" in your query is being interpreted as "AND"
-- a boolean operator -- it's not making it to analysis.

Because the query now has boolean operators, you are running into this -
a bug that we have had in our tracker for a VERY long time:

https://issues.apache.org/jira/browse/SOLR-2649

Side note:  I personally feel that lowercaseOperators should default to
false, but I haven't made any effort to get it changed.
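
If the lowercase-operator handling is getting in your way, you can switch it
off per request (or in the handler defaults).  Roughly, from SolrJ (5.x style)
-- just a sketch, the collection and qf field names here are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class EdismaxWithoutLowercaseOps {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("derek and romance");
    q.set("defType", "edismax");
    q.set("qf", "textSpell");
    q.set("lowercaseOperators", "false");  // leave "and"/"or" to the analyzers
    System.out.println(client.query(q).getResults().getNumFound());
    client.close();
  }
}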

Thanks,
Shawn



Re: and stopword in user query is being change to q.op=AND

2015-04-24 Thread Rajesh Hazari
I was under the impression that stopwords are filtered even before being
parsed by the search handler. I do have the filter in the collection schema to
filter stopwords, and the analysis shows that this stopword is filtered.

Analysis response :  attached is the solr analysis json response.

[image: Inline image 1]
Schema definition :

  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />



*shouldn't the final filter query terms be sent to search handler?*

*Thanks,*
*Rajesh**.*

On Thu, Apr 23, 2015 at 2:56 PM, Chris Hostetter 
wrote:

>
> : And stopword  in user query is being changed to q.op=AND, i am going to
> : look more into this
>
> This is an explicitly documented feature of the edismax parser...
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
>
> * treats "and" and "or" as "AND" and "OR" in Lucene syntax mode.
>
> ...
>
> The lowercaseOperators Parameter
>
> A Boolean parameter indicating if lowercase "and" and "or" should be
> treated the same as operators "AND" and "OR".
>
>
>
>
> : i thought of sharing this in solr community just in-case if someone have
> : came across this issue.
> : OR
> : I will also be validating my config and schema if i am doing something
> : wrong.
> :
> : solr : 4.9
> : query parser: edismax
> :
> : when i search for "*q=derek and romace*" final parsed query is
> : *"(+(+DisjunctionMaxQuery((textSpell:derek))
> : +DisjunctionMaxQuery((textSpell:romance/no_coord" *
> : *
> :   "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]*
> :
> : when i search for "*q=derek romace*" final parsed query is
> *"parsedquery":
> : "(+(DisjunctionMaxQuery((textSpell:derek))
> : DisjunctionMaxQuery((textSpell:romance/no_coord",*
> : *response": {*
> : *"numFound": 1405,*
> : *"start": 0,*
> : *"maxScore": 0.2780709,*
> : *"docs": [.*
> :
> : textSpell field definition :
> :
> :  : omitNorms="true" multiValued="true" />
> :
> :  : positionIncrementGap="100">
> :   
> : 
> :  : words="stopwords.txt" />
> : 
> :  : protected="protwords.txt"/>
> :   
> :   
> : 
> :  : words="stopwords.txt" />
> :  : ignoreCase="true" expand="false"  />
> : 
> :  : protected="protwords.txt"/>
> :   
> : 
> :
> : Let me know if anyone of you guys need more info.
> :
> : *Thanks,*
> : *Rajesh**.*
> :
>
> -Hoss
> http://www.lucidworks.com/
>


analysis.json
Description: application/json


Re: Checking of Solr Memory and Disk usage

2015-04-24 Thread Zheng Lin Edwin Yeo
Meaning this was working fine until Solr 5.0.0? I'm quite new to Solr and I
only started to use it when Solr 5.0.0 was released.

Regards,
Edwin

On 24 April 2015 at 18:20, Tom Evans  wrote:

> On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo
>  wrote:
> > Hi,
> >
> > So does anyone know what the issue is with the "Heap Memory Usage"
> > reading showing the value -1? Should I open an issue in Jira?
>
> I have solr 4.8.1 and solr 5.0.0 servers, on the solr 4.8.1 servers
> the core statistics have values for heap memory, on the solr 5.0.0
> ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK
> on both versions.
>
> I don't see this issue in the fixed bugs in 5.1.0, but I only looked
> at the headlines of the tickets..
>
> http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes
>
> Cheers
>
> Tom
>


RE: Remote connection to Solr

2015-04-24 Thread Garth Grimm
Shawn's explanation fits better with why Websphere and Jetty might behave 
differently.  But something else that might be happening could be if the DHCP 
negotiation causes the IP address to change from one network to another and 
back.

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Friday, April 24, 2015 9:23 AM
To: solr-user@lucene.apache.org
Subject: Re: Remote connection to Solr

Hi Shawn,

The firewall was the first thing I looked into, and after fiddling with it, I 
still see the issue.  But if that were the issue, why doesn't WebSphere run into 
it while Jetty does?  However, your point about domain / non-domain and private / 
public networks may give me some new areas to look into.

Thanks

Steve

On Fri, Apr 24, 2015 at 10:11 AM, Shawn Heisey  wrote:

> On 4/24/2015 8:03 AM, Steven White wrote:
> > This maybe a Jetty question but let me start here first.
> >
> > I have Solr running on my laptop and from my desktop I have no issue 
> > accessing it.  However, if I take my laptop home and connect it to 
> > my
> home
> > network, the next day when I connect the laptop to my office 
> > network, I
> no
> > longer can access Solr from my desktop.  A restart of Solr will not 
> > do,
> the
> > only fix is to restart my Windows 8.1 OS (that's what's on my laptop).
> >
> > I have not been able to figure out why this is happening and I'm
> suspecting
> > it has to do something with Jetty because I have Solr 3.6 running on 
> > my laptop in a WebSphere profile and it does not run into this issue.
> >
> > Any ideas what could be causing this?  Is this question for the 
> > Jetty mailing list?
>
> I'm guessing the Windows firewall is the problem here.  I'm betting 
> your computer is detecting your home network and the office network as 
> two different types (one as domain, the other as private, possibly), 
> and that the Windows firewall only allows connections to Jetty when 
> you are on one of those types of networks.  The websphere install may 
> have added explicit firewall exceptions for all network types when it was 
> installed.
>
> Fiddling with the firewall exceptions is probably the way to fix this.
>
> Thanks,
> Shawn
>
>


Re: Remote connection to Solr

2015-04-24 Thread Steven White
Hi Shawn,

The firewall was the first thing I looked into, and after fiddling with it,
I still see the issue.  But if that were the issue, why doesn't WebSphere
run into it while Jetty does?  However, your point about domain / non-domain
and private / public networks may give me some new areas to look
into.

Thanks

Steve

On Fri, Apr 24, 2015 at 10:11 AM, Shawn Heisey  wrote:

> On 4/24/2015 8:03 AM, Steven White wrote:
> > This maybe a Jetty question but let me start here first.
> >
> > I have Solr running on my laptop and from my desktop I have no issue
> > accessing it.  However, if I take my laptop home and connect it to my
> home
> > network, the next day when I connect the laptop to my office network, I
> no
> > longer can access Solr from my desktop.  A restart of Solr will not do,
> the
> > only fix is to restart my Windows 8.1 OS (that's what's on my laptop).
> >
> > I have not been able to figure out why this is happening and I'm
> suspecting
> > it has to do something with Jetty because I have Solr 3.6 running on my
> > laptop in a WebSphere profile and it does not run into this issue.
> >
> > Any ideas what could be causing this?  Is this question for the Jetty
> > mailing list?
>
> I'm guessing the Windows firewall is the problem here.  I'm betting your
> computer is detecting your home network and the office network as two
> different types (one as domain, the other as private, possibly), and
> that the Windows firewall only allows connections to Jetty when you are
> on one of those types of networks.  The websphere install may have add
> explicit firewall exceptions for all network types when it was installed.
>
> Fiddling with the firewall exceptions is probably the way to fix this.
>
> Thanks,
> Shawn
>
>


Re: ArrayIndexOutOfBoundsException in RecordingJSONParser.java

2015-04-24 Thread Scott Dawson
Ticket opened: https://issues.apache.org/jira/i#browse/SOLR-7462

Thanks,
Scott

On Fri, Apr 24, 2015 at 9:38 AM, Shawn Heisey  wrote:

> On 4/24/2015 7:16 AM, Scott Dawson wrote:
> > Should I create a JIRA ticket? (Am I allowed to?)  I can provide more
> info
> > about my particular usage including a stacktrace if that's helpful. I'm
> > using the new custom JSON indexing, which, by the way, is an excellent
> > feature and will be of great benefit to my project. Thanks for that.
>
> Ouch.  Thanks for finding the bug!
>
> Anyone can create an account on the Apache Jira and then create issues.
>  Please do!  The issue for this bug would go in the SOLR project.
>
> https://issues.apache.org/jira/browse/SOLR
>
> Thanks,
> Shawn
>
>


Re: Remote connection to Solr

2015-04-24 Thread Shawn Heisey
On 4/24/2015 8:03 AM, Steven White wrote:
> This maybe a Jetty question but let me start here first.
> 
> I have Solr running on my laptop and from my desktop I have no issue
> accessing it.  However, if I take my laptop home and connect it to my home
> network, the next day when I connect the laptop to my office network, I no
> longer can access Solr from my desktop.  A restart of Solr will not do, the
> only fix is to restart my Windows 8.1 OS (that's what's on my laptop).
> 
> I have not been able to figure out why this is happening and I'm suspecting
> it has to do something with Jetty because I have Solr 3.6 running on my
> laptop in a WebSphere profile and it does not run into this issue.
> 
> Any ideas what could be causing this?  Is this question for the Jetty
> mailing list?

I'm guessing the Windows firewall is the problem here.  I'm betting your
computer is detecting your home network and the office network as two
different types (one as domain, the other as private, possibly), and
that the Windows firewall only allows connections to Jetty when you are
on one of those types of networks.  The websphere install may have add
explicit firewall exceptions for all network types when it was installed.

Fiddling with the firewall exceptions is probably the way to fix this.

Thanks,
Shawn



Remote connection to Solr

2015-04-24 Thread Steven White
Hi Everyone,

This maybe a Jetty question but let me start here first.

I have Solr running on my laptop and from my desktop I have no issue
accessing it.  However, if I take my laptop home and connect it to my home
network, the next day when I connect the laptop to my office network, I no
longer can access Solr from my desktop.  A restart of Solr will not do, the
only fix is to restart my Windows 8.1 OS (that's what's on my laptop).

I have not been able to figure out why this is happening, and I suspect
it has something to do with Jetty, because I have Solr 3.6 running on my
laptop in a WebSphere profile and it does not run into this issue.

Any ideas what could be causing this?  Is this question for the Jetty
mailing list?

Thanks

Steve


Re: ArrayIndexOutOfBoundsException in RecordingJSONParser.java

2015-04-24 Thread Shawn Heisey
On 4/24/2015 7:16 AM, Scott Dawson wrote:
> Should I create a JIRA ticket? (Am I allowed to?)  I can provide more info
> about my particular usage including a stacktrace if that's helpful. I'm
> using the new custom JSON indexing, which, by the way, is an excellent
> feature and will be of great benefit to my project. Thanks for that.

Ouch.  Thanks for finding the bug!

Anyone can create an account on the Apache Jira and then create issues.
 Please do!  The issue for this bug would go in the SOLR project.

https://issues.apache.org/jira/browse/SOLR

Thanks,
Shawn



Re: payload similarity

2015-04-24 Thread Dmitry Kan
Ahmet, exactly. As I have just illustrated with code, simultaneously with
your reply. Thanks!

On Fri, Apr 24, 2015 at 4:30 PM, Ahmet Arslan 
wrote:

> Hi Dmitry,
>
> I think, it is activated by PayloadTermQuery.
>
> Ahmet
>
>
>
> On Friday, April 24, 2015 2:51 PM, Dmitry Kan 
> wrote:
> Hi,
>
>
> Using the approach here
> http://lucidworks.com/blog/getting-started-with-payloads/ I have
> implemented my own PayloadSimilarity class. When debugging the code I have
> noticed, that the scorePayload method is never called. What could be wrong?
>
>
> [code]
>
> class PayloadSimilarity extends DefaultSimilarity {
> @Override
> public float scorePayload(int doc, int start, int end, BytesRef
> payload) {
> float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
> System.out.println("payloadValue = " + payloadValue);
> return payloadValue;
> }
> }
>
> [/code]
>
>
> Here is how the similarity is injected during indexing:
>
> [code]
>
> PayloadEncoder encoder = new FloatEncoder();
> IndexWriterConfig indexWriterConfig = new
> IndexWriterConfig(Version.LUCENE_4_10_4, new
> PayloadAnalyzer(encoder));
> payloadSimilarity = new PayloadSimilarity();
> indexWriterConfig.setSimilarity(payloadSimilarity);
> IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
>
> [/code]
>
>
> and during searching:
>
> [code]
>
> IndexReader indexReader = DirectoryReader.open(dir);
> IndexSearcher searcher = new IndexSearcher(indexReader);
> searcher.setSimilarity(payloadSimilarity);
>
> TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
> termQuery.setBoost(1.1f);
> TopDocs topDocs = searcher.search(termQuery, 10);
> printResults(searcher, termQuery, topDocs);
>
>
> [/code]
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info
>



-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: payload similarity

2015-04-24 Thread Dmitry Kan
Answering my own question:

in order to account for payloads, PayloadTermQuery should be used instead
of TermQuery:

PayloadTermQuery payloadTermQuery =
        new PayloadTermQuery(new Term("body", "dogs"), new MaxPayloadFunction());

Then in the query explanation we get:

---
Results for body:dogs of type:
org.apache.lucene.search.payloads.PayloadTermQuery
Doc: doc=0 score=3.125 shardIndex=-1
payloadValue = 10.0
Explain: 3.125 = (MATCH) btq, product of:
  0.3125 = weight(body:dogs in 0) [PayloadSimilarity], result of:
0.3125 = fieldWeight in 0, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
  1.0 = idf(docFreq=3, maxDocs=10)
  0.3125 = fieldNorm(doc=0)
  10.0 = MaxPayloadFunction.docScore()

Doc: doc=9 score=3.125 shardIndex=-1
payloadValue = 10.0
Explain: 3.125 = (MATCH) btq, product of:
  0.3125 = weight(body:dogs in 9) [PayloadSimilarity], result of:
0.3125 = fieldWeight in 9, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
  1.0 = idf(docFreq=3, maxDocs=10)
  0.3125 = fieldNorm(doc=9)
  10.0 = MaxPayloadFunction.docScore()

Doc: doc=1 score=0.3125 shardIndex=-1
Explain: 0.3125 = (MATCH) btq, product of:
  0.3125 = weight(body:dogs in 1) [PayloadSimilarity], result of:
0.3125 = fieldWeight in 1, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
  1.0 = idf(docFreq=3, maxDocs=10)
  0.3125 = fieldNorm(doc=1)
  1.0 = MaxPayloadFunction.docScore()

On Fri, Apr 24, 2015 at 2:50 PM, Dmitry Kan  wrote:

> Hi,
>
>
> Using the approach here
> http://lucidworks.com/blog/getting-started-with-payloads/ I have
> implemented my own PayloadSimilarity class. When debugging the code I have
> noticed, that the scorePayload method is never called. What could be wrong?
>
>
> [code]
>
> class PayloadSimilarity extends DefaultSimilarity {
> @Override
> public float scorePayload(int doc, int start, int end, BytesRef payload) {
> float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
> System.out.println("payloadValue = " + payloadValue);
> return payloadValue;
> }
> }
>
> [/code]
>
>
> Here is how the similarity is injected during indexing:
>
> [code]
>
> PayloadEncoder encoder = new FloatEncoder();
> IndexWriterConfig indexWriterConfig = new 
> IndexWriterConfig(Version.LUCENE_4_10_4, new PayloadAnalyzer(encoder));
> payloadSimilarity = new PayloadSimilarity();
> indexWriterConfig.setSimilarity(payloadSimilarity);
> IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
>
> [/code]
>
>
> and during searching:
>
> [code]
>
> IndexReader indexReader = DirectoryReader.open(dir);
> IndexSearcher searcher = new IndexSearcher(indexReader);
> searcher.setSimilarity(payloadSimilarity);
>
> TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
> termQuery.setBoost(1.1f);
> TopDocs topDocs = searcher.search(termQuery, 10);
> printResults(searcher, termQuery, topDocs);
>
>
> [/code]
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info
>
>


-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: payload similarity

2015-04-24 Thread Ahmet Arslan
Hi Dmitry,

I think, it is activated by PayloadTermQuery.

Ahmet



On Friday, April 24, 2015 2:51 PM, Dmitry Kan  wrote:
Hi,


Using the approach here
http://lucidworks.com/blog/getting-started-with-payloads/ I have
implemented my own PayloadSimilarity class. When debugging the code I have
noticed, that the scorePayload method is never called. What could be wrong?


[code]

class PayloadSimilarity extends DefaultSimilarity {
@Override
public float scorePayload(int doc, int start, int end, BytesRef payload) {
float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
System.out.println("payloadValue = " + payloadValue);
return payloadValue;
}
}

[/code]


Here is how the similarity is injected during indexing:

[code]

PayloadEncoder encoder = new FloatEncoder();
IndexWriterConfig indexWriterConfig = new
IndexWriterConfig(Version.LUCENE_4_10_4, new
PayloadAnalyzer(encoder));
payloadSimilarity = new PayloadSimilarity();
indexWriterConfig.setSimilarity(payloadSimilarity);
IndexWriter writer = new IndexWriter(dir, indexWriterConfig);

[/code]


and during searching:

[code]

IndexReader indexReader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(indexReader);
searcher.setSimilarity(payloadSimilarity);

TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
termQuery.setBoost(1.1f);
TopDocs topDocs = searcher.search(termQuery, 10);
printResults(searcher, termQuery, topDocs);


[/code]

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


ArrayIndexOutOfBoundsException in RecordingJSONParser.java

2015-04-24 Thread Scott Dawson
Hello,
I'm running Solr 5.1 and during indexing I get an
ArrayIndexOutOfBoundsException at line 61 of
org/apache/solr/util/RecordingJSONParser.java. Looking at the code (see
below), it seems obvious that the if-statement at line 60 should use a
greater-than sign instead of greater-than-or-equals.

  @Override
  public CharArr getStringChars() throws IOException {
    CharArr chars = super.getStringChars();
    recordStr(chars.toString());
    position = getPosition();
    // if reading a String, getStringChars does not return the closing
    // single quote or double quote, so try to capture that
    if (chars.getArray().length >= chars.getStart() + chars.size()) {  // line 60
      char next = chars.getArray()[chars.getStart() + chars.size()];   // line 61
      if (next == '"' || next == '\'') {
        recordChar(next);
      }
    }
    return chars;
  }
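
In other words, the guard at line 60 would presumably need to be:

    if (chars.getArray().length > chars.getStart() + chars.size()) {  // '>' instead of '>='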

Should I create a JIRA ticket? (Am I allowed to?)  I can provide more info
about my particular usage including a stacktrace if that's helpful. I'm
using the new custom JSON indexing, which, by the way, is an excellent
feature and will be of great benefit to my project. Thanks for that.

Regards,
Scott Dawson


Re: SolrCloud to exclude xslt files in conf from zookeeper

2015-04-24 Thread Shawn Heisey
On 4/24/2015 4:54 AM, Kumaradas Puthussery Krishnadas wrote:
> I am creating a SolrCloud with 4 solr instances and 5 zookeeper instances. I
> need to make sure that querying keeps working even when 3 of my zookeepers are
> down. But it looks like the queries use XSLT templates for the JSON
> transformation, and these are not available when the zookeeper ensemble is not
> available.

Is it five zookeeper or three?  It sounds like you might have five, but
three of them are down for some reason.

When you have five zookeepers, you can lose two and maintain quorum.  If
you lose three, then zookeeper doesn't have enough nodes to work
properly, and SolrCloud will also stop normal operation.  This is a
fundamental property of zookeeper.  There must be a majority of nodes
operational -- more than half.

If you have three zookeepers, you can lose only one and still maintain
quorum.

Some information about zookeeper that isn't directly applicable to your
situation but may help explain why zookeeper behaves the way it does:
Exactly half of the total nodes is not enough.  If you have four
zookeepers, two is not enough for quorum, three of them must be
operational and able to communicate with each other.  This is to prevent
split-brain where two clusters are formed that cannot communicate with
each other but independently believe that they are the functional cluster.
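
The general rule of thumb (standard ZooKeeper arithmetic, nothing
Solr-specific) is:

    quorum(N)          = floor(N/2) + 1
    tolerated failures = N - quorum(N)

    N = 3  ->  quorum 2, tolerates 1 failure
    N = 4  ->  quorum 3, tolerates 1 failure
    N = 5  ->  quorum 3, tolerates 2 failures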

> So is it possible to exclude files (eg: the xslt folder) in the conf directory
> from being loaded into Zookeeper, and rather point them to the filesystem, so
> that querying the solrcloud is not broken?

One of the major points of putting the config in zookeeper is to
centralize it and have zero reliance on local config files, which may be
different on each Solr instance.  Consider a cloud with five hundred
nodes.  A centralized config is the only way to be absolutely certain
that every node has the update.  You are welcome to file a feature
request in Jira for the capability you want, but you may encounter
resistance to actually getting it into Solr.

If you lose zookeeper quorum, then SolrCloud has no choice other than
stopping normal operation, to protect the integrity of the cloud.  Any
other action could lead to data loss.  Zookeeper is a fundamental part
of SolrCloud, so if your Zookeeper ensemble is not healthy, neither is
SolrCloud.

Thanks,
Shawn



AW: o.a.s.c.SolrException: missing content stream

2015-04-24 Thread Clemens Wyss DEV
Stupid me (yet again):
Should have used a TEXT field instead of (only) a STRING field for the content ;)
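
For anyone hitting the same exception, the difference looks roughly like this
(a sketch only -- the field name is made up and "text_general" is just the
tokenized type shipped in the default schema): a string field indexes the whole
value as a single term, so anything over 32766 bytes blows up, while a
tokenized text type splits the value into terms and avoids the per-term limit.

    <!-- before: single-term string field, fails on values > 32766 bytes -->
    <field name="content" type="string" indexed="true" stored="true"/>

    <!-- after: tokenized text field, no whole-value term -->
    <field name="content" type="text_general" indexed="true" stored="true"/>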

Another question I have though (which fits the subject even better):
In the log I see many
org.apache.solr.common.SolrException: missing content stream
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
...
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at 
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628)
at 
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)

What are possible reasons for this?
Thx
Clemens
-Original Message-
From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
Sent: Friday, 24 April 2015 14:01
To: solr-user@lucene.apache.org
Subject: o.a.s.c.SolrException: missing content stream

Context: Solr/Lucene 5.1
Adding documents to Solr core/index through SolrJ

I extract pdf's using tika. The pdf-content is one of the fields of my 
SolrDocuments that are transmitted to Solr using SolrJ.
As not all documents seem to be "coming through", I looked into the Solr logs
and see the following exceptions:
org.apache.solr.common.SolrException: Exception writing document id 
fustusermanuals#4614 to the index; possible analysis error.
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
...
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source) Caused by: 
java.lang.IllegalArgumentException: Document contains at least one immense term 
in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max 
length 32766), all of which were skipped.  Please correct the analyzer to not 
produce such terms.  The prefix of the first immense term is: '[10, 32, 10, 32, 
10, 10, 70, 82, 32, 77, 111, 100, 101, 32, 100, 39, 101, 109, 112, 108, 111, 
105, 32, 10, 10, 32, 10, 10, 32, 10]...', original message: bytes can be at 
most 32766 in length; got 186493
at 
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
at 
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at 
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1349)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: 
bytes can be at most 32766 in length; got 186493
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
at 
org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
at 
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingCh

Re: Simple search low speed

2015-04-24 Thread Joel Bernstein
Try breaking down the query to see which part of it is slow. If it turns
out to be the range query you may want to look into using an frange
postfilter.
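
The mechanics look like this (a rough sketch -- cache=false plus a cost of 100
or more is what pushes the filter into the post-filtering stage; shown here on
the numeric field1 just to illustrate the syntax, mirroring the fq=field1:(150)
from your query):

    fq={!frange l=150 u=150 cache=false cost=200}field(field1)

The same idea applies to the date range by filtering on ms(date), with the
bounds given as epoch milliseconds.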

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Apr 24, 2015 at 6:50 AM, Norgorn  wrote:

> Thanks for your reply.
>
> Yes, 100% CPU is used by SOLR (100% - I mean 1 core, not all cores), I'm
> totally sure.
>
> I have more than 80 GB RAM on test machine and about 50 is cached as disk
> cache, SOLR uses about 8, Xmx=40G.
>
> I use GC1, but it can't be the problem, cause memory usage is much lower
> than GC start limit (45% of heap).
>
> I think the problem could be the fully optimized index: search over one
> big segment may be much slower than parallel search over a lot of segments,
> but it sounds weird, so I'm not sure.
> The setups with big indexes that I know of all use optimized indexes.
>
> Index scheme:
>  termVectors="true" termPositions="true" termOffsets="true" />
> termVectors="true" termPositions="true" termOffsets="true" />
>
>  multiValued="false" required="true" omitNorms="true"
> omitTermFreqAndPositions="true"/>
>
>  omitNorms="true" omitTermFreqAndPositions="true"/>
>
>  required="false" omitNorms="true" omitTermFreqAndPositions="true"/>
>  required="false" omitNorms="true" omitTermFreqAndPositions="true"/>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202157.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


o.a.s.c.SolrException: missing content stream

2015-04-24 Thread Clemens Wyss DEV
Context: Solr/Lucene 5.1
Adding documents to Solr core/index through SolrJ

I extract pdf's using tika. The pdf-content is one of the fields of my 
SolrDocuments that are transmitted to Solr using SolrJ.
As not all documents seem to be "coming through", I looked into the Solr logs
and see the following exceptions:
org.apache.solr.common.SolrException: Exception writing document id 
fustusermanuals#4614 to the index; possible analysis error.
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
...
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one 
immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer 
than the max length 32766), all of which were skipped.  Please correct the 
analyzer to not produce such terms.  The prefix of the first immense term is: 
'[10, 32, 10, 32, 10, 10, 70, 82, 32, 77, 111, 100, 101, 32, 100, 39, 101, 109, 
112, 108, 111, 105, 32, 10, 10, 32, 10, 10, 32, 10]...', original message: 
bytes can be at most 32766 in length; got 186493
at 
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
at 
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at 
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1349)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: 
bytes can be at most 32766 in length; got 186493
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
at 
org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
at 
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
... 47 more

How can I tell Solr/SolrJ to allow more payload?

I also see some
org.apache.solr.common.SolrException: Exception writing document id 
fustusermanuals#3323 to the index; possible analysis error.
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:697)
...
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one 
immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer 
than the max length 32766), all of which were skipped.  Please correct the 
analyzer to not produce such terms.  The prefix of the first immense term is: 
'[10, 69, 78, 32, 76, 67, 68, 32, 116, 101, 108, 101, 118, 105, 115, 105, 111, 
110, 10, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95]...', original message: 
bytes can be at most 32766 in length; got 164683
at 
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
at 
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at 
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(Do

payload similarity

2015-04-24 Thread Dmitry Kan
Hi,


Using the approach here
http://lucidworks.com/blog/getting-started-with-payloads/ I have
implemented my own PayloadSimilarity class. When debugging the code I have
noticed, that the scorePayload method is never called. What could be wrong?


[code]

class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        float payloadValue = PayloadHelper.decodeFloat(payload.bytes);
        System.out.println("payloadValue = " + payloadValue);
        return payloadValue;
    }
}

[/code]


Here is how the similarity is injected during indexing:

[code]

PayloadEncoder encoder = new FloatEncoder();
IndexWriterConfig indexWriterConfig =
        new IndexWriterConfig(Version.LUCENE_4_10_4, new PayloadAnalyzer(encoder));
payloadSimilarity = new PayloadSimilarity();
indexWriterConfig.setSimilarity(payloadSimilarity);
IndexWriter writer = new IndexWriter(dir, indexWriterConfig);

[/code]


and during searching:

[code]

IndexReader indexReader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(indexReader);
searcher.setSimilarity(payloadSimilarity);

TermQuery termQuery = new TermQuery(new Term("body", "dogs"));
termQuery.setBoost(1.1f);
TopDocs topDocs = searcher.search(termQuery, 10);
printResults(searcher, termQuery, topDocs);


[/code]

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


SolrCloud to exclude xslt files in conf from zookeeper

2015-04-24 Thread Kumaradas Puthussery Krishnadas
I am creating a SolrCloud with 4 solr instances and 5 zookeeper instances. I
need to make sure that querying keeps working even when 3 of my zookeepers are
down. But it looks like the queries use XSLT templates for the JSON
transformation, and these are not available when the zookeeper ensemble is not
available.

So is it possible to exclude files (eg: the xslt folder) in the conf directory
from being loaded into Zookeeper, and rather point them to the filesystem, so
that querying the solrcloud is not broken?

Thanks
Kumar


Re: Simple search low speed

2015-04-24 Thread Norgorn
Thanks for your reply.

Yes, 100% CPU is used by SOLR (100% - I mean 1 core, not all cores), I'm
totally sure.

I have more than 80 GB RAM on test machine and about 50 is cached as disk
cache, SOLR uses about 8, Xmx=40G.

I use GC1, but it can't be the problem, cause memory usage is much lower
than GC start limit (45% of heap).

I think the problem could be the fully optimized index: search over one
big segment may be much slower than parallel search over a lot of segments,
but it sounds weird, so I'm not sure.
The setups with big indexes that I know of all use optimized indexes.

Index scheme:

   










--
View this message in context: 
http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202157.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Simple search low speed

2015-04-24 Thread Tomasz Borek
Java side:
- launch jvisualvm
- see how heap and CPU are occupied

What are your JVM settings (heap) and how much RAM do you have?

Is the CPU 100% used only by Solr? That is, are you 100% certain it's Solr
that drives the CPU to its limit?

pozdrawiam,
LAFK

2015-04-24 12:14 GMT+02:00 Norgorn :

> The number of documents in collection is about 100m.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202152.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Checking of Solr Memory and Disk usage

2015-04-24 Thread Tom Evans
On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo
 wrote:
> Hi,
>
> So does anyone know what the issue is with the "Heap Memory Usage" reading
> showing the value -1? Should I open an issue in Jira?

I have solr 4.8.1 and solr 5.0.0 servers, on the solr 4.8.1 servers
the core statistics have values for heap memory, on the solr 5.0.0
ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK
on both versions.

I don't see this issue in the fixed bugs in 5.1.0, but I only looked
at the headlines of the tickets..

http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes

Cheers

Tom


Re: Simple search low speed

2015-04-24 Thread Norgorn
The number of documents in collection is about 100m.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135p4202152.html
Sent from the Solr - User mailing list archive at Nabble.com.


require diversity in results?

2015-04-24 Thread Paul Libbrecht
Hello list,

I'm wondering if there could be extra parameters or query operators with
which I could impose that sorting by relevance be relaxed so that
there's a minimum diversity in some fields on the first page of results.

For example, I'd like the search results to contain at least three
possible types of resources on the first page, fetching things from below
if needed.
I know that could be done as a search result post-processor but I think
that this is generally a bad idea for performance.

Any other idea?

thanks

Paul



signature.asc
Description: OpenPGP digital signature


AW: Odp.: solr issue with pdf forms

2015-04-24 Thread Steve.Scholl
Hey Erick,

thanks a lot for your answer. I went to the admin schema browser, but what
should I see there? Sorry, I'm not familiar with the admin schema browser. :-(

Best
Steve


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, 23 April 2015 18:00
To: solr-user@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

When you say "they're not indexed correctly", what's your evidence?
You cannot rely
on the display in the browser; that's the raw input just as it was sent to
Solr, _not_ the actual tokens in the index. What do you see when you go to the
admin schema browser page and load the actual tokens?

Or use the TermsComponent
(https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
to see the actual terms in the index as opposed to the stored data you see in 
the browser when you look at search results.

If the actual terms don't seem right _in the index_ we need to see your 
analysis chain, i.e. your fieldType definition.

I'm 90% sure you're seeing the stored data and your terms are indexed just 
fine, but I've certainly been wrong before, more times than I want to 
remember.

Best,
Erick

On Thu, Apr 23, 2015 at 1:18 AM,   wrote:
> Hey Erick,
>
> thanks for your answer. They are not indexed correctly. Also, through the
> solr admin interface I see these typical question marks within a rhombus where
> a blank space should be.
> I now figured out the following (not sure if it is relevant at all):
> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are 
> indexed correctly, no issues
> - PDF documents (with editable form fields) created with "Adobe 
> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>
> Best
> Steve
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, 22 April 2015 17:11
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> Are they not _indexed_ correctly or not being displayed correctly?
> Take a look at admin UI>>schema browser>> your field and press the "load 
> terms" button. That'll show you what is _in_ the index as opposed to what the 
> raw data looked like.
>
> When you return the field in a Solr search, you get a verbatim, un-analyzed 
> copy of your original input. My guess is that your browser isn't using the 
> compatible character encoding for display.
>
> Best,
> Erick
>
> On Wed, Apr 22, 2015 at 7:08 AM,   wrote:
>> Thanks for your answer. Maybe my English is not good enough, what are you 
>> trying to say? Sorry I didn't get the point.
>> :-(
>>
>>
>> -Original Message-
>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>> Sent: Wednesday, 22 April 2015 14:01
>> To: solr-user@lucene.apache.org
>> Subject: Odp.: solr issue with pdf forms
>>
>> Off the top of my head, I'd look into how the writable PDFs are created and encoded.
>>
>> @LAFK_PL
>>   Original message
>> From: steve.sch...@t-systems.com
>> Sent: Wednesday, 22 April 2015 12:41
>> To: solr-user@lucene.apache.org
>> Reply-To: solr-user@lucene.apache.org
>> Subject: solr issue with pdf forms
>>
>> Hi guys,
>>
>> hopefully you can help me with my issue. We are using a solr setup and have 
>> the following issue:
>> - usual pdf files are indexed just fine
>> - pdf files with writable form-fields look like this:
>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und 
>> v ollständig sind
>>
>> Somehow the blank space character is not indexed correctly.
>>
>> Is this a known issue? Does anybody have an idea?
>>
>> Thanks a lot
>> Best
>> Steve


Simple search low speed

2015-04-24 Thread Norgorn
We have a simple search over a 50 GB index.
And it's slow.
I can't figure out why: the whole index is in RAM (and a lot of free space is
available) and the CPU is the bottleneck (100% load).

The query is simple (except tvrh):

q=(text:(word1+word2)++title:(word1+word2))&tv=true&isShard=true&qt=/tvrh&fq=cat:(10+11+12)&fq=field1:(150)&fq=field2:(0)&fq=date:[2015-01-01T00:00:00Z+TO+2015-04-24T23:59:59Z]

text, title - text_general fields
cat, field1, field2 - tint fields
date - a date field (I know, it's deprecated, will be changed soon).
All fields are indexed, some of them are stored.

And the search time is 15 seconds (for a warmed searcher; it's not the first
query).

debug=true shows timings process={time=15382.0,query={time=15282.0}

What can I check?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Simple-search-low-speed-tp4202135.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Grouping Performance Optimation

2015-04-24 Thread Norgorn
If you need only 200 results grouped, you can easily do it with some external
code; it will be much faster anyway.
Also, it's widely suggested to use docValues="true" for the fields that
grouping is performed on; it really helps (I can only give numbers in terms of
RAM usage, but speed increases as well).
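
For example, the group field definition would look something like this (just a
sketch, the field name is made up):

    <field name="group_key" type="string" indexed="true" stored="false" docValues="true"/>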



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grouping-Performance-Optimation-tp4201886p4202133.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Checking of Solr Memory and Disk usage

2015-04-24 Thread Zheng Lin Edwin Yeo
Hi,

So does anyone know what the issue is with the "Heap Memory Usage" reading
showing the value -1? Should I open an issue in Jira?

Regards,
Edwin


On 22 April 2015 at 21:23, Zheng Lin Edwin Yeo  wrote:

> I see. I'm running on SolrCloud with 2 replicas, so I guess mine will
> probably use much more when my system reaches millions of documents.
>
> Regards,
> Edwin
>
>
> On 22 April 2015 at 20:47, Shawn Heisey  wrote:
>
>> On 4/22/2015 12:11 AM, Zheng Lin Edwin Yeo wrote:
>> > Roughly how many collections and how much records do you have in your
>> Solr?
>> >
>> > I have 8 collections with a total of roughly 227000 records, most of
>> which
>> > are CSV records. One of my collections have 142000 records.
>>
>> The core that shows 82MB for heap usage has 16 million documents and is
>> hit with an average of 1 or 2 queries per second.  The entire Solr
>> instance on this machine has about 55 million documents and a 6GB max
>> heap.
>>
>> This is NOT running SolrCloud, though the indexes are distributed.
>> There are 24 cores defined, but during normal operation, only four of
>> them contain documents.  All four of those cores show heap memory values
>> less than 100MB, but the overall heap usage on that machine is measured
>> in gigabytes.
>>
>> Thanks,
>> Shawn
>>
>>
>