date:20150416

On 4/16/2015 7:09 AM, Steven White wrote:
 I cannot use escapeQueryChars method because my app interacts with Solr via
 REST.
 
 The summary of your email is: client's must escape search string to prevent
 Solr from failing.
 
 It would be a nice addition to Solr to provide a new query parameter that
 tells it to treat the query text as literal text.  Doing so, means you
 remove the burden placed on clients to understand and escape reserved Solr
 / Lucene tokens.

That's a good idea, although we might already have that.

I wonder what happens if you include defType=term with your request?
That works for edismax, it might work for other query parsers, at least
on the q parameter.

Thanks,
Shawn

Re: check If I am Still Leader

On 4/16/2015 7:08 AM, Adir Ben Ami wrote:
 I am using Solr 4.10.0 with tomcat and embedded Zookeeper.
 I use SolrCloud in my system.

 Each Shard machine try to reach/connect with other cluster machines in order 
 to index the document ,it just checks if it is still the leader.
  I don't use replication so why does it has to check who is the leader?
 How can I bypass this constraint and make my solrcloud not use 
 ClusterStateUpdater.checkIfIamStillLeader when i am indexing?

You might not need that functionality, but Solr must address the general
case, which includes multiple replicas for each shard,where one of them
will be leader.

I hope this is a test installation ... running in production without
fault tolerance is a bad idea.  Using the embedded zookeeper in
production is another bad idea, for the same reason - fault tolerance.

You can file an issue in Jira for a configuration mode where the leader
check is disabled.  I would oppose having that happen automatically ...
another replica could be added to the cloud at any time.

Thanks,
Shawn

Re: check If I am Still Leader

On 4/16/2015 7:42 AM, Adir Ben Ami wrote:
 I have not mentioned before that the index are always routed to specific 
 machine.
 Is there a way to avoid connectivity from the node to all other nodes? 

That capability has been added in Solr 5.1.0.

https://issues.apache.org/jira/browse/SOLR-6832

Thanks,
Shawn

Batch collecting in PostFilter

2015-04-16 Thread ha.pham

Hi all,

I am implementing a PostFilter following this article
https://lucidworks.com/blog/custom-security-filtering-in-solr/

We have a requirement to call the external system only once for all the 
documents (max 200) so below is my change:

-don't call super.collect(docId) in the collect method of the PostFilter but 
store all docIds in an internal map

-call the external system in the finish() then call super.collect(docId) for 
all the docs that pass the external filtering

The problem I have: docId exceeds maxDoc (docID must be = 0 and  
maxDoc=10 (got docID=123456)

I suspect I am storing local docIds and when Reader is changed, docBase is also 
changed so the global docId, which I believe is constructed in super.collect() 
using the parameter docId and docBase, becomes incorrect.

Could anyone point me to the right direction?

Thanks,

-Ha

Re: Differentiating user search term in Solr

On 4/16/2015 7:49 AM, Steven White wrote:
 defType didn't work:


 http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=truedefType=lucene

 Gave me error:

 org.apache.solr.search.SyntaxError: Expected identifier at pos 27
 str='{!q.op=AND df=text solr sys'

 Is my use of defType correct?

If everything is at defaults and you don't have defType in the handler
definition, then defType=lucene doesn't do anything - it specifically
says use the lucene parser which is the default.  You want
defType=term instead.

Thanks,
Shawn

Re: How can I temporarily detach node from SolrCloud?

On 4/16/2015 8:27 AM, Oded Sofer wrote:
 How can I detach node from SolrCloud (temporarily for maintenance and such 
 and attach it back after some time). We are using SolrCloud 4.10.0; One 
 Collection, and Shard per node. 
 The add-index is routed to specific machine base on our customize routing 
 logic (kind of hard-coded) 

I assume this is just one replica out of multiple ... if that's the
case, just shut the node down, do your maintenance, and bring it back
online.  SolrCloud will automatically make sure the index replica(s) on
the node are brought up to date to match the others.

If it's not one replica of multiple (that is, if it has the only copy of
one or more shards), then shutting it down will either reduce your
result set or cause queries to return an error, not sure which.

Thanks,
Shawn

Conditional Filter Queries

2015-04-16 Thread Tao, Jing

Hi,

I want to filter my search results by different date fields based on content 
type.
In other words: if contentType is A, filter out results that are older than 1 
year; if contentType is B, filter out results that are older than 2 years; 
otherwise, date does not matter.

Is that possible with fq parameters?
Would it be something like  fq=(contentType:A AND startDate:[NOW-1YEAR TO 
NOW]) OR (contentType:B AND startDate:[NOW-2YEAR TO NOW]) OR !contentType: 
(A or B)

Is there a better way to do this?

Thanks,
Jing

Re: Differentiating user search term in Solr

What is term in the defType=term, do you mean the raw word term or
something else?  Because I tried that too in two different ways:

Using correct Solr syntax:


http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text}%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=truedefType=term

This throws a NPE exception:

java.lang.NullPointerException at

org.apache.solr.schema.IndexSchema$DynamicReplacement$DynamicPattern$NameEndsWith.matches(IndexSchema.java:1033)
at

org.apache.solr.schema.IndexSchema$DynamicReplacement.matches(IndexSchema.java:1047)
at
org.apache.solr.schema.IndexSchema.dynFieldType(IndexSchema.java:1303)
at

org.apache.solr.schema.IndexSchema.getFieldTypeNoEx(IndexSchema.java:1280)
at

org.apache.solr.search.TermQParserPlugin$1.parse(TermQParserPlugin.java:56)
at

And when I try it with invalid Solr search syntax:


http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=truedefType=term


This gives me the SyntaxError:

org.apache.solr.search.SyntaxError: Expected identifier at pos 27
str='{!q.op=AND df=text solr sys'

What am I missing?

Steve

On Thu, Apr 16, 2015 at 10:43 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 7:49 AM, Steven White wrote:
  defType didn't work:
 
 
 
 http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=truedefType=lucene
 
  Gave me error:
 
  org.apache.solr.search.SyntaxError: Expected identifier at pos 27
  str='{!q.op=AND df=text solr sys'
 
  Is my use of defType correct?

 If everything is at defaults and you don't have defType in the handler
 definition, then defType=lucene doesn't do anything - it specifically
 says use the lucene parser which is the default.  You want
 defType=term instead.

 Thanks,
 Shawn

Re: Differentiating user search term in Solr

On 4/16/2015 9:37 AM, Steven White wrote:
 What is term in the defType=term, do you mean the raw word term or
 something else?  Because I tried that too in two different ways:

Oops.  I forgot that the term query parser (that's what term means --
the name of the query parser) requires that you specify the field you
are searching on, so that would be incomplete.  Try also setting the f
parameter to the field that you want to search.  I will not be surprised
if that doesn't work, though.

Thanks,
Shawn

Re: Merge indexes in MapReduce

You're stating two things that are somewhat antithetical:
1: We have real-time search and
2: want to merge (and optimize) its indexes into one

Needing to merge indexes implies (to me at least) that
you're not really doing NRT processing as docs in the batch
you're merging into your collection aren't searchable, thus not NRT.

I'm probably missing something obvious in your problem statement

The MapReduceIndexerTool probably doesn't quite do what you want
as its purpose is to add documents to the index and merge at the end...

You might get some value from the core admin API MERGEINDEXES call:
https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-MERGEINDEXES

But you have to be careful in a sharded situation to merge exactly
correctly. Plus,
merging indexes does NOT replace documents with a particular
uniqueKey that happens to be both in the source and dest indexes.

I wouldn't worry too much about optimization, despite its name it's
largely irrelevant at this point
unless you have a bunch of deleted documents in your index.

Best,
Erick


On Thu, Apr 16, 2015 at 4:14 AM, Norgorn lsunnyd...@mail.ru wrote:
 Is there a ready-to-use tool to merge existing indexes in map-reduce?
 We have real-time search and want to merge (and optimize) its indexes into
 one, so we don't need to build index in Map-Reduce, but only merge it.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Merge-indexes-in-MapReduce-tp4200106.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Differentiating user search term in Solr

On 4/16/2015 10:10 AM, Steven White wrote:
 I don't follow what the f parameter is.  Do you have a link where I can
 read more about it?  I found this
 https://wiki.apache.org/solr/HighlightingParameters and
 https://wiki.apache.org/solr/SimpleFacetParameters but im not sure this is
 what you mean (I'm not doing highlighting for faceting).

It looks like this isn't going to work.  I just tried it on my index.

To see the reasoning behind what I was suggesting, click here:

https://cwiki.apache.org/confluence/display/solr/Other+Parsers

And then click on Term Query Parser in the third column of the list at
the top of that page.

The syntax for the localparams on this one is {!term f=field}querytext
... so I was hoping that f would work as a URL parameter, but from the
test I just did on Solr 4.9.1, that's not the case.

Thanks,
Shawn

Re: 1:M connectivity

You say the SolrCloud API. Not entirely sure what that is, do you
mean the post.jar tool?

Because to get much more scalable throughput, you probably want to use SolrJ and
the CloudSolrServer class. That class takes a connection to Zookeeper and
does the right thing.

Best,
Erick

On Thu, Apr 16, 2015 at 7:19 AM, Oded Sofer odedso...@yahoo.com.invalid wrote:
 Given that the index are always routed to specific machine, is there a way to 
 avoid connectivity from the node to all other node.
 We are using Solr 4.10; the Add/Update Index uses SolrCloud API and always 
 added to the node that get API request for add-index (i.e., we are sending 
 the add index to the appropriate node that should get it).

Re: SolrCloud - Collection Browsing

Check that your config has a valid path to the velocity contrib. You
should see something like

lib dir=${solr.install.dir:../../..}/contrib/velocity/lib regex=.*\.jar /

(from Solr 4.10). and you should also see the indicated file on each
of your Solr nodes.

What's the full stack BTW? I'm expecting something like a class not
found error somewhere
down in the stack.

Best,
Erick

On Thu, Apr 16, 2015 at 3:21 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:
 Hi,

 I have setup a SolrCloud on 3 machines - machine1, machine2 and machine3.
 The DirectoryFactory used is HDFS where the collection index data is stored
 in HDFS within a Hadoop cluster.

 SlorCloud has been setup successfully and everything looks fine so far. I
 have uploaded the default configuration i.e. the conf folder under
 example/collection1 folder under the solr installation directory into
 Zookeeper. Essentially, I have uploaded the default configuration into
 Zookeeper.

 Now when I log in to Solr Admin using http://machine1:8983/solr/admin, I am
 able to see the SolrAdmin page and when I click on Cloud, I could see all
 the shards and replications properly in the browser.

 However, the issue comes when I try to open the page
 http://machine1:8983/solr/mycollection/browse. I am seeing a HTTP 500 lazy
 loading error. This looks like a trivial mistake somewhere as the
 collection is setup fine and everything works normal. However, when I
 browse the collection, this error occurs. Even when I open
 http://machine1:8983/solr/mycollection/query I am getting the json response
 properly with numFound as 0

 I was expecting similar behavior like how the /browse request provides the
 Solritas page.

 Note: I haven't changed any of the configuration in the conf directory.
 Should I modify solrconfig.xml to have a RequestHandler for
 /mycollection/browse or the default one be sufficient?

 Can someone provide some pointers please to get this issue resolved?

 Thanks  Regards
 Vijay

 --
 The contents of this e-mail are confidential and for the exclusive use of
 the intended recipient. If you receive this e-mail in error please delete
 it from your system immediately and notify us either by e-mail or
 telephone. You should not copy, forward or otherwise disclose the content
 of the e-mail. The views expressed in this communication may not
 necessarily be the view held by WHISHWORKS.

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]

If you use pdftotext with a simple fork/exec per document, you will get about 5 
MB/s throughput on a single AMD x86_64.   Much of that is because of the 
fork/exec.   I suggest that you use HTML output and UTF-8 encoding  for the 
PDF, because that way you can get title/keywords and such as http meta keywords.

If you have the appetite for something truly great, try:
 - Socket server listening for parsing requests
 - pass off accept() sockets to pre-forked children
 - in the children, use vfork, rather than fork
 -  tmpfs for outputted HTML documents
 - Tempting to implement using mod_perl and httpd, at least to me.

-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at] 
Sent: Thursday, April 16, 2015 7:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Hi Vijay,

I know the this road too well :-)

For PDF you can fallback to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)

If you start command line tools from your JVM please have a look at 
commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never ever 
successfully all real-world PDFs and cater for that fact in your requirements 
:-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:
 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as 
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word 
 documents are getting parsed properly and indexed into Solr. However, 
 a minority of them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents 
 can be indexed. The business requirement we have is to index all the 
 documents.
 However, if a small percentage of them fails, not sure what other ways 
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

 There's quite a discussion here:
 https://issues.apache.org/jira/browse/SOLR-7137

 But, I personally am not a huge fan of pushing all the work on to 
 Solr, in a production environment the Solr server is responsible for 
 indexing, parsing the docs through Tika, perhaps searching etc. This 
 doesn't scale all that well.

 So an alternative is to use SolrJ with Tika, which is totally 
 independent of what version of Tika is on the Solr server. Here's an 
 example.

 http://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy 
 vijaya.bhoomire...@whishworks.com wrote:
 Thanks everyone for the responses. Now I am able to index PDF 
 documents successfully. I have implemented manual extraction using 
 Tika's
 AutoParser
 and PDF functionality is working fine. However,  the error with some 
 MS office word documents still persist.

 The error message is java.lang.IllegalArgumentException: This 
 paragraph
 is
 not the first one in the table which will eventually result in
 Unexpected
 RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

 Upon some reading, it looks like its a bug with Tika 1.5 and seems 
 to
 have
 been fixed with Tika 1.6 (
 https://issues.apache.org/jira/browse/TIKA-1251 ).
 I am new to Solr / Tika and hence wondering whether I can change the 
 Tika library alone to v1.6 without impacting any of the libraries 
 within Solr 4.10.2? Please let me know your response and how to get 
 away with this issue.

 Many thanks in advance.

 Thanks  Regards
 Vijay


 On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:

 Vijay,

 You could try different excel files with different formats to rule 
 out
 the
 issue is with TIKA version being used.

 Thanks
 Murthy

 On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes 
 trhodes...@gmail.com
 wrote:

 Perhaps the PDF is protected and the content can not be extracted?

 i have an unverified suspicion that the tika shipped with solr 
 4.10.2
 may
 not support some/all office 2013 document formats.





 On 4/14/2015 8:18 PM, Jack Krupansky wrote:

 Try doing a manual extraction request directly to Solr (not via
 SolrJ)
 and
 use the extractOnly option to see if the content is actually
 extracted.

 See:
 https://cwiki.apache.org/confluence/display/solr/
 Uploading+Data+with+Solr+Cell+using+Apache+Tika

 Also, some PDF files actually have the content as a bitmap image, 
 so
 no
 text is extracted.


 -- Jack Krupansky

 On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi 
 Reddy
 
 vijaya.bhoomire...@whishworks.com wrote:

   Hi,

 I am trying to index PDF and Microsoft Office files (.doc, 
 .docx,
 .ppt,
 .pptx, .xlx, and .xlx) files into Solr. I am facing the 
 following
 issues.
 Request to please let me know what is going wrong with the 
 indexing process.

 I am using solr 4.10.2 and using the default example server
 configuration
 that

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]

Indeed.   Another solution is to purchase ABBYY or Nuance as a server, and have 
them do that work.
You will even get OCR.Both offer a Linux SDK.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, April 16, 2015 7:56 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing PDF and MS Office files

+1

:)

PS: one more thing - please, tell your management that you will never 
ever successfully all real-world PDFs and cater for that fact in your 
requirements :-)

Re: Differentiating user search term in Solr

I don't follow what the f parameter is.  Do you have a link where I can
read more about it?  I found this
https://wiki.apache.org/solr/HighlightingParameters and
https://wiki.apache.org/solr/SimpleFacetParameters but im not sure this is
what you mean (I'm not doing highlighting for faceting).

Thanks

Steve

On Thu, Apr 16, 2015 at 11:54 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 9:37 AM, Steven White wrote:
  What is term in the defType=term, do you mean the raw word term or
  something else?  Because I tried that too in two different ways:

 Oops.  I forgot that the term query parser (that's what term means --
 the name of the query parser) requires that you specify the field you
 are searching on, so that would be incomplete.  Try also setting the f
 parameter to the field that you want to search.  I will not be surprised
 if that doesn't work, though.

 Thanks,
 Shawn

Re: check If I am Still Leader

bq:  I don't use replication so why does it has to check who is the leader

Because the doc must be routed to the correct shard, and the shard leader
is the machine that coordinates the indexing for that shard.

I really question whether this is a fruitful course for you to take. What
specific problems are you trying to solve here? Because trying to take control
at this level really shouldn't be done unless and until you have a problem
that's causing you grief, it's just a waste of energy until then IMO.

Best,
Erick

On Thu, Apr 16, 2015 at 7:59 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 4/16/2015 7:42 AM, Adir Ben Ami wrote:
 I have not mentioned before that the index are always routed to specific 
 machine.
 Is there a way to avoid connectivity from the node to all other nodes?

 That capability has been added in Solr 5.1.0.

 https://issues.apache.org/jira/browse/SOLR-6832

 Thanks,
 Shawn

Re: Differentiating user search term in Solr

On 4/16/2015 10:18 AM, Shawn Heisey wrote:
 On 4/16/2015 10:10 AM, Steven White wrote:
 I don't follow what the f parameter is.  Do you have a link where I can
 read more about it?  I found this
 https://wiki.apache.org/solr/HighlightingParameters and
 https://wiki.apache.org/solr/SimpleFacetParameters but im not sure this is
 what you mean (I'm not doing highlighting for faceting).
 It looks like this isn't going to work.  I just tried it on my index.

I filed an enhancement issue.  It might never happen, but it's in the
system.

https://issues.apache.org/jira/browse/SOLR-7410

Thanks,
Shawn

Re: Differentiating user search term in Solr

Thanks for trying Shawn.

Looks like I have to escape the string on my client side (this isn't a
clean design and can lead to errors if not all reserved tokens are not
escaped).

I hope folks from @dev are reading this and consider adding a parameter to
tell Solr the text is raw-text.

Steve

On Thu, Apr 16, 2015 at 12:18 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 10:10 AM, Steven White wrote:
  I don't follow what the f parameter is.  Do you have a link where I can
  read more about it?  I found this
  https://wiki.apache.org/solr/HighlightingParameters and
  https://wiki.apache.org/solr/SimpleFacetParameters but im not sure
 this is
  what you mean (I'm not doing highlighting for faceting).

 It looks like this isn't going to work.  I just tried it on my index.

 To see the reasoning behind what I was suggesting, click here:

 https://cwiki.apache.org/confluence/display/solr/Other+Parsers

 And then click on Term Query Parser in the third column of the list at
 the top of that page.

 The syntax for the localparams on this one is {!term f=field}querytext
 ... so I was hoping that f would work as a URL parameter, but from the
 test I just did on Solr 4.9.1, that's not the case.

 Thanks,
 Shawn

Re: How can I temporarily detach node from SolrCloud?

bq: it down will either reduce your result set or cause queries to
return an error

Setting shards.tolerant=true will reduce your result set. If you don't set that
and all replicas of a shard are down, you'll get an error.

And indexing won't work if all the replicas for a shard are down.

Best,
Erick


On Thu, Apr 16, 2015 at 7:46 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 4/16/2015 8:27 AM, Oded Sofer wrote:
 How can I detach node from SolrCloud (temporarily for maintenance and such 
 and attach it back after some time). We are using SolrCloud 4.10.0; One 
 Collection, and Shard per node.
 The add-index is routed to specific machine base on our customize routing 
 logic (kind of hard-coded)

 I assume this is just one replica out of multiple ... if that's the
 case, just shut the node down, do your maintenance, and bring it back
 online.  SolrCloud will automatically make sure the index replica(s) on
 the node are brought up to date to match the others.

 If it's not one replica of multiple (that is, if it has the only copy of
 one or more shards), then shutting it down will either reduce your
 result set or cause queries to return an error, not sure which.

 Thanks,
 Shawn

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy

For MS Word documents, one common pattern for all failed documents I
noticed is that all of them contain embedded images (like scanned signature
images embedded into the documents. These documents are much like some
letterheads where someone scanned the signature image and then embedded
into the document along with the text) with in the documents.

For other documents which completed successfully, no images were present.
Just wondering if these are causing the issue.

Thanks Regards
Vijay

On 16 April 2015 at 12:58, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:

Thanks Tim.

I shall raise a Jira with the stack trace information.

Thanks Regards
Vijay

On 16 April 2015 at 12:54, Allison, Timothy B. talli...@mitre.org wrote:

This sounds like a Tika issue, let's move discussion to that list.

Tika is not perfect and it will fail on some files, but we are always
working to improve it.

Best,

Tim

Thanks Allison.

These files normally open in Adobe Reader and MS Office tools.

Thanks Regards
Vijay

On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org
wrote:

I entirely agree with Erick -- it is best to isolate Tika in its own jvm
if you can -- bad things can happen if you don't [1] [2].

Erick's blog on SolrJ is fantastic. If you want to have Tika parse
embedded documents/attachments, make sure to set the parser in the
ParseContext before parsing:

ParseContext context = new ParseContext();
//add this line:
context.set(Parser.class, _autoParser)
InputStream input = new FileInputStream(file);

Tika 1.8 is soon to be released. If that doesn't fix your problems,
please submit stacktraces (and docs, if possible) to the Tika jira, and
we'll try to make the fixes.

Cheers,

Tim

[1]

http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
[2]

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority
of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents
can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?

Thanks Regards
Vijay

On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
wrote:

There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But, I personally am not a huge fan of pushing all the work on to
Solr,
in
a
production environment the Solr server is responsible for indexing,
parsing the
docs through Tika, perhaps searching etc. This doesn't scale all that
well.

So an alternative is to use SolrJ with Tika, which is totally
independent
of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:
Thanks everyone for the responses. Now I am able to index PDF
documents
successfully. I have implemented manual extraction using Tika's
AutoParser
and PDF functionality is working fine. However, the error with
some MS
office word documents still persist.

The error message is java.lang.IllegalArgumentException: This
paragraph
is
not the first one in the table which will

Re: Indexing PDF and MS Office files

2015-04-16 Thread Charlie Hull


On 16/04/2015 12:53, Siegfried Goeschl wrote:

Hi Vijay,

I know the this road too well :-)

For PDF you can fallback to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)


Here's some file extractors we built a while ago:
https://github.com/flaxsearch/flaxcode/tree/master/flax_filters
You might find them useful: they use a number of external programs 
including pdf2text and headless Open Office.


Cheers

Charlie


If you start command line tools from your JVM please have a look at
commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never
ever successfully all real-world PDFs and cater for that fact in your
requirements :-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents
can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks  Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
wrote:


There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But, I personally am not a huge fan of pushing all the work on to
Solr, in
a
production environment the Solr server is responsible for indexing,
parsing the
docs through Tika, perhaps searching etc. This doesn't scale all that
well.

So an alternative is to use SolrJ with Tika, which is totally
independent
of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:

Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's

AutoParser

and PDF functionality is working fine. However,  the error with some MS
office word documents still persist.

The error message is java.lang.IllegalArgumentException: This
paragraph

is

not the first one in the table which will eventually result in

Unexpected

RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

Upon some reading, it looks like its a bug with Tika 1.5 and seems to

have

been fixed with Tika 1.6 (

https://issues.apache.org/jira/browse/TIKA-1251 ).

I am new to Solr / Tika and hence wondering whether I can change the
Tika
library alone to v1.6 without impacting any of the libraries within
Solr
4.10.2? Please let me know your response and how to get away with this
issue.

Many thanks in advance.

Thanks  Regards
Vijay


On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:


Vijay,

You could try different excel files with different formats to rule out

the

issue is with TIKA version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
wrote:


Perhaps the PDF is protected and the content can not be extracted?

i have an unverified suspicion that the tika shipped with solr 4.10.2

may

not support some/all office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:


Try doing a manual extraction request directly to Solr (not via

SolrJ)

and

use the extractOnly option to see if the content is actually

extracted.


See:
https://cwiki.apache.org/confluence/display/solr/
Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so

no

text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy



vijaya.bhoomire...@whishworks.com wrote:

  Hi,


I am trying to index PDF and Microsoft Office files (.doc, .docx,

.ppt,

.pptx, .xlx, and .xlx) files into Solr. I am facing the following

issues.

Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server

configuration

that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.*

in

the
Solr Query console, metadata information is displayed properly.

However,

the PDF content field is empty. This is happening for all PDF files

I

have
tried. I have tried with some proprietary files, PDF eBooks etc.

Whatever

be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect

and

the
extracted content is visible in the query console. However, for

others, I

see the below error message during the indexing process.

*Exception in thread

How can I temporarily detach node from SolrCloud?

2015-04-16 Thread Oded Sofer

How can I detach node from SolrCloud (temporarily for maintenance and such and 
attach it back after some time). We are using SolrCloud 4.10.0; One Collection, 
and Shard per node. 
The add-index is routed to specific machine base on our customize routing logic 
(kind of hard-coded)

Re: Differentiating user search term in Solr

Thanks Shawn.

I cannot use escapeQueryChars method because my app interacts with Solr via
REST.

The summary of your email is: client's must escape search string to prevent
Solr from failing.

It would be a nice addition to Solr to provide a new query parameter that
tells it to treat the query text as literal text. Doing so, means you
remove the burden placed on clients to understand and escape reserved Solr
/ Lucene tokens.

Steve

On Wed, Apr 15, 2015 at 7:18 PM, Shawn Heisey apa...@elyograg.org wrote:

On 4/15/2015 3:54 PM, Steven White wrote:
Hi folks,

If a user types in the search box (without quotes): {!q.op=AND df=text
solr sys and I take that text and build the URL like so:

http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=true

This will fail with Expected identifier because it is not a valid Solr
text.

That isn't valid syntax for the lucene query parser ... the localparams
are not closed (it would require a } character), and after the
localparams there would need to be some additional text.

My question is this: is there a flag I can send to Solr with the URL
telling it to treat what's in q as raw text vs. having it to process it
as a Solr syntax? If not, than it means I have to escape all Solr
reserved
characters and words. If so, where can I find the complete list? Also,
what happens when a new reserved characters or word is added to Solr down
the road? It means I have to upgrade my application too, which is
something I would like to avoid.

One way to treat the entire input as literal text is to use the terms
query parser ... but that requires the localparams syntax, and I do not
know exactly what is going to happen if you use a query string that
itself is localparams syntax -- {! other params} ... so escaping is
probably safer.

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser

The other way to handle it is to escape every special character with a
backslash. The escapeQueryChars method in SolrJ is always kept up to
date, and can escape every special character.

http://lucene.apache.org/solr/4_10_3/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29

The javadoc for that method points to the queryparser syntax for more
info on characters that need escaping. Scroll to the very end of this
page:

http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true

That page lists || and rather than just the single characters | and
... the escapeQueryChars method in SolrJ will escape both characters, as
it only works at the character level, not the string level.

If you want the *spaces* in your query to be treated literally also, you
must escape them too. The escapeQueryChars method I've mentioned will
NOT escape spaces.

Note that this does not cover URL escaping -- the character must be
sent as %26 or the servlet container will treat it as a special
character, before it even gets to Solr.

Thanks,
Shawn

RE: check If I am Still Leader

2015-04-16 Thread Adir Ben Ami








I have not mentioned before that the index are always routed to specific 
machine.
Is there a way to avoid connectivity from the node to all other nodes? 



 From: adi...@hotmail.com
 To: solr-user@lucene.apache.org
 Subject: check If I am Still Leader
 Date: Thu, 16 Apr 2015 16:08:15 +0300
 
 
 Hi,
 
 I am using Solr 4.10.0 with tomcat and embedded Zookeeper.
 I use SolrCloud in my system.
 
 Each Shard machine try to reach/connect with other cluster machines in order 
 to index the document ,it just checks if it is still the leader.
  I don't use replication so why does it has to check who is the leader?
 How can I bypass this constraint and make my solrcloud not use 
 ClusterStateUpdater.checkIfIamStillLeader when i am indexing?
 
 Thanks,
 Adir.

1:M connectivity

2015-04-16 Thread Oded Sofer

Given that the index are always routed to specific machine, is there a way to 
avoid connectivity from the node to all other node. 
We are using Solr 4.10; the Add/Update Index uses SolrCloud API and always 
added to the node that get API request for add-index (i.e., we are sending the 
add index to the appropriate node that should get it).

Re: Differentiating user search term in Solr

defType didn't work:


http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=truedefType=lucene

Gave me error:

org.apache.solr.search.SyntaxError: Expected identifier at pos 27
str='{!q.op=AND df=text solr sys'

Is my use of defType correct?

Steve

On Thu, Apr 16, 2015 at 9:15 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 7:09 AM, Steven White wrote:
  I cannot use escapeQueryChars method because my app interacts with Solr
 via
  REST.
 
  The summary of your email is: client's must escape search string to
 prevent
  Solr from failing.
 
  It would be a nice addition to Solr to provide a new query parameter that
  tells it to treat the query text as literal text.  Doing so, means you
  remove the burden placed on clients to understand and escape reserved
 Solr
  / Lucene tokens.

 That's a good idea, although we might already have that.

 I wonder what happens if you include defType=term with your request?
 That works for edismax, it might work for other query parsers, at least
 on the q parameter.

 Thanks,
 Shawn

custom search component on solrcloud

2015-04-16 Thread Robust Links

Hi

Apologize for sending this again. I am trying to port my none solrcloud
custom search handler to a solrcloud one. I have read the
WritingDistibutedSearchComponents
http://wiki.apache.org/solr/WritingDistributedSearchComponents wiki page
and looked at Terms and Querycomponent codes but the control flow of
execution is still fuzzy (even given the “distributed algorithm”
description).

Concretely, I have a none solrcloud algorithm that given a sequence of
tokens T would

1- split T into single tokens

2- foreach token t_i

get all the DocList for t_i by executing rb.req.getSearcher().getDocList in
process() method of the custom search component

3- do some magic on the collection of doclists

My question is how can i

1) do the splitting (step 1 above) in a single shard, and

2) distribute the getDocList for each token t_i to all shards

3) wait till i have all the doclists from all shards, then

4) do something with the results, in the original calling shard (step 1
above).


Thank you for your help

Re: Spurious _version_ conflict?

2015-04-16 Thread Chris Hostetter


: I notice that the expected value in the error message matches both what 
: I pass in and the index contents.  But the actual value in the error 
: message is different only in the last (low order) two digits.  
: Consistently.

what does your client code look like?  Are you sure you aren't being bit 
by a JSON parsing library that can't handle long values and winds up 
truncating them?

https://issues.apache.org/jira/browse/SOLR-6364



-Hoss
http://www.lucidworks.com/

Re: Differentiating user search term in Solr

2015-04-16 Thread Chris Hostetter

: The summary of your email is: client's must escape search string to prevent
: Solr from failing.
:
: It would be a nice addition to Solr to provide a new query parameter that
: tells it to treat the query text as literal text. Doing so, means you
: remove the burden placed on clients to understand and escape reserved Solr
: / Lucene tokens.

i'm a little lost as to what exactly you want to do here -- but i'm going
to focus on your thesis statement here, and assume that you want to
search on a literal piece of text and you don't want to have to worry
about escaping any characters and you don't wantsolr to treat any part of
the query string as special.

the only way something like that works is if you only want to search a
single field -- searching multiple fields, searching multiple clauses,
etc... none of those types of options make sense in this context.

people have already mentioned the term parser -- which is fine ifyou
want to serach for exactly one literal term, but as a more generally
solution, what people usualy want, is the field parser -- which works
better with TextFields in general...

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FieldQueryParser

Just like the comment you've seen about the term parser needing an f
localparam to specify the field, the same is true for the field parser.
but variable refrences make this trivial to specify -- instead of using
the full {!field f=myfield}Foo Bar syntax in your q param, you can use
an alternate param (qq is common in many examples) for the raw data from
the user...

q={!field f=myfield v=$qq} qq=whatever your usertypes

https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries

-Hoss
http://www.lucidworks.com/

Re: Solr 5.x deployment in production

Thanks Karl.

In my case, I have to deploy Solr on Windows, AIX, and Linux (all server
edition).  We are a WebSphere shop, moving away from it means I have to
deal with politics and culture.

For Windows, I cannot use NSSM so I have to figure a solution managing Solr
(at least start-up and shutdown).  If anyone has experience in this area
(now that Solr is not in a WAS profile managed by Windows services) and can
share your experience, please do.  Thanks.

Steve

On Thu, Apr 16, 2015 at 3:49 PM, Karl Kildén karl.kil...@gmail.com wrote:

 I asked a very similar question recently. You should switch to using the
 package as is and forget that it contains a .war. The war is now an
 internal component. Also switch to the new script for startup etc.

 I have seen several disappointed users that disagree with this decision but
 I assume the project now has more freedom in the future and also more
 alignment and focus on one experience.

 I did my own thing with NSSM because we use windows and I am satisfied.

 On 16 April 2015 at 21:36, Steven White swhite4...@gmail.com wrote:

  Hi folks,
 
  With Solr 5.0, the WAR file is deprecated and I see Jetty is included
 with
  Solr.  What if I have my own Web server into which I need to deploy Solr,
  how do I go about doing this correctly without messing things up and
 making
  sure Solr works?  Or is this not recommended and Jetty is the way to go,
 no
  questions asked?
 
  Thanks
 
  Steve

Re: 1:M connectivity

2015-04-16 Thread Oded Sofer

Right, we are using that. 
The issue is the firewall setting needed for the cloud. We do not want to open 
all nodes to all others nodes. However, we found that add-index to a specific 
node tries to access all other nodes though we set it to index locally on that 
node only. 


On Apr 16, 2015 7:19 PM, Erick Erickson erickerick...@gmail.com wrote:

 You say the SolrCloud API. Not entirely sure what that is, do you 
 mean the post.jar tool? 

 Because to get much more scalable throughput, you probably want to use SolrJ 
 and 
 the CloudSolrServer class. That class takes a connection to Zookeeper and 
 does the right thing. 

 Best, 
 Erick 

 On Thu, Apr 16, 2015 at 7:19 AM, Oded Sofer odedso...@yahoo.com.invalid 
 wrote: 
  Given that the index are always routed to specific machine, is there a way 
  to avoid connectivity from the node to all other node. 
  We are using Solr 4.10; the Add/Update Index uses SolrCloud API and always 
  added to the node that get API request for add-index (i.e., we are sending 
  the add index to the appropriate node that should get it).

Spurious _version_ conflict?

2015-04-16 Thread Reitzel, Charles

Hi All,

I have been getting intermittent 409 conflict responses to updates.  I check 
and double-check that the _version_ I am passing in matches the current value 
in the index.

I notice that the expected value in the error message matches both what I pass 
in and the index contents.  But the actual value in the error message is 
different only in the last (low order) two digits.   Consistently.

I noticed a similar report a while back:
http://lucene.472066.n3.nabble.com/Version-Conflict-on-Atomic-Update-td4083587.html

Any  thoughts?

Thanks,
Charlie

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*

Re: generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Vishal Swaroop

Thanks Kaushik  Erick..

Though I can populate uuid by using combination of fields but need to
change the type to string else it throws Invalid UUID String
field name=uuid type=string indexed=true stored=true
required=true multiValued=false/

a) I will have ~80 millions records and wondering if performance might be
issue
b) So, during update I can still use combination of fields i.e. uuid ?

On Thu, Apr 16, 2015 at 2:44 PM, Erick Erickson erickerick...@gmail.com
wrote:

 This seems relevant:


 http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid

 Best,
 Erick

 On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote:
  You seem to have defined the field, but not populating it in the query.
 Use
  a combination of fields to come up with a unique id that can be assigned
 to
  uuid. Does that make sense?
 
  Kaushik
 
  On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com
  wrote:
 
  How to generate uuid/ id (maybe in data-config.xml...) for table which
 do
  not have any primary key.
 
  Scenario :
  Using DIH I need to import data from database but table does not have
 any
  primary key
  I do have uuid defined in schema.xml and is
  field name=uuid type=uuid indexed=true stored=true
 required=true
  multiValued=false/
  uniqueKeyuuid/uniqueKey
 
  data-config.xml
  ?xml version=1.0 encoding=UTF-8 ?
  dataConfig
  dataSource
batchSize=2000
name=test
type=JdbcDataSource
driver=oracle.jdbc.OracleDriver
url=jdbc:oracle:thin:@ldap:
user=myUser
password=pwd/
  document
  entity name=test_entity
docRoot=true
dataSource=test
query=select name, age from test_user
  /entity
  /document
  /dataConfig
 
  Error : Document is missing mandatory uniqueKey field: uuid

Re: generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Vishal Swaroop

Just wondering if there is a way to generate uuid/ id in data-config
without using combination of fields in query...

data-config.xml
?xml version=1.0 encoding=UTF-8 ?
dataConfig
dataSource
  batchSize=2000
  name=test
  type=JdbcDataSource
  driver=oracle.jdbc.OracleDriver
  url=jdbc:oracle:thin:@ldap:
  user=myUser
  password=pwd/
document
entity name=test_entity
  docRoot=true
  dataSource=test
  query=select name, age from test_user
/entity
/document
/dataConfig

On Thu, Apr 16, 2015 at 3:18 PM, Vishal Swaroop vishal@gmail.com
wrote:

 Thanks Kaushik  Erick..

 Though I can populate uuid by using combination of fields but need to
 change the type to string else it throws Invalid UUID String
 field name=uuid type=string indexed=true stored=true
 required=true multiValued=false/

 a) I will have ~80 millions records and wondering if performance might be
 issue
 b) So, during update I can still use combination of fields i.e. uuid ?

 On Thu, Apr 16, 2015 at 2:44 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 This seems relevant:


 http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid

 Best,
 Erick

 On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote:
  You seem to have defined the field, but not populating it in the query.
 Use
  a combination of fields to come up with a unique id that can be
 assigned to
  uuid. Does that make sense?
 
  Kaushik
 
  On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com
  wrote:
 
  How to generate uuid/ id (maybe in data-config.xml...) for table which
 do
  not have any primary key.
 
  Scenario :
  Using DIH I need to import data from database but table does not have
 any
  primary key
  I do have uuid defined in schema.xml and is
  field name=uuid type=uuid indexed=true stored=true
 required=true
  multiValued=false/
  uniqueKeyuuid/uniqueKey
 
  data-config.xml
  ?xml version=1.0 encoding=UTF-8 ?
  dataConfig
  dataSource
batchSize=2000
name=test
type=JdbcDataSource
driver=oracle.jdbc.OracleDriver
url=jdbc:oracle:thin:@ldap:
user=myUser
password=pwd/
  document
  entity name=test_entity
docRoot=true
dataSource=test
query=select name, age from test_user
  /entity
  /document
  /dataConfig
 
  Error : Document is missing mandatory uniqueKey field: uuid

Solr 5.x deployment in production

Hi folks,

With Solr 5.0, the WAR file is deprecated and I see Jetty is included with
Solr.  What if I have my own Web server into which I need to deploy Solr,
how do I go about doing this correctly without messing things up and making
sure Solr works?  Or is this not recommended and Jetty is the way to go, no
questions asked?

Thanks

Steve

Re: Solr 5.x deployment in production

2015-04-16 Thread Karl Kildén

I asked a very similar question recently. You should switch to using the
package as is and forget that it contains a .war. The war is now an
internal component. Also switch to the new script for startup etc.

I have seen several disappointed users that disagree with this decision but
I assume the project now has more freedom in the future and also more
alignment and focus on one experience.

I did my own thing with NSSM because we use windows and I am satisfied.

On 16 April 2015 at 21:36, Steven White swhite4...@gmail.com wrote:

 Hi folks,

 With Solr 5.0, the WAR file is deprecated and I see Jetty is included with
 Solr.  What if I have my own Web server into which I need to deploy Solr,
 how do I go about doing this correctly without messing things up and making
 sure Solr works?  Or is this not recommended and Jetty is the way to go, no
 questions asked?

 Thanks

 Steve

Re: Indexing PDF and MS Office files

2015-04-16 Thread Walter Underwood

Turning PDF back into a structured document is like trying to turn hamburger 
back into a cow.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. talli...@mitre.org wrote:

 +1 
 
 :)
 
 PS: one more thing - please, tell your management that you will never 
 ever successfully all real-world PDFs and cater for that fact in your 
 requirements :-)

SolrCloud Core Reload

2015-04-16 Thread Vincenzo D'Amore

Hi all,

I have a solrcloud cluster with 3 server and there are many cores.
Using the SolrCloud UI Admin Core, if I execute core optimize (or
reload), all the core in the cluster will be optimized or reloaded? or
only the selected core?.

Best regards,
Vincenzo

Re: Range facets in sharded search

2015-04-16 Thread Tomás Fernández Löbbe

This looks like a bug. The logic to merge range facets from shards seems to
only be merging counts, not the first level elements.
Could you create a Jira?

On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote:

 I am seeing some some odd behavior with range facets across multiple
 shards. When querying each node directly with distrib=false the facet
 returned matches what is expected. When doing the same query against the
 collection and it spans the two shards, the facet after and between buckets
 are wrong.


 I can re-create a similar problem using the out of the box example scripts
 and data. I am running on Windows and tested both Solr 5.0.0 and 5.1.0.
 This is the steps to reproduce:


 c:\solr-5.1.0\solr -e cloud

 These are the selections I made:


 (specify 1-4 nodes) [2]: 2
 Please enter the port for node1 [8983]: 8983
 Please enter the port for node2 [7574]: 7574
 Please provide a name for your new collection: [gettingstarted]
 gettingstarted
 How many shards would you like to split gettingstarted into? [2] 2
 How many replicas per shard would you like to create? [2] 1
 Please choose a configuration ...  [data_driven_schema_configs]
 sample_techproducts_configs


 I then posted some of the sample XMLs:

 C:\solr-5.1.0\example\exampledocs java -Dc=gettingstarted -jar post.jar
 vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml,
 monitor2.xml,mp500.xml, sd500.xml


 This first query is against node1 with distrib=false:


 http://localhost:8983/solr/gettingstarted/select/?q=*:*wt=jsonindent=truedistrib=falsefacet=truefacet.range=pricef.price.facet.range.start=0.00f.price.facet.range.end=100.00f.price.facet.range.gap=20f.price.facet.range.other=alldefType=edismaxq.op=AND

 There are 7 Results (results ommited).
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,0,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 This second query is against node2 with distrib=false:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*wt=jsonindent=truedistrib=falsefacet=truefacet.range=pricef.price.facet.range.start=0.00f.price.facet.range.end=100.00f.price.facet.range.gap=20f.price.facet.range.other=alldefType=edismaxq.op=AND

 7 Results (one product does not have a price):
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,0],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:4,
 between:2}},


 Finally querying the entire collection:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*wt=jsonindent=truefacet=truefacet.range=pricef.price.facet.range.start=0.00f.price.facet.range.end=100.00f.price.facet.range.gap=20f.price.facet.range.other=alldefType=edismaxq.op=AND

 14 results (one without a price range):
 facet_ranges:{
   price:{
 counts:[
   0.0,2,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 Notice that both the after and the between are wrong here. The actual
 buckets do correctly represent the right values but I would expect
 between to be 5 and after to be 13.


 There appears to be a recently fixed issue (
 https://issues.apache.org/jira/browse/SOLR-6154) with range facet in
 distributed queries but it was related to buckets not always appearing with
 mincount=1 for the field. This looks like it is a different problem.


 Anyone have any suggestions or notice anythign wrong with my query
 parameters? I can open a Jira ticket but wanted to run it by the larger
 audience first to see if I am missing anything obvious.


 Thanks,

 Will

SolrJ Exceptions

2015-04-16 Thread Bryan Bende

I'm trying to identify the difference between an exception when Solr is in
a bad state/down vs. when it is up but an invalid request was made (maybe
some bad data sent in).

The JavaDoc for SolrRequest process() says:


*@throws SolrServerException if there is an error on the Solr server@throws
IOException if there is a communication error*

So I expected IOException when Solr was down, but it looks like it actually
throws a SolrServerException which has a cause of an IOException.

I'm also not sure how SolrException fits into all of this...

Is anyone familiar with when to generally expect these types of exceptions?

I'm interested in both cloud and stand-alone scenarios, and using Solr 5.0
or 5.1.

Thanks,

Bryan

SolrCloud 4.8.0 upgrade

2015-04-16 Thread Vincenzo D'Amore

Hi All,

I have a SolrCloud cluster with 3 server, I would like to use stats.facet,
but this feature is available only if I upgrade to 4.10.

May I simply redeploy new solr cloud version in tomcat or should reload all
the documents?
There are other drawbacks?

Best regards,
Vincenzo

Re: 5.1 'unique' facet function / calcDistinct

2015-04-16 Thread Yonik Seeley

Thanks for the feedback Levan!
Could you open a JIRA issue for unique() on numeric/date fields?
We don't yet have explicit numeric support for unique() and I think
some changes in Lucene 5 broke treating these fields as strings (i.e.
the ability to retrieve ords).

-Yonik


On Thu, Apr 16, 2015 at 7:46 AM, levanDev levandev9...@gmail.com wrote:
 Hello,

 We are looking at a couple of options for using solr to dynamically calulate
 unique values per field. In testing out Solr 5.1, I've been using the
 unique() facet function:

 http://yonik.com/solr-facet-functions/

 Overall, loving the JSON Facet API, especially the sub-faceting thus far.

 Here's my two part question:

 I. When I use the unique aggregation function on a string field
 (uniqueValues:'unique(myStringField)'), it works as expected, returns the
 number of unique fields. However when I pass in an int -- or date -- field
 (uniqueValues:'unique(myIntField)') the resulting count is 0. The cause
 might be something else, but if it can be replicated by another user, would
 be great to discuss the unique function further -- in our current use-case,
 we have a field where under 20 unique values are present but the values are
 ints.

 II. Is there a way to use the stats.calcdistinct functionality and only
 return the countDistinct portion of the response and not the full list of
 distinct values -- as provided in the distinctValues portion of the
 response. In a field with high cardinality the response size becomes too
 large.

 If there is no such option, could someone point me in the right direction
 for implementing a custom solution?

 Thank you for your time,
 Levan

Re: Differentiating user search term in Solr

Hi Hoss,

Maybe I'm missing something, but I tried this and got 1 hit:


http://localhost:8983/solr/db/select?q=title:(Apache%20Solr%20Notes)fl=id%2Cscore%2Ctitlewt=xmlindent=trueq.op=AND

Than I tried this and got 0 hit:


http://localhost:8983/solr/db/select?q={!field%20f=title%20v=$qq}qq=Apache%20Solr%20Notesfl=id%2Cscore%2Ctitlewt=xmlindent=trueq.op=AND

It looks to me that f with qq is doing phrase search, that's not what I
want.  The data in the field title is Apache Solr Release Notes

I looked over the links you provided and tried out the examples, in each
case if the user-typed-text contains any reserved characters, it will fail
with a syntax error (the exception is when I used f and qq but like I
said, that gave me 0 hit).

If you can give me a concrete example, please do.  My need is to pass to
Solr the text Apache: Solr Notes (without quotes) and get a hit as if I
passed Apache\: Solr Notes ?

Thanks

Steve

On Thu, Apr 16, 2015 at 5:49 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : The summary of your email is: client's must escape search string to
 prevent
 : Solr from failing.
 :
 : It would be a nice addition to Solr to provide a new query parameter that
 : tells it to treat the query text as literal text.  Doing so, means you
 : remove the burden placed on clients to understand and escape reserved
 Solr
 : / Lucene tokens.

 i'm a little lost as to what exactly you want to do here -- but i'm going
 to focus on your thesis statement here, and assume that you want to
 search on a literal piece of text and you don't want to have to worry
 about escaping any characters and you don't wantsolr to treat any part of
 the query string as special.

 the only way something like that works is if you only want to search a
 single field -- searching multiple fields, searching multiple clauses,
 etc... none of those types of options make sense in this context.

 people have already mentioned the term parser -- which is fine ifyou
 want to serach for exactly one literal term, but as a more generally
 solution, what people usualy want, is the field parser -- which works
 better with TextFields in general...


 https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FieldQueryParser

 Just like the comment you've seen about the term parser needing an f
 localparam to specify the field, the same is true for the field parser.
 but variable refrences make this trivial to specify -- instead of using
 the full {!field f=myfield}Foo Bar syntax in your q param, you can use
 an alternate param (qq is common in many examples) for the raw data from
 the user...

 q={!field f=myfield v=$qq}  qq=whatever your usertypes



 https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries


 -Hoss
 http://www.lucidworks.com/

Re: Range facets in sharded search

2015-04-16 Thread Tomás Fernández Löbbe

Should be fixed in 5.2. See https://issues.apache.org/jira/browse/SOLR-7412

On Thu, Apr 16, 2015 at 3:18 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 This looks like a bug. The logic to merge range facets from shards seems
 to only be merging counts, not the first level elements.
 Could you create a Jira?

 On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote:

 I am seeing some some odd behavior with range facets across multiple
 shards. When querying each node directly with distrib=false the facet
 returned matches what is expected. When doing the same query against the
 collection and it spans the two shards, the facet after and between buckets
 are wrong.


 I can re-create a similar problem using the out of the box example
 scripts and data. I am running on Windows and tested both Solr 5.0.0 and
 5.1.0. This is the steps to reproduce:


 c:\solr-5.1.0\solr -e cloud

 These are the selections I made:


 (specify 1-4 nodes) [2]: 2
 Please enter the port for node1 [8983]: 8983
 Please enter the port for node2 [7574]: 7574
 Please provide a name for your new collection: [gettingstarted]
 gettingstarted
 How many shards would you like to split gettingstarted into? [2] 2
 How many replicas per shard would you like to create? [2] 1
 Please choose a configuration ...  [data_driven_schema_configs]
 sample_techproducts_configs


 I then posted some of the sample XMLs:

 C:\solr-5.1.0\example\exampledocs java -Dc=gettingstarted -jar post.jar
 vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml,
 monitor2.xml,mp500.xml, sd500.xml


 This first query is against node1 with distrib=false:


 http://localhost:8983/solr/gettingstarted/select/?q=*:*wt=jsonindent=truedistrib=falsefacet=truefacet.range=pricef.price.facet.range.start=0.00f.price.facet.range.end=100.00f.price.facet.range.gap=20f.price.facet.range.other=alldefType=edismaxq.op=AND

 There are 7 Results (results ommited).
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,0,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 This second query is against node2 with distrib=false:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*wt=jsonindent=truedistrib=falsefacet=truefacet.range=pricef.price.facet.range.start=0.00f.price.facet.range.end=100.00f.price.facet.range.gap=20f.price.facet.range.other=alldefType=edismaxq.op=AND

 7 Results (one product does not have a price):
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,0],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:4,
 between:2}},


 Finally querying the entire collection:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*wt=jsonindent=truefacet=truefacet.range=pricef.price.facet.range.start=0.00f.price.facet.range.end=100.00f.price.facet.range.gap=20f.price.facet.range.other=alldefType=edismaxq.op=AND

 14 results (one without a price range):
 facet_ranges:{
   price:{
 counts:[
   0.0,2,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 Notice that both the after and the between are wrong here. The actual
 buckets do correctly represent the right values but I would expect
 between to be 5 and after to be 13.


 There appears to be a recently fixed issue (
 https://issues.apache.org/jira/browse/SOLR-6154) with range facet in
 distributed queries but it was related to buckets not always appearing with
 mincount=1 for the field. This looks like it is a different problem.


 Anyone have any suggestions or notice anythign wrong with my query
 parameters? I can open a Jira ticket but wanted to run it by the larger
 audience first to see if I am missing anything obvious.


 Thanks,

 Will

generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Vishal Swaroop

How to generate uuid/ id (maybe in data-config.xml...) for table which do
not have any primary key.

Scenario :
Using DIH I need to import data from database but table does not have any
primary key
I do have uuid defined in schema.xml and is
field name=uuid type=uuid indexed=true stored=true required=true
multiValued=false/
uniqueKeyuuid/uniqueKey

data-config.xml
?xml version=1.0 encoding=UTF-8 ?
dataConfig
dataSource
  batchSize=2000
  name=test
  type=JdbcDataSource
  driver=oracle.jdbc.OracleDriver
  url=jdbc:oracle:thin:@ldap:
  user=myUser
  password=pwd/
document
entity name=test_entity
  docRoot=true
  dataSource=test
  query=select name, age from test_user
/entity
/document
/dataConfig

Error : Document is missing mandatory uniqueKey field: uuid

Re: generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Kaushik

You seem to have defined the field, but not populating it in the query. Use
a combination of fields to come up with a unique id that can be assigned to
uuid. Does that make sense?

Kaushik

On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com
wrote:

 How to generate uuid/ id (maybe in data-config.xml...) for table which do
 not have any primary key.

 Scenario :
 Using DIH I need to import data from database but table does not have any
 primary key
 I do have uuid defined in schema.xml and is
 field name=uuid type=uuid indexed=true stored=true required=true
 multiValued=false/
 uniqueKeyuuid/uniqueKey

 data-config.xml
 ?xml version=1.0 encoding=UTF-8 ?
 dataConfig
 dataSource
   batchSize=2000
   name=test
   type=JdbcDataSource
   driver=oracle.jdbc.OracleDriver
   url=jdbc:oracle:thin:@ldap:
   user=myUser
   password=pwd/
 document
 entity name=test_entity
   docRoot=true
   dataSource=test
   query=select name, age from test_user
 /entity
 /document
 /dataConfig

 Error : Document is missing mandatory uniqueKey field: uuid

Re: generate uuid/ id for table which do not have any primary key