facets on external field

2015-04-16 Thread jainam vora
Hi,

I am using an external field for the price field since it changes frequently.
Is it possible to generate facets using an external field? If so, how?

I understand that faceting requires indexing, and external file fields are
not actually indexed.

Is there any solution for this problem?


-- 
Thanks & Regards,
Jainam Vora


Re: Information regarding "This conf directory is not valid" SolrException.

2015-04-16 Thread Shai Erera
I opened SOLR-7408 to track that.

Shai

On Mon, Apr 13, 2015 at 3:31 PM, Bar Weiner weiner@gmail.com wrote:

 After some additional debugging, I think that this issue is caused by a
 possible race condition introduced to ZkController in Solr-5.0.0.

 My concerns are around the unregister(...) function in ZkController.
 In the current code, all cores are traversed and if one of the cores is
 using configLocation, the configLocation variable is cleared so that it is not
 removed from confDirectoryListeners. A possible issue can occur if, after
 the list of cores is fetched, a new core is added. If this new core uses
 the same config, then traversing the original list of cores will not find that the
 configuration is used by another core, and it will be removed from
 confDirectoryListeners even though it is still needed.

 In addition, when adding a watch to a configuration in the watchZKConfDir(..)
 function, no lock is taken on confDirectoryListeners, unlike every other place
 where this map is accessed.

 A possible solution for this issue:
 - Add synchronized (confDirectoryListeners) to watchZKConfDir(..).
 - In unregister(...) function, traverse the list of cores twice. Before the
 first loop, obtain a lock on confDirectoryListeners, then look if any core
 is using configLocation, then remove configLocation from
 confDirectoryListeners if needed. Then the lock should be released. The
 second loop will be used for the rest of the code.
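
 A rough sketch of what I have in mind for unregister(...) (illustrative only,
 not the actual Solr code; the core/config accessors are placeholders):

 private void unregister(String coreName, String configLocation) {
   synchronized (confDirectoryListeners) {
     boolean usedByAnotherCore = false;
     for (CoreDescriptor cd : getCoreDescriptors()) {        // placeholder accessor
       if (!cd.getName().equals(coreName)
           && configLocation.equals(cd.getConfigName())) {   // placeholder accessor
         usedByAnotherCore = true;
         break;
       }
     }
     if (!usedByAnotherCore) {
       confDirectoryListeners.remove(configLocation);
     }
   }
   // ... the rest of the original unregister(...) logic, outside the lock ...
 }

 watchZKConfDir(..) would get the same synchronized (confDirectoryListeners)
 block around its access to the map.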

 I would be glad for any input: is this a real issue, or did I miss something?
 Is the suggested solution valid?

 Thanks,
 Bar



 2015-04-01 18:16 GMT+03:00 Bar Weiner weiner@gmail.com:

  Hi,
 
  I'm working on upgrading a project from solr-4.10.3 to solr-5.0.0.
  As part of our JUnit tests we have a few tests for deleting/creating
  collections. Each test creates/deletes a collection with a different name,
  but they all share the same config in ZK.
  When running these tests in Eclipse everything works fine, but when
  running the same tests through Maven we get the following error, so I
  suspect this is a timing-related issue:
 
  INFO  org.apache.solr.rest.ManagedResourceStorage  – Setting up
  ZooKeeper-based storage for the RestManager with znodeBase:
  /configs/SIMPLE_CONFIG
  INFO  org.apache.solr.rest.ManagedResourceStorage  – Configured
  ZooKeeperStorageIO with znodeBase: /configs/SIMPLE_CONFIG
  INFO  org.apache.solr.rest.RestManager  – Initializing RestManager with
  initArgs: {}
  INFO  org.apache.solr.rest.ManagedResourceStorage  – Reading
  _rest_managed.json using ZooKeeperStorageIO:path=/configs/SIMPLE_CONFIG
  INFO  org.apache.solr.rest.ManagedResourceStorage  – No data found for
  znode /configs/SIMPLE_CONFIG/_rest_managed.json
  INFO  org.apache.solr.rest.ManagedResourceStorage  – Loaded null at path
  _rest_managed.json using ZooKeeperStorageIO:path=/configs/SIMPLE_CONFIG
  INFO  org.apache.solr.rest.RestManager  – Initializing 0 registered
  ManagedResources
  INFO  org.apache.solr.handler.ReplicationHandler  – Commits will be
  reserved for  1
  INFO  org.apache.solr.core.SolrCore  – [mycollection1] Registered new
  searcher Searcher@3208a6c4[mycollection1]
  main{ExitableDirectoryReader(UninvertingDirectoryReader())}
  ERROR org.apache.solr.core.CoreContainer  – Error creating core
  [mycollection1]: This conf directory is not valid
  org.apache.solr.common.SolrException: This conf directory is not valid
  at
 
 org.apache.solr.cloud.ZkController.registerConfListenerForCore(ZkController.java:2229)
  at
  org.apache.solr.core.SolrCore.registerConfListener(SolrCore.java:2633)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:936)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:662)
  at
  org.apache.solr.core.CoreContainer.create(CoreContainer.java:513)
  at
  org.apache.solr.core.CoreContainer.create(CoreContainer.java:488)
  at
 
 org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:573)
  at
 
 org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:197)
  at
 
 org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
  at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:736)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
  at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
  at
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
  at
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
  at
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at
 
 

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
I entirely agree with Erick -- it is best to isolate Tika in its own jvm if you 
can -- bad things can happen if you don't [1] [2].

Erick's blog on SolrJ is fantastic.  If you want to have Tika parse embedded 
documents/attachments, make sure to set the parser in the ParseContext before 
parsing:

ParseContext context = new ParseContext();
// add this line:
context.set(Parser.class, _autoParser);
InputStream input = new FileInputStream(file);
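
For completeness, a slightly fuller (untested) sketch of the same idea with Tika's
AutoDetectParser -- variable names (_autoParser, file) mirror the snippet above,
classes are from org.apache.tika.*, and exception handling is omitted:

AutoDetectParser _autoParser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, _autoParser);            // so embedded docs/attachments are parsed too
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler(-1);   // -1 = no write limit
try (InputStream input = new FileInputStream(file)) {
    _autoParser.parse(input, handler, metadata, context);
}
String extractedText = handler.toString();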

Tika 1.8 is soon to be released.  If that doesn't fix your problems, please 
submit stacktraces (and docs, if possible) to the Tika jira, and we'll try to 
make the fixes.  

Cheers,

Tim

[1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf 
[2] 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
 
-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 7:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Erick,

I tried indexing both ways - SolrJ with Tika's AutoParser, as well as
SolrCell's ExtractingRequestHandler. The majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing with either a PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks & Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

 There's quite a discussion here:
 https://issues.apache.org/jira/browse/SOLR-7137

 But, I personally am not a huge fan of pushing all the work on to Solr, in
 a
 production environment the Solr server is responsible for indexing,
 parsing the
 docs through Tika, perhaps searching etc. This doesn't scale all that well.

 So an alternative is to use SolrJ with Tika, which is totally independent
 of
 what version of Tika is on the Solr server. Here's an example.

 http://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
 vijaya.bhoomire...@whishworks.com wrote:
  Thanks everyone for the responses. Now I am able to index PDF documents
  successfully. I have implemented manual extraction using Tika's
 AutoParser
  and PDF functionality is working fine. However,  the error with some MS
  office word documents still persist.
 
  The error message is java.lang.IllegalArgumentException: This paragraph
 is
  not the first one in the table which will eventually result in
 Unexpected
  RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
 
  Upon some reading, it looks like its a bug with Tika 1.5 and seems to
 have
  been fixed with Tika 1.6 (
 https://issues.apache.org/jira/browse/TIKA-1251 ).
  I am new to Solr / Tika and hence wondering whether I can change the Tika
  library alone to v1.6 without impacting any of the libraries within Solr
  4.10.2? Please let me know your response and how to get away with this
  issue.
 
  Many thanks in advance.
 
  Thanks  Regards
  Vijay
 
 
  On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
 
  Vijay,
 
  You could try different excel files with different formats to rule out
 the
  issue is with TIKA version being used.
 
  Thanks
  Murthy
 
  On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
  wrote:
 
   Perhaps the PDF is protected and the content can not be extracted?
  
   i have an unverified suspicion that the tika shipped with solr 4.10.2
 may
   not support some/all office 2013 document formats.
  
  
  
  
  
   On 4/14/2015 8:18 PM, Jack Krupansky wrote:
  
   Try doing a manual extraction request directly to Solr (not via
 SolrJ)
  and
   use the extractOnly option to see if the content is actually
 extracted.
  
   See:
   https://cwiki.apache.org/confluence/display/solr/
   Uploading+Data+with+Solr+Cell+using+Apache+Tika
  
   Also, some PDF files actually have the content as a bitmap image, so
 no
   text is extracted.
  
  
   -- Jack Krupansky
  
   On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
 
   vijaya.bhoomire...@whishworks.com wrote:
  
Hi,
  
   I am trying to index PDF and Microsoft Office files (.doc, .docx,
 .ppt,
   .pptx, .xlx, and .xlx) files into Solr. I am facing the following
  issues.
   Request to please let me know what is going wrong with the indexing
   process.
  
   I am using solr 4.10.2 and using the default example server
  configuration
   that comes with Solr distribution.
  
   PDF Files - Indexing as such works fine, but when I query using *.*
 in
   the
   Solr Query console, metadata information is displayed properly.
  However,
   the PDF content field is empty. This is happening for all PDF files
 

Merge indexes in MapReduce

2015-04-16 Thread Norgorn
Is there a ready-to-use tool to merge existing indexes in map-reduce?
We have real-time search and want to merge (and optimize) its indexes into
one, so we don't need to build index in Map-Reduce, but only merge it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Merge-indexes-in-MapReduce-tp4200106.html
Sent from the Solr - User mailing list archive at Nabble.com.


5.1 'unique' facet function / calcDistinct

2015-04-16 Thread levanDev
Hello, 

We are looking at a couple of options for using Solr to dynamically calculate
unique values per field. In testing out Solr 5.1, I've been using the
unique() facet function:

http://yonik.com/solr-facet-functions/

Overall, loving the JSON Facet API, especially the sub-faceting thus far. 

Here's my two part question:

I. When I use the unique aggregation function on a string field
(uniqueValues:'unique(myStringField)'), it works as expected and returns the
number of unique values. However, when I pass in an int -- or date -- field
(uniqueValues:'unique(myIntField)'), the resulting count is 0. The cause
might be something else, but if it can be replicated by another user, it would
be great to discuss the unique function further -- in our current use case,
we have a field where under 20 unique values are present, but the values are
ints.
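
For reference, the request we are sending looks roughly like this (the field
name is just a placeholder):

curl http://localhost:8983/solr/mycollection/query -d '
q=*:*&rows=0&json.facet={
  uniqueValues : "unique(myStringField)"
}'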

II. Is there a way to use the stats.calcdistinct functionality and only
return the countDistinct portion of the response and not the full list of
distinct values -- as provided in the distinctValues portion of the
response. In a field with high cardinality the response size becomes too
large. 

If there is no such option, could someone point me in the right direction
for implementing a custom solution?

Thank you for your time,
Levan



--
View this message in context: 
http://lucene.472066.n3.nabble.com/5-1-unique-facet-function-calcDistinct-tp4200110.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
This sounds like a Tika issue; let's move the discussion to that list.

If you are still having problems after you upgrade to Tika 1.8, please at least 
submit the stack traces (if you can) to the Tika jira.  We may be able to find 
a document that triggers that stack trace in govdocs1 or the slice of 
CommonCrawl that Julien Nioche contributed to our eval effort.

Tika is not perfect and it will fail on some files, but we are always working 
to improve it.

Best,

  Tim

-Original Message-
From: Vijaya Narayana Reddy Bhoomi Reddy 
[mailto:vijaya.bhoomire...@whishworks.com] 
Sent: Thursday, April 16, 2015 7:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Thanks Allison.

I tried with the mentioned changes. But still no luck. I am using the code
from lucidworks site provided by Erick and now included the changes
mentioned by you. But still the issue persists with a small percentage of
documents (both PDF and MS Office documents) failing. Unfortunately, these
documents are proprietary and client-confidential and hence I am not sure
whether they can be uploaded into Jira.

These files normally open in Adobe Reader and MS Office tools.

Thanks & Regards
Vijay


On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:

 I entirely agree with Erick -- it is best to isolate Tika in its own jvm
 if you can -- bad things can happen if you don't [1] [2].

 Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
 embedded documents/attachments, make sure to set the parser in the
 ParseContext before parsing:

 ParseContext context = new ParseContext();
 //add this line:
 context.set(Parser.class, _autoParser)
  InputStream input = new FileInputStream(file);

 Tika 1.8 is soon to be released.  If that doesn't fix your problems,
 please submit stacktraces (and docs, if possible) to the Tika jira, and
 we'll try to make the fixes.

 Cheers,

 Tim

 [1]
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
 [2]
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:10 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
 are getting parsed properly and indexed into Solr. However, a minority of
 them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents can be
 indexed. The business requirement we have is to index all the documents.
 However, if a small percentage of them fails, not sure what other ways
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

  There's quite a discussion here:
  https://issues.apache.org/jira/browse/SOLR-7137
 
  But, I personally am not a huge fan of pushing all the work on to Solr,
 in
  a
  production environment the Solr server is responsible for indexing,
  parsing the
  docs through Tika, perhaps searching etc. This doesn't scale all that
 well.
 
  So an alternative is to use SolrJ with Tika, which is totally independent
  of
  what version of Tika is on the Solr server. Here's an example.
 
  http://lucidworks.com/blog/indexing-with-solrj/
 
  Best,
  Erick
 
  On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
  vijaya.bhoomire...@whishworks.com wrote:
   Thanks everyone for the responses. Now I am able to index PDF documents
   successfully. I have implemented manual extraction using Tika's
  AutoParser
   and PDF functionality is working fine. However,  the error with some MS
   office word documents still persist.
  
   The error message is java.lang.IllegalArgumentException: This
 paragraph
  is
   not the first one in the table which will eventually result in
  Unexpected
   RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
  
   Upon some reading, it looks like its a bug with Tika 1.5 and seems to
  have
   been fixed with Tika 1.6 (
  https://issues.apache.org/jira/browse/TIKA-1251 ).
   I am new to Solr / Tika and hence wondering whether I can change the
 Tika
   library alone to v1.6 without impacting any of the libraries within
 Solr
   4.10.2? Please let me know your response and how to get away with this
   issue.
  
   Many thanks in advance.
  
   Thanks  Regards
   Vijay
  
  
   On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
  
   Vijay,
  
   You could try different excel files with different formats to rule out
  the
   issue is with TIKA version being used.
  
   Thanks
   Murthy
  
   On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
   wrote:
  

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
+1 

:)

PS: one more thing - please tell your management that you will never 
ever successfully parse all real-world PDFs, and cater for that fact in your 
requirements :-)



No servers hosting shard.

2015-04-16 Thread Modassar Ather
Hi,

I have a setup of a 5-node SolrCloud (Lucene/Solr version 5.1.0) without
replicas. When I am executing complex and large queries with wild-cards,
after some time I am getting the following exceptions.
The index size on each of the nodes is around 170GB and the memory is set to
-Xms20g -Xmx24g on each node.

Empty shard!
org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:214)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:184)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

There is no OutofMemory or any other major lead for me to understand what
had caused it. May be I am missing something. There are following other
exceptions:

SEVERE: null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Timeout occurred while
waiting response from server at: http://server:8080/solr/collection
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:342)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:193)
at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:313)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


WARNING: listener throws error
org.apache.solr.common.SolrException:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/collection/params.json
at
org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:163)
at
org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:919)
at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2500)
at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2366)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/collection/params.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at
org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
at
org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at
org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
at
org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:153)
... 3 more

The Zookeeper session timeout is set to 3. In the log file I can see
logs of the following pattern for all the queries I fired.
INFO: [collection] webapp=/solr path=/search_handler
params={sort=score+desc&start=0&q=(ft:search term)} status=0 QTime=time
If I am not wrong they are getting executed but somehow as the shard is
gone down which I can see in /clusterstate.json under the log, the search
is 

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Erick,

I tried indexing both ways - SolrJ with Tika's AutoParser, as well as
SolrCell's ExtractingRequestHandler. The majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing with either a PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks & Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

 There's quite a discussion here:
 https://issues.apache.org/jira/browse/SOLR-7137

 But, I personally am not a huge fan of pushing all the work on to Solr, in
 a
 production environment the Solr server is responsible for indexing,
 parsing the
 docs through Tika, perhaps searching etc. This doesn't scale all that well.

 So an alternative is to use SolrJ with Tika, which is totally independent
 of
 what version of Tika is on the Solr server. Here's an example.

 http://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
 vijaya.bhoomire...@whishworks.com wrote:
  Thanks everyone for the responses. Now I am able to index PDF documents
  successfully. I have implemented manual extraction using Tika's
 AutoParser
  and PDF functionality is working fine. However,  the error with some MS
  office word documents still persist.
 
  The error message is java.lang.IllegalArgumentException: This paragraph
 is
  not the first one in the table which will eventually result in
 Unexpected
  RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
 
  Upon some reading, it looks like its a bug with Tika 1.5 and seems to
 have
  been fixed with Tika 1.6 (
 https://issues.apache.org/jira/browse/TIKA-1251 ).
  I am new to Solr / Tika and hence wondering whether I can change the Tika
  library alone to v1.6 without impacting any of the libraries within Solr
  4.10.2? Please let me know your response and how to get away with this
  issue.
 
  Many thanks in advance.
 
  Thanks  Regards
  Vijay
 
 
  On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
 
  Vijay,
 
  You could try different excel files with different formats to rule out
 the
  issue is with TIKA version being used.
 
  Thanks
  Murthy
 
  On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
  wrote:
 
   Perhaps the PDF is protected and the content can not be extracted?
  
   i have an unverified suspicion that the tika shipped with solr 4.10.2
 may
   not support some/all office 2013 document formats.
  
  
  
  
  
   On 4/14/2015 8:18 PM, Jack Krupansky wrote:
  
   Try doing a manual extraction request directly to Solr (not via
 SolrJ)
  and
   use the extractOnly option to see if the content is actually
 extracted.
  
   See:
   https://cwiki.apache.org/confluence/display/solr/
   Uploading+Data+with+Solr+Cell+using+Apache+Tika
  
   Also, some PDF files actually have the content as a bitmap image, so
 no
   text is extracted.
  
  
   -- Jack Krupansky
  
   On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
 
   vijaya.bhoomire...@whishworks.com wrote:
  
Hi,
  
   I am trying to index PDF and Microsoft Office files (.doc, .docx,
 .ppt,
   .pptx, .xlx, and .xlx) files into Solr. I am facing the following
  issues.
   Request to please let me know what is going wrong with the indexing
   process.
  
   I am using solr 4.10.2 and using the default example server
  configuration
   that comes with Solr distribution.
  
   PDF Files - Indexing as such works fine, but when I query using *.*
 in
   the
   Solr Query console, metadata information is displayed properly.
  However,
   the PDF content field is empty. This is happening for all PDF files
 I
   have
   tried. I have tried with some proprietary files, PDF eBooks etc.
  Whatever
   be the PDF file, content is not being displayed.
  
   MS Office files -  For some office files, everything works perfect
 and
   the
   extracted content is visible in the query console. However, for
  others, I
   see the below error message during the indexing process.
  
   *Exception in thread main
  
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
   org.apache.tika.exception.TikaException: Unexpected RuntimeException
   from
   org.apache.tika.parser.microsoft.OfficeParser*
  
  
   I am using SolrJ to index the documents and below is the code
 snippet
   related to indexing. Please let me know where the issue is
 occurring.
  
static String solrServerURL = 
   http://localhost:8983/solr;;
   static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest
 indexingReq
  =
   new
  
ContentStreamUpdateRequest(/update/extract);
  
   

Re: Indexing PDF and MS Office files

2015-04-16 Thread Siegfried Goeschl

Hi Vijay,

I know this road too well :-)

For PDF you can fall back to other tools for text extraction:

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)

If you start command line tools from your JVM please have a look at 
commons-exec :-)
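
A minimal (untested) sketch of calling pdftotext through commons-exec -- the
binary and file names are placeholders:

CommandLine cmd = new CommandLine("pdftotext");
cmd.addArgument("input.pdf");
cmd.addArgument("output.txt");
DefaultExecutor executor = new DefaultExecutor();
executor.setWatchdog(new ExecuteWatchdog(60000L));  // kill runaway extractions
int exitCode = executor.execute(cmd);                // 0 on success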


Cheers,

Siegfried Goeschl

PS: one more thing - please tell your management that you will never 
ever successfully parse all real-world PDFs, and cater for that fact in your 
requirements :-)


On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks & Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:


There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But, I personally am not a huge fan of pushing all the work on to Solr, in
a
production environment the Solr server is responsible for indexing,
parsing the
docs through Tika, perhaps searching etc. This doesn't scale all that well.

So an alternative is to use SolrJ with Tika, which is totally independent
of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:

Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's

AutoParser

and PDF functionality is working fine. However,  the error with some MS
office word documents still persist.

The error message is java.lang.IllegalArgumentException: This paragraph

is

not the first one in the table which will eventually result in

Unexpected

RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

Upon some reading, it looks like its a bug with Tika 1.5 and seems to

have

been fixed with Tika 1.6 (

https://issues.apache.org/jira/browse/TIKA-1251 ).

I am new to Solr / Tika and hence wondering whether I can change the Tika
library alone to v1.6 without impacting any of the libraries within Solr
4.10.2? Please let me know your response and how to get away with this
issue.

Many thanks in advance.

Thanks  Regards
Vijay


On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:


Vijay,

You could try different excel files with different formats to rule out

the

issue is with TIKA version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
wrote:


Perhaps the PDF is protected and the content can not be extracted?

i have an unverified suspicion that the tika shipped with solr 4.10.2

may

not support some/all office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:


Try doing a manual extraction request directly to Solr (not via

SolrJ)

and

use the extractOnly option to see if the content is actually

extracted.


See:
https://cwiki.apache.org/confluence/display/solr/
Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so

no

text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy



vijaya.bhoomire...@whishworks.com wrote:

  Hi,


I am trying to index PDF and Microsoft Office files (.doc, .docx,

.ppt,

.pptx, .xlx, and .xlx) files into Solr. I am facing the following

issues.

Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server

configuration

that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.*

in

the
Solr Query console, metadata information is displayed properly.

However,

the PDF content field is empty. This is happening for all PDF files

I

have
tried. I have tried with some proprietary files, PDF eBooks etc.

Whatever

be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect

and

the
extracted content is visible in the query console. However, for

others, I

see the below error message during the indexing process.

*Exception in thread main


org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:

org.apache.tika.exception.TikaException: Unexpected RuntimeException
from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code

snippet

related to 

check If I am Still Leader

2015-04-16 Thread Adir Ben Ami

Hi,

I am using Solr 4.10.0 with tomcat and embedded Zookeeper.
I use SolrCloud in my system.

Each shard machine tries to reach/connect to the other cluster machines in order to 
index the document; it just checks if it is still the leader.
I don't use replication, so why does it have to check who is the leader?
How can I bypass this constraint and make my SolrCloud not use 
ClusterStateUpdater.checkIfIamStillLeader when I am indexing?

Thanks,
Adir.   
  

Escaping in update XML messages

2015-04-16 Thread Jens Brandt
Hi,

I am trying to delete some documents from my index by posting XML messages to 
Solr. The unique key for the documents in my index is their url. The XML 
messages look like this:

<delete><query>url:"http://example.com/path/file"</query></delete>

For simple urls everything works fine, but if the url contains an '&' like this:

<delete><query>url:"http://example.com/path/file?a=foo&b=bar"</query></delete>

an error occurs because the XML is not valid:

com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '=' (code 
61); expected a semi-colon after the reference for entity 'b'

Escaping '&' by using '&amp;' does not help, because the query

<delete><query>url:"http://example.com/path/file?a=foo&amp;b=bar"</query></delete>

does not match the url in my index.

How do I need to escape or encode the url in the XML message?

Thank you!
  Jens








Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks Allison.

I tried with the mentioned changes. But still no luck. I am using the code
from lucidworks site provided by Erick and now included the changes
mentioned by you. But still the issue persists with a small percentage of
documents (both PDF and MS Office documents) failing. Unfortunately, these
documents are proprietary and client-confidential and hence I am not sure
whether they can be uploaded into Jira.

These files normally open in Adobe Reader and MS Office tools.

Thanks & Regards
Vijay


On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:

 I entirely agree with Erick -- it is best to isolate Tika in its own jvm
 if you can -- bad things can happen if you don't [1] [2].

 Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
 embedded documents/attachments, make sure to set the parser in the
 ParseContext before parsing:

 ParseContext context = new ParseContext();
 //add this line:
 context.set(Parser.class, _autoParser)
  InputStream input = new FileInputStream(file);

 Tika 1.8 is soon to be released.  If that doesn't fix your problems,
 please submit stacktraces (and docs, if possible) to the Tika jira, and
 we'll try to make the fixes.

 Cheers,

 Tim

 [1]
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
 [2]
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:10 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
 are getting parsed properly and indexed into Solr. However, a minority of
 them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents can be
 indexed. The business requirement we have is to index all the documents.
 However, if a small percentage of them fails, not sure what other ways
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

  There's quite a discussion here:
  https://issues.apache.org/jira/browse/SOLR-7137
 
  But, I personally am not a huge fan of pushing all the work on to Solr,
 in
  a
  production environment the Solr server is responsible for indexing,
  parsing the
  docs through Tika, perhaps searching etc. This doesn't scale all that
 well.
 
  So an alternative is to use SolrJ with Tika, which is totally independent
  of
  what version of Tika is on the Solr server. Here's an example.
 
  http://lucidworks.com/blog/indexing-with-solrj/
 
  Best,
  Erick
 
  On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
  vijaya.bhoomire...@whishworks.com wrote:
   Thanks everyone for the responses. Now I am able to index PDF documents
   successfully. I have implemented manual extraction using Tika's
  AutoParser
   and PDF functionality is working fine. However,  the error with some MS
   office word documents still persist.
  
   The error message is java.lang.IllegalArgumentException: This
 paragraph
  is
   not the first one in the table which will eventually result in
  Unexpected
   RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
  
   Upon some reading, it looks like its a bug with Tika 1.5 and seems to
  have
   been fixed with Tika 1.6 (
  https://issues.apache.org/jira/browse/TIKA-1251 ).
   I am new to Solr / Tika and hence wondering whether I can change the
 Tika
   library alone to v1.6 without impacting any of the libraries within
 Solr
   4.10.2? Please let me know your response and how to get away with this
   issue.
  
   Many thanks in advance.
  
   Thanks  Regards
   Vijay
  
  
   On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:
  
   Vijay,
  
   You could try different excel files with different formats to rule out
  the
   issue is with TIKA version being used.
  
   Thanks
   Murthy
  
   On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
   wrote:
  
Perhaps the PDF is protected and the content can not be extracted?
   
i have an unverified suspicion that the tika shipped with solr
 4.10.2
  may
not support some/all office 2013 document formats.
   
   
   
   
   
On 4/14/2015 8:18 PM, Jack Krupansky wrote:
   
Try doing a manual extraction request directly to Solr (not via
  SolrJ)
   and
use the extractOnly option to see if the content is actually
  extracted.
   
See:
https://cwiki.apache.org/confluence/display/solr/
Uploading+Data+with+Solr+Cell+using+Apache+Tika
   
Also, some PDF files actually have the content as a bitmap image,
 so
  no
text is extracted.
   
   
-- Jack Krupansky
   
On Tue, Apr 

SolrCloud - Collection Browsing

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi,

I have setup a SolrCloud on 3 machines - machine1, machine2 and machine3.
The DirectoryFactory used is HDFS where the collection index data is stored
in HDFS within a Hadoop cluster.

SolrCloud has been set up successfully and everything looks fine so far. I
have uploaded the default configuration i.e. the conf folder under
example/collection1 folder under the solr installation directory into
Zookeeper. Essentially, I have uploaded the default configuration into
Zookeeper.

Now when I log in to Solr Admin using http://machine1:8983/solr/admin, I am
able to see the SolrAdmin page and when I click on Cloud, I could see all
the shards and replications properly in the browser.

However, the issue comes when I try to open the page
http://machine1:8983/solr/mycollection/browse. I am seeing a HTTP 500 lazy
loading error. This looks like a trivial mistake somewhere as the
collection is setup fine and everything works normal. However, when I
browse the collection, this error occurs. Even when I open
http://machine1:8983/solr/mycollection/query I am getting the json response
properly with numFound as 0

I was expecting similar behavior like how the /browse request provides the
Solritas page.

Note: I haven't changed any of the configuration in the conf directory.
Should I modify solrconfig.xml to have a RequestHandler for
/mycollection/browse or the default one be sufficient?

Can someone provide some pointers please to get this issue resolved?

Thanks & Regards
Vijay

-- 
The contents of this e-mail are confidential and for the exclusive use of 
the intended recipient. If you receive this e-mail in error please delete 
it from your system immediately and notify us either by e-mail or 
telephone. You should not copy, forward or otherwise disclose the content 
of the e-mail. The views expressed in this communication may not 
necessarily be the view held by WHISHWORKS.


Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks Tim.

I shall raise a Jira with the stack trace information.

Thanks & Regards
Vijay


On 16 April 2015 at 12:54, Allison, Timothy B. talli...@mitre.org wrote:

 This sounds like a Tika issue, let's move discussion to that list.

 If you are still having problems after you upgrade to Tika 1.8, please at
 least submit the stack traces (if you can) to the Tika jira.  We may be
 able to find a document that triggers that stack trace in govdocs1 or the
 slice of CommonCrawl that Julien Nioche contributed to our eval effort.

 Tika is not perfect and it will fail on some files, but we are always
 working to improve it.

 Best,

   Tim

 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:44 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Thanks Allison.

 I tried with the mentioned changes. But still no luck. I am using the code
 from lucidworks site provided by Erick and now included the changes
 mentioned by you. But still the issue persists with a small percentage of
 documents (both PDF and MS Office documents) failing. Unfortunately, these
 documents are proprietary and client-confidential and hence I am not sure
 whether they can be uploaded into Jira.

 These files normally open in Adobe Reader and MS Office tools.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org wrote:

  I entirely agree with Erick -- it is best to isolate Tika in its own jvm
  if you can -- bad things can happen if you don't [1] [2].
 
  Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
  embedded documents/attachments, make sure to set the parser in the
  ParseContext before parsing:
 
  ParseContext context = new ParseContext();
  //add this line:
  context.set(Parser.class, _autoParser)
   InputStream input = new FileInputStream(file);
 
  Tika 1.8 is soon to be released.  If that doesn't fix your problems,
  please submit stacktraces (and docs, if possible) to the Tika jira, and
  we'll try to make the fixes.
 
  Cheers,
 
  Tim
 
  [1]
 
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
  [2]
 
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
  -Original Message-
  From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
  vijaya.bhoomire...@whishworks.com]
  Sent: Thursday, April 16, 2015 7:10 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Indexing PDF and MS Office files
 
  Erick,
 
  I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
  SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
  are getting parsed properly and indexed into Solr. However, a minority of
  them keep failing wither PDFParser or OfficeParser error.
 
  Not sure if this behaviour can be modified so that all the documents can
 be
  indexed. The business requirement we have is to index all the documents.
  However, if a small percentage of them fails, not sure what other ways
  exist to index them.
 
  Any help please?
 
 
  Thanks  Regards
  Vijay
 
 
 
  On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
 wrote:
 
   There's quite a discussion here:
   https://issues.apache.org/jira/browse/SOLR-7137
  
   But, I personally am not a huge fan of pushing all the work on to Solr,
  in
   a
   production environment the Solr server is responsible for indexing,
   parsing the
   docs through Tika, perhaps searching etc. This doesn't scale all that
  well.
  
   So an alternative is to use SolrJ with Tika, which is totally
 independent
   of
   what version of Tika is on the Solr server. Here's an example.
  
   http://lucidworks.com/blog/indexing-with-solrj/
  
   Best,
   Erick
  
   On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
   vijaya.bhoomire...@whishworks.com wrote:
Thanks everyone for the responses. Now I am able to index PDF
 documents
successfully. I have implemented manual extraction using Tika's
   AutoParser
and PDF functionality is working fine. However,  the error with some
 MS
office word documents still persist.
   
The error message is java.lang.IllegalArgumentException: This
  paragraph
   is
not the first one in the table which will eventually result in
   Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
   
Upon some reading, it looks like its a bug with Tika 1.5 and seems to
   have
been fixed with Tika 1.6 (
   https://issues.apache.org/jira/browse/TIKA-1251 ).
I am new to Solr / Tika and hence wondering whether I can change the
  Tika
library alone to v1.6 without impacting any of the libraries within
  Solr
4.10.2? Please let me know your response and how to get away with
 this
issue.
   
Many thanks in advance.
   
Thanks  Regards
Vijay
   
   
On 15 April 2015 at 

Re: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

2015-04-16 Thread elisabeth benoit
For the record, what I finally did is place those words I want spellcheck
to ignore in spellcheck.collateParam.fq and the words I'd like to be
checked in spellcheck.q. collationQuery uses spellcheck.collateParam.fq so
all did_you_mean queries return results containing words in
spellcheck.collateParam.fq.
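
In other words, the requests look roughly like this (field name and terms are
just an example):

q=rue de rivoly paris&spellcheck=true&spellcheck.q=rivoly&spellcheck.collate=true&spellcheck.collateParam.fq=town:paris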

Best regards,
Elisabeth



2015-04-14 17:05 GMT+02:00 elisabeth benoit elisaelisael...@gmail.com:

 Thanks for your answer!

 I didn't realize this was not supposed to be done (conjunction of
 DirectSolrSpellChecker and FileBasedSpellChecker). I got this idea in the
 mailing list while searching for a solution to get a list of words to
 ignore for the DirectSolrSpellChecker.

 Well well well, I'll try removing the check and see what happens. I'm not
 a java programmer, but if I can find a simple solution I'll let you know.

 Thanks again,
 Elisabeth

 2015-04-14 16:29 GMT+02:00 Dyer, James james.d...@ingramcontent.com:

 Elisabeth,

 Currently ConjunctionSolrSpellChecker only supports adding
 WordBreakSolrSpellchecker to IndexBased- FileBased- or
 DirectSolrSpellChecker.  In the future, it would be great if it could
 handle other Spell Checker combinations.  For instance, if you had a
 (e)dismax query that searches multiple fields, to have a separate
 spellchecker for each of them.

 But CSSC is not hardened for this more general usage, as hinted in the
 API doc.  The check done to ensure all spellcheckers use the same
 stringdistance object, I believe, is a safeguard against using this class
 for functionality it is not able to correctly support.  It looks to me that
 SOLR-6271 was opened to fix the bug in that it is comparing references on
 the stringdistance.  This is not a problem with WBSSC because this one does
 not support string distance at all.

 What you're hoping for, however, is that the requirement for the string
 distances be the same to be removed entirely.  You could try modifying the
 code by removing the check.  However beware that you might not get the
 results you desire!  But should this happen, please, go ahead and fix it
 for your use case and then donate the code.  This is something I've
 personally wanted for a long time.

 James Dyer
 Ingram Content Group


 -Original Message-
 From: elisabeth benoit [mailto:elisaelisael...@gmail.com]
 Sent: Tuesday, April 14, 2015 7:37 AM
 To: solr-user@lucene.apache.org
 Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr
 4.10.1

 Hello,

 I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and
 FileBasedSpellchecker in same request.

 I've applied change from patch 135.patch (cf Solr-6271). I've tried
 running
 the command patch -p1 -i 135.patch --dry-run but it didn't work, maybe
 because the patch was a fix to Solr 4.9, so I just replaced line in
 ConjunctionSolrSpellChecker

 else if (!stringDistance.equals(checker.getStringDistance())) {
   throw new IllegalArgumentException(
       "All checkers need to use the same StringDistance.");
 }


 by

 else if (!stringDistance.equals(checker.getStringDistance())) {
   throw new IllegalArgumentException(
       "All checkers need to use the same StringDistance!!! 1: " +
       checker.getStringDistance() + " 2: " + stringDistance);
 }

 as it was done in the patch

 but still, when I send a spellcheck request, I get the error

 msg: All checkers need to use the same StringDistance!!!
 1: org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db3 2:
 org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08

 From the error message I gather that both spellcheckers use the same distance
 measure, LuceneLevenshteinDistance, but they're not the same instance of
 LuceneLevenshteinDistance.

 Is the condition all right? What should be done to fix this properly?

 Thanks,
 Elisabeth





Re: Differentiating user search term in Solr

2015-04-16 Thread Shawn Heisey
On 4/16/2015 7:09 AM, Steven White wrote:
 I cannot use escapeQueryChars method because my app interacts with Solr via
 REST.
 
 The summary of your email is: client's must escape search string to prevent
 Solr from failing.
 
 It would be a nice addition to Solr to provide a new query parameter that
 tells it to treat the query text as literal text.  Doing so, means you
 remove the burden placed on clients to understand and escape reserved Solr
 / Lucene tokens.

That's a good idea, although we might already have that.

I wonder what happens if you include defType=term with your request?
That works for edismax, it might work for other query parsers, at least
on the q parameter.

Thanks,
Shawn



Re: check If I am Still Leader

2015-04-16 Thread Shawn Heisey
On 4/16/2015 7:08 AM, Adir Ben Ami wrote:
 I am using Solr 4.10.0 with tomcat and embedded Zookeeper.
 I use SolrCloud in my system.

 Each Shard machine try to reach/connect with other cluster machines in order 
 to index the document ,it just checks if it is still the leader.
  I don't use replication so why does it has to check who is the leader?
 How can I bypass this constraint and make my solrcloud not use 
 ClusterStateUpdater.checkIfIamStillLeader when i am indexing?

You might not need that functionality, but Solr must address the general
case, which includes multiple replicas for each shard, where one of them
will be the leader.

I hope this is a test installation ... running in production without
fault tolerance is a bad idea.  Using the embedded zookeeper in
production is another bad idea, for the same reason - fault tolerance.

You can file an issue in Jira for a configuration mode where the leader
check is disabled.  I would oppose having that happen automatically ...
another replica could be added to the cloud at any time.

Thanks,
Shawn



Re: check If I am Still Leader

2015-04-16 Thread Shawn Heisey
On 4/16/2015 7:42 AM, Adir Ben Ami wrote:
 I have not mentioned before that the index requests are always routed to a specific 
 machine.
 Is there a way to avoid connectivity from the node to all other nodes? 

That capability has been added in Solr 5.1.0.

https://issues.apache.org/jira/browse/SOLR-6832

Thanks,
Shawn



Batch collecting in PostFilter

2015-04-16 Thread ha.pham
Hi all,

I am implementing a PostFilter following this article
https://lucidworks.com/blog/custom-security-filtering-in-solr/

We have a requirement to call the external system only once for all the 
documents (max 200) so below is my change:

-don't call super.collect(docId) in the collect method of the PostFilter but 
store all docIds in an internal map

-call the external system in the finish() then call super.collect(docId) for 
all the docs that pass the external filtering

The problem I have: docId exceeds maxDoc ("docID must be >= 0 and < 
maxDoc=10 (got docID=123456)")

I suspect I am storing local docIds and when Reader is changed, docBase is also 
changed so the global docId, which I believe is constructed in super.collect() 
using the parameter docId and docBase, becomes incorrect.
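
What I think I need (rough, untested sketch; it assumes the Solr 4.x
DelegatingCollector API used in the article, and externalSystemAllows(...)
stands in for the single call to our external system) is something like:

public class BatchingCollector extends DelegatingCollector {
  private final List<Integer> globalDocs = new ArrayList<>();        // docBase + local id
  private final List<AtomicReaderContext> leaves = new ArrayList<>();

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    leaves.add(context);
    super.setNextReader(context);            // keeps docBase in sync
  }

  @Override
  public void collect(int doc) throws IOException {
    globalDocs.add(docBase + doc);           // store the GLOBAL id, not the local one
  }

  @Override
  public void finish() throws IOException {
    for (int globalDoc : externalSystemAllows(globalDocs)) {   // one external call
      int i = ReaderUtil.subIndex(globalDoc, leaves);
      super.setNextReader(leaves.get(i));    // point the delegate at the right segment
      super.collect(globalDoc - leaves.get(i).docBase);        // hand back a LOCAL id
    }
    if (delegate instanceof DelegatingCollector) {
      ((DelegatingCollector) delegate).finish();
    }
  }
}

i.e. store the global id (docBase + doc) while collecting, then map each
surviving global id back to its segment and give the delegate a segment-local
id in finish(). Is that the right direction?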

Could anyone point me to the right direction?

Thanks,

-Ha



Re: Differentiating user search term in Solr

2015-04-16 Thread Shawn Heisey
On 4/16/2015 7:49 AM, Steven White wrote:
 defType didn't work:


 http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=truedefType=lucene

 Gave me error:

 org.apache.solr.search.SyntaxError: Expected identifier at pos 27
 str='{!q.op=AND df=text solr sys'

 Is my use of defType correct?

If everything is at defaults and you don't have defType in the handler
definition, then defType=lucene doesn't do anything - it specifically
says use the lucene parser which is the default.  You want
defType=term instead.

Thanks,
Shawn



Re: How can I temporarily detach node from SolrCloud?

2015-04-16 Thread Shawn Heisey
On 4/16/2015 8:27 AM, Oded Sofer wrote:
 How can I detach node from SolrCloud (temporarily for maintenance and such 
 and attach it back after some time). We are using SolrCloud 4.10.0; One 
 Collection, and Shard per node. 
 The add-index is routed to specific machine base on our customize routing 
 logic (kind of hard-coded) 

I assume this is just one replica out of multiple ... if that's the
case, just shut the node down, do your maintenance, and bring it back
online.  SolrCloud will automatically make sure the index replica(s) on
the node are brought up to date to match the others.

If it's not one replica of multiple (that is, if it has the only copy of
one or more shards), then shutting it down will either reduce your
result set or cause queries to return an error, not sure which.

Thanks,
Shawn



Conditional Filter Queries

2015-04-16 Thread Tao, Jing
Hi,

I want to filter my search results by different date fields based on content 
type.
In other words: if contentType is A, filter out results that are older than 1 
year; if contentType is B, filter out results that are older than 2 years; 
otherwise, date does not matter.

Is that possible with fq parameters?
Would it be something like  fq=(contentType:A AND startDate:[NOW-1YEAR TO 
NOW]) OR (contentType:B AND startDate:[NOW-2YEAR TO NOW]) OR !contentType: 
(A or B)
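
Or, since a purely negative clause inside an OR usually needs a *:* in front of
it, perhaps (A and B stand for our actual content type values):

fq=(contentType:A AND startDate:[NOW-1YEAR TO NOW]) OR (contentType:B AND startDate:[NOW-2YEAR TO NOW]) OR (*:* -contentType:(A OR B))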

Is there a better way to do this?

Thanks,
Jing


Re: Differentiating user search term in Solr

2015-04-16 Thread Steven White
What is "term" in defType=term? Do you mean the raw word "term" or
something else?  Because I tried that too, in two different ways:

Using correct Solr syntax:


http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text}%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&defType=term

This throws a NPE exception:

java.lang.NullPointerException at

org.apache.solr.schema.IndexSchema$DynamicReplacement$DynamicPattern$NameEndsWith.matches(IndexSchema.java:1033)
at

org.apache.solr.schema.IndexSchema$DynamicReplacement.matches(IndexSchema.java:1047)
at
org.apache.solr.schema.IndexSchema.dynFieldType(IndexSchema.java:1303)
at

org.apache.solr.schema.IndexSchema.getFieldTypeNoEx(IndexSchema.java:1280)
at

org.apache.solr.search.TermQParserPlugin$1.parse(TermQParserPlugin.java:56)
at

And when I try it with invalid Solr search syntax:


http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&defType=term


This gives me the SyntaxError:

org.apache.solr.search.SyntaxError: Expected identifier at pos 27
str='{!q.op=AND df=text solr sys'

What am I missing?

Steve

On Thu, Apr 16, 2015 at 10:43 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 7:49 AM, Steven White wrote:
  defType didn't work:
 
 
 
 http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=truedefType=lucene
 
  Gave me error:
 
  org.apache.solr.search.SyntaxError: Expected identifier at pos 27
  str='{!q.op=AND df=text solr sys'
 
  Is my use of defType correct?

 If everything is at defaults and you don't have defType in the handler
 definition, then defType=lucene doesn't do anything - it specifically
 says use the lucene parser which is the default.  You want
 defType=term instead.

 Thanks,
 Shawn




Re: Differentiating user search term in Solr

2015-04-16 Thread Shawn Heisey
On 4/16/2015 9:37 AM, Steven White wrote:
 What is term in the defType=term, do you mean the raw word term or
 something else?  Because I tried that too in two different ways:

Oops.  I forgot that the term query parser (that's what term means --
the name of the query parser) requires that you specify the field you
are searching on, so that would be incomplete.  Try also setting the f
parameter to the field that you want to search.  I will not be surprised
if that doesn't work, though.

Thanks,
Shawn



Re: Merge indexes in MapReduce

2015-04-16 Thread Erick Erickson
You're stating two things that are somewhat antithetical:
1: "We have real-time search" and
2: "want to merge (and optimize) its indexes into one"

Needing to merge indexes implies (to me at least) that
you're not really doing NRT processing as docs in the batch
you're merging into your collection aren't searchable, thus not NRT.

I'm probably missing something obvious in your problem statement

The MapReduceIndexerTool probably doesn't quite do what you want
as its purpose is to add documents to the index and merge at the end...

You might get some value from the core admin API MERGEINDEXES call:
https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-MERGEINDEXES

But you have to be careful in a sharded situation to merge exactly
correctly. Plus,
merging indexes does NOT replace documents with a particular
uniqueKey that happens to be both in the source and dest indexes.

I wouldn't worry too much about optimization, despite its name it's
largely irrelevant at this point
unless you have a bunch of deleted documents in your index.

Best,
Erick
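
As a rough illustration of the CoreAdmin route, here is a sketch that issues the
MERGEINDEXES call over plain HTTP (host, core names and parameters are placeholders;
see the CoreAdmin API page linked above for the full parameter list):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class MergeIndexesCall {
    public static void main(String[] args) throws Exception {
        // Merge the contents of two source cores into "targetCore".
        String url = "http://localhost:8983/solr/admin/cores?action=MERGEINDEXES"
                + "&core=targetCore"
                + "&srcCore=sourceCore1"
                + "&srcCore=sourceCore2";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            System.out.println("MERGEINDEXES returned HTTP " + conn.getResponseCode());
        }
    }
}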


On Thu, Apr 16, 2015 at 4:14 AM, Norgorn lsunnyd...@mail.ru wrote:
 Is there a ready-to-use tool to merge existing indexes in map-reduce?
 We have real-time search and want to merge (and optimize) its indexes into
 one, so we don't need to build index in Map-Reduce, but only merge it.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Merge-indexes-in-MapReduce-tp4200106.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Differentiating user search term in Solr

2015-04-16 Thread Shawn Heisey
On 4/16/2015 10:10 AM, Steven White wrote:
 I don't follow what the f parameter is.  Do you have a link where I can
 read more about it?  I found this
 https://wiki.apache.org/solr/HighlightingParameters and
 https://wiki.apache.org/solr/SimpleFacetParameters but im not sure this is
 what you mean (I'm not doing highlighting for faceting).

It looks like this isn't going to work.  I just tried it on my index.

To see the reasoning behind what I was suggesting, click here:

https://cwiki.apache.org/confluence/display/solr/Other+Parsers

And then click on Term Query Parser in the third column of the list at
the top of that page.

The syntax for the localparams on this one is {!term f=field}querytext
... so I was hoping that f would work as a URL parameter, but from the
test I just did on Solr 4.9.1, that's not the case.

Thanks,
Shawn



Re: 1:M connectivity

2015-04-16 Thread Erick Erickson
You say "the SolrCloud API". Not entirely sure what that is, do you
mean the post.jar tool?

Because to get much more scalable throughput, you probably want to use SolrJ and
the CloudSolrServer class. That class takes a connection to Zookeeper and
does the right thing.

Best,
Erick

On Thu, Apr 16, 2015 at 7:19 AM, Oded Sofer odedso...@yahoo.com.invalid wrote:
 Given that the index are always routed to specific machine, is there a way to 
 avoid connectivity from the node to all other node.
 We are using Solr 4.10; the Add/Update Index uses SolrCloud API and always 
 added to the node that get API request for add-index (i.e., we are sending 
 the add index to the appropriate node that should get it).




Re: SolrCloud - Collection Browsing

2015-04-16 Thread Erick Erickson
Check that your config has a valid path to the velocity contrib. You
should see something like

<lib dir="${solr.install.dir:../../..}/contrib/velocity/lib" regex=".*\.jar" />

(from Solr 4.10). and you should also see the indicated file on each
of your Solr nodes.

What's the full stack BTW? I'm expecting something like a class not
found error somewhere
down in the stack.

Best,
Erick

On Thu, Apr 16, 2015 at 3:21 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:
 Hi,

 I have setup a SolrCloud on 3 machines - machine1, machine2 and machine3.
 The DirectoryFactory used is HDFS where the collection index data is stored
 in HDFS within a Hadoop cluster.

 SolrCloud has been set up successfully and everything looks fine so far. I
 have uploaded the default configuration i.e. the conf folder under
 example/collection1 folder under the solr installation directory into
 Zookeeper. Essentially, I have uploaded the default configuration into
 Zookeeper.

 Now when I log in to Solr Admin using http://machine1:8983/solr/admin, I am
 able to see the SolrAdmin page and when I click on Cloud, I could see all
 the shards and replications properly in the browser.

 However, the issue comes when I try to open the page
 http://machine1:8983/solr/mycollection/browse. I am seeing a HTTP 500 lazy
 loading error. This looks like a trivial mistake somewhere as the
 collection is setup fine and everything works normal. However, when I
 browse the collection, this error occurs. Even when I open
 http://machine1:8983/solr/mycollection/query I am getting the json response
 properly with numFound as 0

 I was expecting similar behavior like how the /browse request provides the
 Solritas page.

 Note: I haven't changed any of the configuration in the conf directory.
 Should I modify solrconfig.xml to have a RequestHandler for
 /mycollection/browse or the default one be sufficient?

 Can someone provide some pointers please to get this issue resolved?

 Thanks  Regards
 Vijay

 --
 The contents of this e-mail are confidential and for the exclusive use of
 the intended recipient. If you receive this e-mail in error please delete
 it from your system immediately and notify us either by e-mail or
 telephone. You should not copy, forward or otherwise disclose the content
 of the e-mail. The views expressed in this communication may not
 necessarily be the view held by WHISHWORKS.


RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
If you use pdftotext with a simple fork/exec per document, you will get about 5 
MB/s throughput on a single AMD x86_64. Much of that is because of the 
fork/exec. I suggest that you use HTML output and UTF-8 encoding for the 
PDF, because that way you can get title/keywords and such as http meta keywords 
(a sketch of the basic per-document call follows below).

If you have the appetite for something truly great, try:
 - Socket server listening for parsing requests
 - pass off accept() sockets to pre-forked children
 - in the children, use vfork rather than fork
 - tmpfs for outputted HTML documents
 - Tempting to implement using mod_perl and httpd, at least to me.
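
A minimal sketch of that per-document call using commons-exec (mentioned further down
in this thread); the pdftotext flags shown are the XPDF/poppler ones and the paths are
placeholders, so verify them against your build:

import java.io.File;
import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;

public class PdfToHtml {
    // Runs pdftotext on one PDF, writing UTF-8 HTML next to it so title/keywords
    // survive as meta tags.
    public static File extract(File pdf) throws Exception {
        File out = new File(pdf.getParentFile(), pdf.getName() + ".html");
        CommandLine cmd = new CommandLine("pdftotext");
        cmd.addArgument("-htmlmeta");            // simple HTML wrapper with meta info
        cmd.addArgument("-enc");
        cmd.addArgument("UTF-8");
        cmd.addArgument(pdf.getAbsolutePath());
        cmd.addArgument(out.getAbsolutePath());
        DefaultExecutor executor = new DefaultExecutor();
        executor.setExitValue(0);                 // any other exit code throws
        executor.execute(cmd);                    // blocks until the child process exits
        return out;
    }
}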

-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at] 
Sent: Thursday, April 16, 2015 7:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Hi Vijay,

I know the this road too well :-)

For PDF you can fallback to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)

If you start command line tools from your JVM please have a look at 
commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never ever 
successfully parse all real-world PDFs, and cater for that fact in your requirements 
:-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:
 Erick,

 I tried indexing both ways - SolrJ / Tika's AutoParser and as well as 
 SolrCell's ExtractRequestHandler. Majority of the PDF and Word 
 documents are getting parsed properly and indexed into Solr. However, 
 a minority of them keep failing wither PDFParser or OfficeParser error.

 Not sure if this behaviour can be modified so that all the documents 
 can be indexed. The business requirement we have is to index all the 
 documents.
 However, if a small percentage of them fails, not sure what other ways 
 exist to index them.

 Any help please?


 Thanks  Regards
 Vijay



 On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com wrote:

 There's quite a discussion here:
 https://issues.apache.org/jira/browse/SOLR-7137

 But, I personally am not a huge fan of pushing all the work on to 
 Solr, in a production environment the Solr server is responsible for 
 indexing, parsing the docs through Tika, perhaps searching etc. This 
 doesn't scale all that well.

 So an alternative is to use SolrJ with Tika, which is totally 
 independent of what version of Tika is on the Solr server. Here's an 
 example.

 http://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy 
 vijaya.bhoomire...@whishworks.com wrote:
 Thanks everyone for the responses. Now I am able to index PDF 
 documents successfully. I have implemented manual extraction using 
 Tika's
 AutoParser
 and PDF functionality is working fine. However,  the error with some 
 MS office word documents still persist.

 The error message is java.lang.IllegalArgumentException: This 
 paragraph
 is
 not the first one in the table which will eventually result in
 Unexpected
 RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

 Upon some reading, it looks like its a bug with Tika 1.5 and seems 
 to
 have
 been fixed with Tika 1.6 (
 https://issues.apache.org/jira/browse/TIKA-1251 ).
 I am new to Solr / Tika and hence wondering whether I can change the 
 Tika library alone to v1.6 without impacting any of the libraries 
 within Solr 4.10.2? Please let me know your response and how to get 
 away with this issue.

 Many thanks in advance.

 Thanks  Regards
 Vijay


 On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:

 Vijay,

 You could try different excel files with different formats to rule 
 out
 the
 issue is with TIKA version being used.

 Thanks
 Murthy

 On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes 
 trhodes...@gmail.com
 wrote:

 Perhaps the PDF is protected and the content can not be extracted?

 i have an unverified suspicion that the tika shipped with solr 
 4.10.2
 may
 not support some/all office 2013 document formats.





 On 4/14/2015 8:18 PM, Jack Krupansky wrote:

 Try doing a manual extraction request directly to Solr (not via
 SolrJ)
 and
 use the extractOnly option to see if the content is actually
 extracted.

 See:
 https://cwiki.apache.org/confluence/display/solr/
 Uploading+Data+with+Solr+Cell+using+Apache+Tika

 Also, some PDF files actually have the content as a bitmap image, 
 so
 no
 text is extracted.


 -- Jack Krupansky

 On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi 
 Reddy
 
 vijaya.bhoomire...@whishworks.com wrote:

   Hi,

 I am trying to index PDF and Microsoft Office files (.doc, 
 .docx,
 .ppt,
 .pptx, .xlx, and .xlx) files into Solr. I am facing the 
 following
 issues.
 Request to please let me know what is going wrong with the 
 indexing process.

 I am using solr 4.10.2 and using the default example server
 configuration
 that 

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
Indeed. Another solution is to purchase ABBYY or Nuance as a server, and have 
them do that work. You will even get OCR. Both offer a Linux SDK.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, April 16, 2015 7:56 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing PDF and MS Office files

+1

:)

PS: one more thing - please, tell your management that you will never 
ever successfully all real-world PDFs and cater for that fact in your 
requirements :-)



Re: Differentiating user search term in Solr

2015-04-16 Thread Steven White
I don't follow what the "f" parameter is.  Do you have a link where I can
read more about it?  I found this
https://wiki.apache.org/solr/HighlightingParameters and
https://wiki.apache.org/solr/SimpleFacetParameters but I'm not sure this is
what you mean (I'm not doing highlighting or faceting).

Thanks

Steve

On Thu, Apr 16, 2015 at 11:54 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 9:37 AM, Steven White wrote:
  What is term in the defType=term, do you mean the raw word term or
  something else?  Because I tried that too in two different ways:

 Oops.  I forgot that the term query parser (that's what term means --
 the name of the query parser) requires that you specify the field you
 are searching on, so that would be incomplete.  Try also setting the f
 parameter to the field that you want to search.  I will not be surprised
 if that doesn't work, though.

 Thanks,
 Shawn




Re: check If I am Still Leader

2015-04-16 Thread Erick Erickson
bq:  I don't use replication so why does it has to check who is the leader

Because the doc must be routed to the correct shard, and the shard leader
is the machine that coordinates the indexing for that shard.

I really question whether this is a fruitful course for you to take. What
specific problems are you trying to solve here? Because trying to take control
at this level really shouldn't be done unless and until you have a problem
that's causing you grief, it's just a waste of energy until then IMO.

Best,
Erick

On Thu, Apr 16, 2015 at 7:59 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 4/16/2015 7:42 AM, Adir Ben Ami wrote:
 I have not mentioned before that the index are always routed to specific 
 machine.
 Is there a way to avoid connectivity from the node to all other nodes?

 That capability has been added in Solr 5.1.0.

 https://issues.apache.org/jira/browse/SOLR-6832

 Thanks,
 Shawn



Re: Differentiating user search term in Solr

2015-04-16 Thread Shawn Heisey
On 4/16/2015 10:18 AM, Shawn Heisey wrote:
 On 4/16/2015 10:10 AM, Steven White wrote:
 I don't follow what the f parameter is.  Do you have a link where I can
 read more about it?  I found this
 https://wiki.apache.org/solr/HighlightingParameters and
 https://wiki.apache.org/solr/SimpleFacetParameters but im not sure this is
 what you mean (I'm not doing highlighting for faceting).
 It looks like this isn't going to work.  I just tried it on my index.

I filed an enhancement issue.  It might never happen, but it's in the
system.

https://issues.apache.org/jira/browse/SOLR-7410

Thanks,
Shawn



Re: Differentiating user search term in Solr

2015-04-16 Thread Steven White
Thanks for trying Shawn.

Looks like I have to escape the string on my client side (this isn't a
clean design and can lead to errors if not all reserved tokens are
escaped).

I hope folks from @dev are reading this and consider adding a parameter to
tell Solr the text is raw-text.

Steve
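
For a client that talks to Solr over plain HTTP, one option is a small helper that
mirrors what SolrJ's ClientUtils.escapeQueryChars does. A minimal sketch; the character
list is taken from the Lucene query parser syntax page Shawn points to elsewhere in
this discussion, so check it against the Solr version in use:

public final class QueryEscaper {

    // Lucene/Solr query syntax special characters (single-character forms of || and &&).
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&;/";

    public static String escape(String userText) {
        StringBuilder sb = new StringBuilder(userText.length() * 2);
        for (int i = 0; i < userText.length(); i++) {
            char c = userText.charAt(i);
            // Unlike ClientUtils.escapeQueryChars, this also escapes whitespace so the
            // whole string is treated literally; drop that branch if you don't want it.
            if (SPECIALS.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}

The escaped string still has to be URL-encoded before it goes into the request URL (for
example, the & character has to be sent as %26).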

On Thu, Apr 16, 2015 at 12:18 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 10:10 AM, Steven White wrote:
  I don't follow what the f parameter is.  Do you have a link where I can
  read more about it?  I found this
  https://wiki.apache.org/solr/HighlightingParameters and
  https://wiki.apache.org/solr/SimpleFacetParameters but im not sure
 this is
  what you mean (I'm not doing highlighting for faceting).

 It looks like this isn't going to work.  I just tried it on my index.

 To see the reasoning behind what I was suggesting, click here:

 https://cwiki.apache.org/confluence/display/solr/Other+Parsers

 And then click on Term Query Parser in the third column of the list at
 the top of that page.

 The syntax for the localparams on this one is {!term f=field}querytext
 ... so I was hoping that f would work as a URL parameter, but from the
 test I just did on Solr 4.9.1, that's not the case.

 Thanks,
 Shawn




Re: How can I temporarily detach node from SolrCloud?

2015-04-16 Thread Erick Erickson
bq: it down will either reduce your result set or cause queries to
return an error

Setting shards.tolerant=true will reduce your result set. If you don't set that
and all replicas of a shard are down, you'll get an error.

And indexing won't work if all the replicas for a shard are down.

Best,
Erick


On Thu, Apr 16, 2015 at 7:46 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 4/16/2015 8:27 AM, Oded Sofer wrote:
 How can I detach node from SolrCloud (temporarily for maintenance and such 
 and attach it back after some time). We are using SolrCloud 4.10.0; One 
 Collection, and Shard per node.
 The add-index is routed to specific machine base on our customize routing 
 logic (kind of hard-coded)

 I assume this is just one replica out of multiple ... if that's the
 case, just shut the node down, do your maintenance, and bring it back
 online.  SolrCloud will automatically make sure the index replica(s) on
 the node are brought up to date to match the others.

 If it's not one replica of multiple (that is, if it has the only copy of
 one or more shards), then shutting it down will either reduce your
 result set or cause queries to return an error, not sure which.

 Thanks,
 Shawn



Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
For MS Word documents, one common pattern I noticed across all the failed
documents is that they contain embedded images (like scanned signature
images; these documents are much like letterheads where someone scanned the
signature image and then embedded it into the document along with the
text).

For other documents which completed successfully, no images were present.
Just wondering if these are causing the issue.


Thanks & Regards
Vijay



On 16 April 2015 at 12:58, Vijaya Narayana Reddy Bhoomi Reddy 
vijaya.bhoomire...@whishworks.com wrote:

 Thanks Tim.

 I shall raise a Jira with the stack trace information.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:54, Allison, Timothy B. talli...@mitre.org wrote:

 This sounds like a Tika issue, let's move discussion to that list.

 If you are still having problems after you upgrade to Tika 1.8, please at
 least submit the stack traces (if you can) to the Tika jira.  We may be
 able to find a document that triggers that stack trace in govdocs1 or the
 slice of CommonCrawl that Julien Nioche contributed to our eval effort.

 Tika is not perfect and it will fail on some files, but we are always
 working to improve it.

 Best,

   Tim

 -Original Message-
 From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
 vijaya.bhoomire...@whishworks.com]
 Sent: Thursday, April 16, 2015 7:44 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Indexing PDF and MS Office files

 Thanks Allison.

 I tried with the mentioned changes. But still no luck. I am using the code
 from lucidworks site provided by Erick and now included the changes
 mentioned by you. But still the issue persists with a small percentage of
 documents (both PDF and MS Office documents) failing. Unfortunately, these
 documents are proprietary and client-confidential and hence I am not sure
 whether they can be uploaded into Jira.

 These files normally open in Adobe Reader and MS Office tools.

 Thanks  Regards
 Vijay


 On 16 April 2015 at 12:33, Allison, Timothy B. talli...@mitre.org
 wrote:

  I entirely agree with Erick -- it is best to isolate Tika in its own jvm
  if you can -- bad things can happen if you don't [1] [2].
 
  Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
  embedded documents/attachments, make sure to set the parser in the
  ParseContext before parsing:
 
  ParseContext context = new ParseContext();
  //add this line:
  context.set(Parser.class, _autoParser)
   InputStream input = new FileInputStream(file);
 
  Tika 1.8 is soon to be released.  If that doesn't fix your problems,
  please submit stacktraces (and docs, if possible) to the Tika jira, and
  we'll try to make the fixes.
 
  Cheers,
 
  Tim
 
  [1]
 
 http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
  [2]
 
 http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
  -Original Message-
  From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
  vijaya.bhoomire...@whishworks.com]
  Sent: Thursday, April 16, 2015 7:10 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Indexing PDF and MS Office files
 
  Erick,
 
  I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
  SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
  are getting parsed properly and indexed into Solr. However, a minority
 of
  them keep failing wither PDFParser or OfficeParser error.
 
  Not sure if this behaviour can be modified so that all the documents
 can be
  indexed. The business requirement we have is to index all the documents.
  However, if a small percentage of them fails, not sure what other ways
  exist to index them.
 
  Any help please?
 
 
  Thanks  Regards
  Vijay
 
 
 
  On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
 wrote:
 
   There's quite a discussion here:
   https://issues.apache.org/jira/browse/SOLR-7137
  
   But, I personally am not a huge fan of pushing all the work on to
 Solr,
  in
   a
   production environment the Solr server is responsible for indexing,
   parsing the
   docs through Tika, perhaps searching etc. This doesn't scale all that
  well.
  
   So an alternative is to use SolrJ with Tika, which is totally
 independent
   of
   what version of Tika is on the Solr server. Here's an example.
  
   http://lucidworks.com/blog/indexing-with-solrj/
  
   Best,
   Erick
  
   On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
   vijaya.bhoomire...@whishworks.com wrote:
Thanks everyone for the responses. Now I am able to index PDF
 documents
successfully. I have implemented manual extraction using Tika's
   AutoParser
and PDF functionality is working fine. However,  the error with
 some MS
office word documents still persist.
   
The error message is java.lang.IllegalArgumentException: This
  paragraph
   is
not the first one in the table which will 
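
Pulling together the SolrJ-plus-Tika route recommended in this thread (Erick's
lucidworks example and Tim's ParseContext tip), a minimal sketch follows; the server
URL and field names are assumptions, and on SolrJ 5.x HttpSolrClient replaces
HttpSolrServer:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        AutoDetectParser parser = new AutoDetectParser();
        File file = new File(args[0]);

        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);  // also parse embedded documents/attachments

        try (InputStream input = new FileInputStream(file)) {
            parser.parse(input, handler, metadata, context);
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getAbsolutePath());
        doc.addField("title", metadata.get("title"));
        doc.addField("text", handler.toString());
        solr.add(doc);
        solr.commit();
    }
}

Wrapping the parse in a try/catch per document also keeps one bad PDF or Word file from
aborting the whole batch, which matches the failure pattern reported above.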

Re: Indexing PDF and MS Office files

2015-04-16 Thread Charlie Hull

On 16/04/2015 12:53, Siegfried Goeschl wrote:

Hi Vijay,

I know the this road too well :-)

For PDF you can fallback to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exists as well (pdflib)


Here's some file extractors we built a while ago:
https://github.com/flaxsearch/flaxcode/tree/master/flax_filters
You might find them useful: they use a number of external programs 
including pdf2text and headless Open Office.


Cheers

Charlie


If you start command line tools from your JVM please have a look at
commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never
ever successfully all real-world PDFs and cater for that fact in your
requirements :-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Erick,

I tried indexing both ways - SolrJ / Tika's AutoParser and as well as
SolrCell's ExtractRequestHandler. Majority of the PDF and Word documents
are getting parsed properly and indexed into Solr. However, a minority of
them keep failing wither PDFParser or OfficeParser error.

Not sure if this behaviour can be modified so that all the documents
can be
indexed. The business requirement we have is to index all the documents.
However, if a small percentage of them fails, not sure what other ways
exist to index them.

Any help please?


Thanks  Regards
Vijay



On 15 April 2015 at 15:20, Erick Erickson erickerick...@gmail.com
wrote:


There's quite a discussion here:
https://issues.apache.org/jira/browse/SOLR-7137

But, I personally am not a huge fan of pushing all the work on to
Solr, in
a
production environment the Solr server is responsible for indexing,
parsing the
docs through Tika, perhaps searching etc. This doesn't scale all that
well.

So an alternative is to use SolrJ with Tika, which is totally
independent
of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
vijaya.bhoomire...@whishworks.com wrote:

Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's

AutoParser

and PDF functionality is working fine. However,  the error with some MS
office word documents still persist.

The error message is java.lang.IllegalArgumentException: This
paragraph

is

not the first one in the table which will eventually result in

Unexpected

RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

Upon some reading, it looks like its a bug with Tika 1.5 and seems to

have

been fixed with Tika 1.6 (

https://issues.apache.org/jira/browse/TIKA-1251 ).

I am new to Solr / Tika and hence wondering whether I can change the
Tika
library alone to v1.6 without impacting any of the libraries within
Solr
4.10.2? Please let me know your response and how to get away with this
issue.

Many thanks in advance.

Thanks  Regards
Vijay


On 15 April 2015 at 05:14, Shyam R shyam.reme...@gmail.com wrote:


Vijay,

You could try different excel files with different formats to rule out

the

issue is with TIKA version being used.

Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes trhodes...@gmail.com
wrote:


Perhaps the PDF is protected and the content can not be extracted?

i have an unverified suspicion that the tika shipped with solr 4.10.2

may

not support some/all office 2013 document formats.





On 4/14/2015 8:18 PM, Jack Krupansky wrote:


Try doing a manual extraction request directly to Solr (not via

SolrJ)

and

use the extractOnly option to see if the content is actually

extracted.


See:
https://cwiki.apache.org/confluence/display/solr/
Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files actually have the content as a bitmap image, so

no

text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy



vijaya.bhoomire...@whishworks.com wrote:

  Hi,


I am trying to index PDF and Microsoft Office files (.doc, .docx,

.ppt,

.pptx, .xlx, and .xlx) files into Solr. I am facing the following

issues.

Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server

configuration

that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.*

in

the
Solr Query console, metadata information is displayed properly.

However,

the PDF content field is empty. This is happening for all PDF files

I

have
tried. I have tried with some proprietary files, PDF eBooks etc.

Whatever

be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect

and

the
extracted content is visible in the query console. However, for

others, I

see the below error message during the indexing process.

*Exception in thread 

How can I temporarily detach node from SolrCloud?

2015-04-16 Thread Oded Sofer
How can I detach node from SolrCloud (temporarily for maintenance and such and 
attach it back after some time). We are using SolrCloud 4.10.0; One Collection, 
and Shard per node. 
The add-index is routed to specific machine base on our customize routing logic 
(kind of hard-coded) 



Re: Differentiating user search term in Solr

2015-04-16 Thread Steven White
Thanks Shawn.

I cannot use escapeQueryChars method because my app interacts with Solr via
REST.

The summary of your email is: clients must escape the search string to prevent
Solr from failing.

It would be a nice addition to Solr to provide a new query parameter that
tells it to treat the query text as literal text.  Doing so means you
remove the burden placed on clients to understand and escape reserved Solr
/ Lucene tokens.

Steve

On Wed, Apr 15, 2015 at 7:18 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/15/2015 3:54 PM, Steven White wrote:
  Hi folks,
 
  If a user types in the search box (without quotes): {!q.op=AND df=text
  solr sys and I take that text and build the URL like so:
 
 
 http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sysfl=id%2Cscore%2Ctitlewt=xmlindent=true
 
  This will fail with Expected identifier because it is not a valid Solr
  text.

 That isn't valid syntax for the lucene query parser ... the localparams
 are not closed (it would require a } character), and after the
 localparams there would need to be some additional text.

  My question is this: is there a flag I can send to Solr with the URL
  telling it to treat what's in q as raw text vs. having it to process it
  as a Solr syntax?  If not, than it means I have to escape all Solr
 reserved
  characters and words.  If so, where can I find the complete list?  Also,
  what happens when a new reserved characters or word is added to Solr down
  the road?  It means I have to upgrade my application too, which is
  something I would like to avoid.

 One way to treat the entire input as literal text is to use the terms
 query parser ... but that requires the localparams syntax, and I do not
 know exactly what is going to happen if you use a query string that
 itself is localparams syntax -- {! other params} ... so escaping is
 probably safer.


 https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser

 The other way to handle it is to escape every special character with a
 backslash.  The escapeQueryChars method in SolrJ is always kept up to
 date, and can escape every special character.


 http://lucene.apache.org/solr/4_10_3/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29

 The javadoc for that method points to the queryparser syntax for more
 info on characters that need escaping.  Scroll to the very end of this
 page:


 http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true

 That page lists || and && rather than just the single characters | and &
 ... the escapeQueryChars method in SolrJ will escape both characters, as
 it only works at the character level, not the string level.

 If you want the *spaces* in your query to be treated literally also, you
 must escape them too.  The escapeQueryChars method I've mentioned will
 NOT escape spaces.

 Note that this does not cover URL escaping -- the & character must be
 sent as %26 or the servlet container will treat it as a special
 character, before it even gets to Solr.

 Thanks,
 Shawn




RE: check If I am Still Leader

2015-04-16 Thread Adir Ben Ami
I have not mentioned before that the index is always routed to a specific 
machine.
Is there a way to avoid connectivity from the node to all other nodes? 



 From: adi...@hotmail.com
 To: solr-user@lucene.apache.org
 Subject: check If I am Still Leader
 Date: Thu, 16 Apr 2015 16:08:15 +0300
 
 
 Hi,
 
 I am using Solr 4.10.0 with tomcat and embedded Zookeeper.
 I use SolrCloud in my system.
 
 Each shard machine tries to reach/connect with other cluster machines in order 
 to index the document; it just checks if it is still the leader.
 I don't use replication, so why does it have to check who is the leader?
 How can I bypass this constraint and make my SolrCloud not use 
 ClusterStateUpdater.checkIfIamStillLeader when I am indexing?
 
 Thanks,
 Adir. 
   
  

1:M connectivity

2015-04-16 Thread Oded Sofer
Given that the index is always routed to a specific machine, is there a way to 
avoid connectivity from the node to all other nodes?
We are using Solr 4.10; the Add/Update Index uses the SolrCloud API and is always 
added to the node that gets the API request for add-index (i.e., we are sending the 
add-index to the appropriate node that should get it). 




Re: Differentiating user search term in Solr

2015-04-16 Thread Steven White
defType didn't work:


http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&defType=lucene

Gave me error:

org.apache.solr.search.SyntaxError: Expected identifier at pos 27
str='{!q.op=AND df=text solr sys'

Is my use of defType correct?

Steve

On Thu, Apr 16, 2015 at 9:15 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 4/16/2015 7:09 AM, Steven White wrote:
  I cannot use escapeQueryChars method because my app interacts with Solr
 via
  REST.
 
  The summary of your email is: client's must escape search string to
 prevent
  Solr from failing.
 
  It would be a nice addition to Solr to provide a new query parameter that
  tells it to treat the query text as literal text.  Doing so, means you
  remove the burden placed on clients to understand and escape reserved
 Solr
  / Lucene tokens.

 That's a good idea, although we might already have that.

 I wonder what happens if you include defType=term with your request?
 That works for edismax, it might work for other query parsers, at least
 on the q parameter.

 Thanks,
 Shawn




custom search component on solrcloud

2015-04-16 Thread Robust Links
Hi

Apologies for sending this again. I am trying to port my non-SolrCloud
custom search handler to a SolrCloud one. I have read the
WritingDistributedSearchComponents
http://wiki.apache.org/solr/WritingDistributedSearchComponents wiki page
and looked at the Terms and Query component code, but the control flow of
execution is still fuzzy (even given the “distributed algorithm”
description).

Concretely, I have a non-SolrCloud algorithm that, given a sequence of
tokens T, would

1- split T into single tokens

2- foreach token t_i

get the DocList for t_i by executing rb.req.getSearcher().getDocList in the
process() method of the custom search component

3- do some magic on the collection of doclists

My question is how can i

1) do the splitting (step 1 above) in a single shard, and

2) distribute the getDocList for each token t_i to all shards

3) wait till i have all the doclists from all shards, then

4) do something with the results, in the original calling shard (step 1
above).


Thank you for your help
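
For orientation, here is a bare skeleton of a distributed SearchComponent with the
hooks that map onto those four steps (method names as in Solr 4.10's SearchComponent;
the exact override set and the merge logic itself are assumptions to verify against
your version):

import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.handler.component.ShardRequest;

public class TokenDocListComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Step 1: split the incoming token sequence once, on the coordinating node.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Runs on each shard: fetch the per-token DocList from the local searcher
        // (rb.req.getSearcher()) and add it to rb.rsp so it travels back.
    }

    @Override
    public int distributedProcess(ResponseBuilder rb) throws IOException {
        // Step 2: on the coordinator, create ShardRequests so each shard runs
        // process() above; return the next stage or STAGE_DONE when finished.
        return ResponseBuilder.STAGE_DONE;
    }

    @Override
    public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
        // Step 3: called as each shard answers; collect the per-shard doclists here.
    }

    @Override
    public void finishStage(ResponseBuilder rb) {
        // Step 4: all shards have answered for this stage; do the merge "magic" and
        // add the combined result to rb.rsp on the original calling node.
    }

    @Override
    public String getDescription() {
        return "per-token doclist component (sketch)";
    }

    @Override
    public String getSource() {
        return null;
    }
}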


Re: Spurious _version_ conflict?

2015-04-16 Thread Chris Hostetter

: I notice that the expected value in the error message matches both what 
: I pass in and the index contents.  But the actual value in the error 
: message is different only in the last (low order) two digits.  
: Consistently.

what does your client code look like?  Are you sure you aren't being bit 
by a JSON parsing library that can't handle long values and winds up 
truncating them?

https://issues.apache.org/jira/browse/SOLR-6364



-Hoss
http://www.lucidworks.com/
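
A quick way to see the failure mode SOLR-6364 describes is to push a 19-digit value
through double, which is what a careless JSON parser effectively does (the sample
_version_ below is made up):

public class VersionPrecision {
    public static void main(String[] args) {
        long version = 1497506959566569473L;  // hypothetical _version_ from Solr
        double asDouble = version;             // double only carries 53 bits of mantissa
        long roundTripped = (long) asDouble;
        System.out.println(version);           // 1497506959566569473
        System.out.println(roundTripped);      // 1497506959566569472, low-order digits lost
    }
}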


Re: Differentiating user search term in Solr

2015-04-16 Thread Chris Hostetter

: The summary of your email is: client's must escape search string to prevent
: Solr from failing.
: 
: It would be a nice addition to Solr to provide a new query parameter that
: tells it to treat the query text as literal text.  Doing so, means you
: remove the burden placed on clients to understand and escape reserved Solr
: / Lucene tokens.

i'm a little lost as to what exactly you want to do here -- but i'm going 
to focus on your thesis statement here, and assume that you want to 
search on a literal piece of text, you don't want to have to worry 
about escaping any characters, and you don't want solr to treat any part of 
the query string as special.

the only way something like that works is if you only want to search a 
single field -- searching multiple fields, searching multiple clauses, 
etc... none of those types of options make sense in this context.

people have already mentioned the term parser -- which is fine if you 
want to search for exactly one literal term, but as a more general 
solution, what people usually want is the field parser -- which works 
better with TextFields in general...

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FieldQueryParser

Just like the comment you've seen about the term parser needing an f 
localparam to specify the field, the same is true for the field parser.  
But variable references make this trivial to specify -- instead of using 
the full {!field f=myfield}Foo Bar syntax in your q param, you can use 
an alternate param (qq is common in many examples) for the raw data from 
the user...

q={!field f=myfield v=$qq} & qq=whatever your user types


https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries


-Hoss
http://www.lucidworks.com/
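
In SolrJ terms, the suggestion above boils down to something like this sketch (the
field name and core URL are placeholders; on SolrJ 5.x use HttpSolrClient):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FieldParserExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/db");
        String userText = "{!q.op=AND df=text solr sys";   // raw user input, unescaped
        SolrQuery q = new SolrQuery();
        q.set("q", "{!field f=title v=$qq}");   // parser and field go in q ...
        q.set("qq", userText);                  // ... the raw text goes in qq
        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}

One caveat, which comes up later in the thread: the field parser analyzes the whole
value and searches it as a single phrase, so it is a literal-text match rather than a
set of AND'd terms.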


Re: Solr 5.x deployment in production

2015-04-16 Thread Steven White
Thanks Karl.

In my case, I have to deploy Solr on Windows, AIX, and Linux (all server
editions).  We are a WebSphere shop; moving away from it means I have to
deal with politics and culture.

For Windows, I cannot use NSSM, so I have to figure out a solution for managing
Solr (at least start-up and shutdown).  If anyone has experience in this area
(now that Solr is not in a WAS profile managed by Windows services) and can
share your experience, please do.  Thanks.

Steve

On Thu, Apr 16, 2015 at 3:49 PM, Karl Kildén karl.kil...@gmail.com wrote:

 I asked a very similar question recently. You should switch to using the
 package as is and forget that it contains a .war. The war is now an
 internal component. Also switch to the new script for startup etc.

 I have seen several disappointed users that disagree with this decision but
 I assume the project now has more freedom in the future and also more
 alignment and focus on one experience.

 I did my own thing with NSSM because we use windows and I am satisfied.

 On 16 April 2015 at 21:36, Steven White swhite4...@gmail.com wrote:

  Hi folks,
 
  With Solr 5.0, the WAR file is deprecated and I see Jetty is included
 with
  Solr.  What if I have my own Web server into which I need to deploy Solr,
  how do I go about doing this correctly without messing things up and
 making
  sure Solr works?  Or is this not recommended and Jetty is the way to go,
 no
  questions asked?
 
  Thanks
 
  Steve
 



Re: 1:M connectivity

2015-04-16 Thread Oded Sofer
Right, we are using that. 
The issue is the firewall setting needed for the cloud. We do not want to open 
all nodes to all other nodes. However, we found that an add-index to a specific 
node tries to access all other nodes even though we set it to index locally on that 
node only. 


On Apr 16, 2015 7:19 PM, Erick Erickson erickerick...@gmail.com wrote:

 You say the SolrCloud API. Not entirely sure what that is, do you 
 mean the post.jar tool? 

 Because to get much more scalable throughput, you probably want to use SolrJ 
 and 
 the CloudSolrServer class. That class takes a connection to Zookeeper and 
 does the right thing. 

 Best, 
 Erick 

 On Thu, Apr 16, 2015 at 7:19 AM, Oded Sofer odedso...@yahoo.com.invalid 
 wrote: 
  Given that the index are always routed to specific machine, is there a way 
  to avoid connectivity from the node to all other node. 
  We are using Solr 4.10; the Add/Update Index uses SolrCloud API and always 
  added to the node that get API request for add-index (i.e., we are sending 
  the add index to the appropriate node that should get it). 
  
  


Spurious _version_ conflict?

2015-04-16 Thread Reitzel, Charles
Hi All,

I have been getting intermittent 409 conflict responses to updates.  I check 
and double-check that the _version_ I am passing in matches the current value 
in the index.

I notice that the expected value in the error message matches both what I pass 
in and the index contents.  But the actual value in the error message is 
different only in the last (low order) two digits.   Consistently.

I noticed a similar report a while back:
http://lucene.472066.n3.nabble.com/Version-Conflict-on-Atomic-Update-td4083587.html

Any  thoughts?

Thanks,
Charlie

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*


Re: generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Vishal Swaroop
Thanks Kaushik & Erick.

Though I can populate uuid by using a combination of fields, I need to
change the type to string, else it throws "Invalid UUID String":
<field name="uuid" type="string" indexed="true" stored="true"
required="true" multiValued="false"/>

a) I will have ~80 million records and am wondering if performance might be
an issue.
b) So, during updates, can I still use the combination of fields, i.e. uuid?

On Thu, Apr 16, 2015 at 2:44 PM, Erick Erickson erickerick...@gmail.com
wrote:

 This seems relevant:


 http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid

 Best,
 Erick

 On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote:
  You seem to have defined the field, but not populating it in the query.
 Use
  a combination of fields to come up with a unique id that can be assigned
 to
  uuid. Does that make sense?
 
  Kaushik
 
  On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com
  wrote:
 
  How to generate uuid/ id (maybe in data-config.xml...) for table which
 do
  not have any primary key.
 
  Scenario :
  Using DIH I need to import data from database but table does not have
 any
  primary key
  I do have uuid defined in schema.xml and is
  field name=uuid type=uuid indexed=true stored=true
 required=true
  multiValued=false/
  uniqueKeyuuid/uniqueKey
 
  data-config.xml
  ?xml version=1.0 encoding=UTF-8 ?
  dataConfig
  dataSource
batchSize=2000
name=test
type=JdbcDataSource
driver=oracle.jdbc.OracleDriver
url=jdbc:oracle:thin:@ldap:
user=myUser
password=pwd/
  document
  entity name=test_entity
docRoot=true
dataSource=test
query=select name, age from test_user
  /entity
  /document
  /dataConfig
 
  Error : Document is missing mandatory uniqueKey field: uuid
 



Re: generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Vishal Swaroop
Just wondering if there is a way to generate uuid/ id in data-config
without using combination of fields in query...

data-config.xml
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource
  batchSize="2000"
  name="test"
  type="JdbcDataSource"
  driver="oracle.jdbc.OracleDriver"
  url="jdbc:oracle:thin:@ldap:"
  user="myUser"
  password="pwd"/>
<document>
<entity name="test_entity"
  docRoot="true"
  dataSource="test"
  query="select name, age from test_user">
</entity>
</document>
</dataConfig>

On Thu, Apr 16, 2015 at 3:18 PM, Vishal Swaroop vishal@gmail.com
wrote:

 Thanks Kaushik  Erick..

 Though I can populate uuid by using combination of fields but need to
 change the type to string else it throws Invalid UUID String
 field name=uuid type=string indexed=true stored=true
 required=true multiValued=false/

 a) I will have ~80 millions records and wondering if performance might be
 issue
 b) So, during update I can still use combination of fields i.e. uuid ?

 On Thu, Apr 16, 2015 at 2:44 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 This seems relevant:


 http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid

 Best,
 Erick

 On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote:
  You seem to have defined the field, but not populating it in the query.
 Use
  a combination of fields to come up with a unique id that can be
 assigned to
  uuid. Does that make sense?
 
  Kaushik
 
  On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com
  wrote:
 
  How to generate uuid/ id (maybe in data-config.xml...) for table which
 do
  not have any primary key.
 
  Scenario :
  Using DIH I need to import data from database but table does not have
 any
  primary key
  I do have uuid defined in schema.xml and is
  field name=uuid type=uuid indexed=true stored=true
 required=true
  multiValued=false/
  uniqueKeyuuid/uniqueKey
 
  data-config.xml
  ?xml version=1.0 encoding=UTF-8 ?
  dataConfig
  dataSource
batchSize=2000
name=test
type=JdbcDataSource
driver=oracle.jdbc.OracleDriver
url=jdbc:oracle:thin:@ldap:
user=myUser
password=pwd/
  document
  entity name=test_entity
docRoot=true
dataSource=test
query=select name, age from test_user
  /entity
  /document
  /dataConfig
 
  Error : Document is missing mandatory uniqueKey field: uuid
 





Solr 5.x deployment in production

2015-04-16 Thread Steven White
Hi folks,

With Solr 5.0, the WAR file is deprecated and I see Jetty is included with
Solr.  What if I have my own Web server into which I need to deploy Solr,
how do I go about doing this correctly without messing things up and making
sure Solr works?  Or is this not recommended and Jetty is the way to go, no
questions asked?

Thanks

Steve


Re: Solr 5.x deployment in production

2015-04-16 Thread Karl Kildén
I asked a very similar question recently. You should switch to using the
package as is and forget that it contains a .war. The war is now an
internal component. Also switch to the new script for startup etc.

I have seen several disappointed users that disagree with this decision but
I assume the project now has more freedom in the future and also more
alignment and focus on one experience.

I did my own thing with NSSM because we use windows and I am satisfied.

On 16 April 2015 at 21:36, Steven White swhite4...@gmail.com wrote:

 Hi folks,

 With Solr 5.0, the WAR file is deprecated and I see Jetty is included with
 Solr.  What if I have my own Web server into which I need to deploy Solr,
 how do I go about doing this correctly without messing things up and making
 sure Solr works?  Or is this not recommended and Jetty is the way to go, no
 questions asked?

 Thanks

 Steve



Re: Indexing PDF and MS Office files

2015-04-16 Thread Walter Underwood
Turning PDF back into a structured document is like trying to turn hamburger 
back into a cow.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. talli...@mitre.org wrote:

 +1 
 
 :)
 
 PS: one more thing - please, tell your management that you will never 
 ever successfully all real-world PDFs and cater for that fact in your 
 requirements :-)
 



SolrCloud Core Reload

2015-04-16 Thread Vincenzo D'Amore
Hi all,

I have a SolrCloud cluster with 3 servers and there are many cores.
Using the SolrCloud UI Admin Core page, if I execute a core optimize (or
reload), will all the cores in the cluster be optimized or reloaded, or
only the selected core?

Best regards,
Vincenzo


Re: Range facets in sharded search

2015-04-16 Thread Tomás Fernández Löbbe
This looks like a bug. The logic to merge range facets from shards seems to
only be merging counts, not the first level elements.
Could you create a Jira?

On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote:

 I am seeing some odd behavior with range facets across multiple
 shards. When querying each node directly with distrib=false, the facet
 returned matches what is expected. When doing the same query against the
 collection, where it spans the two shards, the facet "after" and "between" buckets
 are wrong.


 I can re-create a similar problem using the out of the box example scripts
 and data. I am running on Windows and tested both Solr 5.0.0 and 5.1.0.
 This is the steps to reproduce:


 c:\solr-5.1.0\solr -e cloud

 These are the selections I made:


 (specify 1-4 nodes) [2]: 2
 Please enter the port for node1 [8983]: 8983
 Please enter the port for node2 [7574]: 7574
 Please provide a name for your new collection: [gettingstarted]
 gettingstarted
 How many shards would you like to split gettingstarted into? [2] 2
 How many replicas per shard would you like to create? [2] 1
 Please choose a configuration ...  [data_driven_schema_configs]
 sample_techproducts_configs


 I then posted some of the sample XMLs:

 C:\solr-5.1.0\example\exampledocs java -Dc=gettingstarted -jar post.jar
 vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml,
 monitor2.xml,mp500.xml, sd500.xml


 This first query is against node1 with distrib=false:


 http://localhost:8983/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 There are 7 Results (results ommited).
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,0,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 This second query is against node2 with distrib=false:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 7 Results (one product does not have a price):
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,0],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:4,
 between:2}},


 Finally querying the entire collection:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 14 results (one without a price range):
 facet_ranges:{
   price:{
 counts:[
   0.0,2,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 Notice that both the after and the between are wrong here. The actual
 buckets do correctly represent the right values but I would expect
 between to be 5 and after to be 13.


 There appears to be a recently fixed issue (
 https://issues.apache.org/jira/browse/SOLR-6154) with range facet in
 distributed queries but it was related to buckets not always appearing with
 mincount=1 for the field. This looks like it is a different problem.


 Anyone have any suggestions or notice anything wrong with my query
 parameters? I can open a Jira ticket but wanted to run it by the larger
 audience first to see if I am missing anything obvious.


 Thanks,

 Will
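
For reference, the same repro query expressed through SolrJ, using plain set() calls so
the parameters match the URLs above exactly (the collection URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RangeFacetRepro {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/gettingstarted");
        SolrQuery q = new SolrQuery("*:*");
        q.set("defType", "edismax");
        q.set("q.op", "AND");
        q.setFacet(true);
        q.set("facet.range", "price");
        q.set("f.price.facet.range.start", "0.00");
        q.set("f.price.facet.range.end", "100.00");
        q.set("f.price.facet.range.gap", "20");
        q.set("f.price.facet.range.other", "all");   // asks for before/after/between
        // q.set("distrib", "false");                // uncomment to hit a single shard
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse().get("facet_counts"));
    }
}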



SolrJ Exceptions

2015-04-16 Thread Bryan Bende
I'm trying to identify the difference between an exception when Solr is in
a bad state/down vs. when it is up but an invalid request was made (maybe
some bad data sent in).

The JavaDoc for SolrRequest process() says:

@throws SolrServerException if there is an error on the Solr server
@throws IOException if there is a communication error

So I expected IOException when Solr was down, but it looks like it actually
throws a SolrServerException which has a cause of an IOException.

I'm also not sure how SolrException fits into all of this...

Is anyone familiar with when to generally expect these types of exceptions?

I'm interested in both cloud and stand-alone scenarios, and using Solr 5.0
or 5.1.

Thanks,

Bryan
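
One practical way to tell the cases apart is to branch on the exception type and its
cause. A rough sketch; the exact wrapping differs between HttpSolrClient/HttpSolrServer
and CloudSolrClient, so treat the mapping below as something to verify:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrException;

public class ExceptionTriage {
    public static void main(String[] args) {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        try {
            solr.query(new SolrQuery("undefined_field:[* TO *]"));
        } catch (SolrException e) {
            // Solr answered but rejected the request (e.g. HTTP 400 for bad input).
            System.out.println("Bad request, HTTP code " + e.code());
        } catch (SolrServerException e) {
            if (e.getCause() instanceof java.io.IOException) {
                // Connection refused / timeouts tend to show up here, wrapped,
                // rather than as a bare IOException.
                System.out.println("Solr looks down or unreachable: " + e.getCause());
            } else {
                System.out.println("Other server-side problem: " + e);
            }
        } catch (Exception e) {
            // Newer SolrClient methods also declare IOException for communication errors.
            System.out.println("Communication or other error: " + e);
        }
    }
}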


SolrCloud 4.8.0 upgrade

2015-04-16 Thread Vincenzo D'Amore
Hi All,

I have a SolrCloud cluster with 3 servers. I would like to use stats.facet,
but this feature is available only if I upgrade to 4.10.

May I simply redeploy the new SolrCloud version in Tomcat, or should I reload all
the documents?
Are there other drawbacks?

Best regards,
Vincenzo


Re: 5.1 'unique' facet function / calcDistinct

2015-04-16 Thread Yonik Seeley
Thanks for the feedback Levan!
Could you open a JIRA issue for unique() on numeric/date fields?
We don't yet have explicit numeric support for unique() and I think
some changes in Lucene 5 broke treating these fields as strings (i.e.
the ability to retrieve ords).

-Yonik


On Thu, Apr 16, 2015 at 7:46 AM, levanDev levandev9...@gmail.com wrote:
 Hello,

 We are looking at a couple of options for using Solr to dynamically calculate
 unique values per field. In testing out Solr 5.1, I've been using the
 unique() facet function:

 http://yonik.com/solr-facet-functions/

 Overall, loving the JSON Facet API, especially the sub-faceting thus far.

 Here's my two part question:

 I. When I use the unique aggregation function on a string field
 (uniqueValues:'unique(myStringField)'), it works as expected, returns the
 number of unique fields. However when I pass in an int -- or date -- field
 (uniqueValues:'unique(myIntField)') the resulting count is 0. The cause
 might be something else, but if it can be replicated by another user, would
 be great to discuss the unique function further -- in our current use-case,
 we have a field where under 20 unique values are present but the values are
 ints.

 II. Is there a way to use the stats.calcdistinct functionality and only
 return the countDistinct portion of the response and not the full list of
 distinct values -- as provided in the distinctValues portion of the
 response. In a field with high cardinality the response size becomes too
 large.

 If there is no such option, could someone point me in the right direction
 for implementing a custom solution?

 Thank you for your time,
 Levan
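
For reference, the kind of request being discussed can be sent from SolrJ by setting
json.facet directly; the field names below are placeholders and the unique() syntax
follows the blog post linked above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UniqueFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);  // only the facet block is needed
        // One unique() per field; the string field works, the int/date case is the one
        // reported above as returning 0.
        q.set("json.facet",
              "{ uniqueStrings : 'unique(myStringField)', uniqueInts : 'unique(myIntField)' }");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse().get("facets"));
    }
}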


Re: Differentiating user search term in Solr

2015-04-16 Thread Steven White
Hi Hoss,

Maybe I'm missing something, but I tried this and got 1 hit:


http://localhost:8983/solr/db/select?q=title:(Apache%20Solr%20Notes)&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND

Than I tried this and got 0 hit:


http://localhost:8983/solr/db/select?q={!field%20f=title%20v=$qq}&qq=Apache%20Solr%20Notes&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND

It looks to me that f with qq is doing a phrase search, and that's not what I
want.  The data in the field title is "Apache Solr Release Notes".

I looked over the links you provided and tried out the examples, in each
case if the user-typed-text contains any reserved characters, it will fail
with a syntax error (the exception is when I used f and qq but like I
said, that gave me 0 hit).

If you can give me a concrete example, please do.  My need is to pass to
Solr the text "Apache: Solr Notes" (without quotes) and get a hit as if I
passed "Apache\: Solr Notes"?

Thanks

Steve

On Thu, Apr 16, 2015 at 5:49 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : The summary of your email is: client's must escape search string to
 prevent
 : Solr from failing.
 :
 : It would be a nice addition to Solr to provide a new query parameter that
 : tells it to treat the query text as literal text.  Doing so, means you
 : remove the burden placed on clients to understand and escape reserved
 Solr
 : / Lucene tokens.

 i'm a little lost as to what exactly you want to do here -- but i'm going
 to focus on your thesis statement here, and assume that you want to
 search on a literal piece of text and you don't want to have to worry
 about escaping any characters and you don't wantsolr to treat any part of
 the query string as special.

 the only way something like that works is if you only want to search a
 single field -- searching multiple fields, searching multiple clauses,
 etc... none of those types of options make sense in this context.

 people have already mentioned the term parser -- which is fine ifyou
 want to serach for exactly one literal term, but as a more generally
 solution, what people usualy want, is the field parser -- which works
 better with TextFields in general...


 https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FieldQueryParser

 Just like the comment you've seen about the term parser needing an f
 localparam to specify the field, the same is true for the field parser.
 but variable refrences make this trivial to specify -- instead of using
 the full {!field f=myfield}Foo Bar syntax in your q param, you can use
 an alternate param (qq is common in many examples) for the raw data from
 the user...

 q={!field f=myfield v=$qq}  qq=whatever your usertypes



 https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries


 -Hoss
 http://www.lucidworks.com/



Re: Range facets in sharded search

2015-04-16 Thread Tomás Fernández Löbbe
Should be fixed in 5.2. See https://issues.apache.org/jira/browse/SOLR-7412

On Thu, Apr 16, 2015 at 3:18 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 This looks like a bug. The logic to merge range facets from shards seems
 to only be merging counts, not the first level elements.
 Could you create a Jira?

 On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote:

 I am seeing some odd behavior with range facets across multiple
 shards. When querying each node directly with distrib=false, the facets
 returned match what is expected. When running the same query against the
 collection so that it spans the two shards, the facet "after" and "between"
 buckets are wrong.


 I can re-create a similar problem using the out-of-the-box example
 scripts and data. I am running on Windows and tested both Solr 5.0.0 and
 5.1.0. These are the steps to reproduce:


 c:\solr-5.1.0\solr -e cloud

 These are the selections I made:


 (specify 1-4 nodes) [2]: 2
 Please enter the port for node1 [8983]: 8983
 Please enter the port for node2 [7574]: 7574
 Please provide a name for your new collection: [gettingstarted]
 gettingstarted
 How many shards would you like to split gettingstarted into? [2] 2
 How many replicas per shard would you like to create? [2] 1
 Please choose a configuration ...  [data_driven_schema_configs]
 sample_techproducts_configs


 I then posted some of the sample XMLs:

 C:\solr-5.1.0\example\exampledocs> java -Dc=gettingstarted -jar post.jar
 vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml,
 monitor2.xml, mp500.xml, sd500.xml


 This first query is against node1 with distrib=false:


 http://localhost:8983/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 There are 7 results (results omitted).
 "facet_ranges":{
   "price":{
     "counts":[
       "0.0",1,
       "20.0",0,
       "40.0",0,
       "60.0",0,
       "80.0",1],
     "gap":20.0,
     "start":0.0,
     "end":100.0,
     "before":0,
     "after":5,
     "between":2}},


 This second query is against node2 with distrib=false:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 7 results (one product does not have a price):
 "facet_ranges":{
   "price":{
     "counts":[
       "0.0",1,
       "20.0",0,
       "40.0",0,
       "60.0",1,
       "80.0",0],
     "gap":20.0,
     "start":0.0,
     "end":100.0,
     "before":0,
     "after":4,
     "between":2}},


 Finally querying the entire collection:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 14 results (one without a price range):
 "facet_ranges":{
   "price":{
     "counts":[
       "0.0",2,
       "20.0",0,
       "40.0",0,
       "60.0",1,
       "80.0",1],
     "gap":20.0,
     "start":0.0,
     "end":100.0,
     "before":0,
     "after":5,
     "between":2}},


 Notice that both the "after" and the "between" values are wrong here. The
 actual buckets do correctly represent the right values, but I would expect
 "between" to be 5 and "after" to be 13.


 There appears to be a recently fixed issue (
 https://issues.apache.org/jira/browse/SOLR-6154) with range facets in
 distributed queries, but it was related to buckets not always appearing with
 mincount=1 for the field. This looks like it is a different problem.


 Does anyone have any suggestions, or notice anything wrong with my query
 parameters? I can open a Jira ticket but wanted to run it by the larger
 audience first to see if I am missing anything obvious.


 Thanks,

 Will
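
 As a rough sanity check (assuming the expected distributed behavior is a
 simple per-shard sum of the before/after/between values): before = 0 + 0 = 0,
 between = 2 + 2 = 4, and after = 5 + 4 = 9, whereas the collection-wide
 response above simply repeats one shard's before=0 / between=2 / after=5.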





generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Vishal Swaroop
How can I generate a uuid/id (maybe in data-config.xml...) for a table which does
not have any primary key?

Scenario :
Using DIH I need to import data from a database, but the table does not have any
primary key.
I do have uuid defined in schema.xml as:
<field name="uuid" type="uuid" indexed="true" stored="true" required="true"
multiValued="false"/>
<uniqueKey>uuid</uniqueKey>

data-config.xml
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource
  batchSize="2000"
  name="test"
  type="JdbcDataSource"
  driver="oracle.jdbc.OracleDriver"
  url="jdbc:oracle:thin:@ldap:"
  user="myUser"
  password="pwd"/>
<document>
<entity name="test_entity"
  docRoot="true"
  dataSource="test"
  query="select name, age from test_user">
</entity>
</document>
</dataConfig>

Error : Document is missing mandatory uniqueKey field: uuid


Re: generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Kaushik
You seem to have defined the field, but you are not populating it in the query.
Use a combination of fields to come up with a unique id that can be assigned to
uuid. Does that make sense?

Kaushik
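
For example, one very rough way to do that inside the DIH query itself (a
sketch only; it assumes the combination of name and age happens to be unique
in test_user, which may not hold for real data, and that the uuid field
accepts arbitrary strings -- solr.UUIDField only accepts real UUID values, so
a plain string field type may be needed) is to alias a concatenation of
columns to uuid:

query="select name || '_' || age as uuid, name, age from test_user"

DIH maps result columns to fields by name, so the aliased column would
populate the uniqueKey field for each imported row.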

On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com
wrote:

 How to generate uuid/ id (maybe in data-config.xml...) for table which do
 not have any primary key.

 Scenario :
 Using DIH I need to import data from database but table does not have any
 primary key
 I do have uuid defined in schema.xml and is
 <field name="uuid" type="uuid" indexed="true" stored="true" required="true"
 multiValued="false"/>
 <uniqueKey>uuid</uniqueKey>

 data-config.xml
 <?xml version="1.0" encoding="UTF-8" ?>
 <dataConfig>
 <dataSource
   batchSize="2000"
   name="test"
   type="JdbcDataSource"
   driver="oracle.jdbc.OracleDriver"
   url="jdbc:oracle:thin:@ldap:"
   user="myUser"
   password="pwd"/>
 <document>
 <entity name="test_entity"
   docRoot="true"
   dataSource="test"
   query="select name, age from test_user">
 </entity>
 </document>
 </dataConfig>

 Error : Document is missing mandatory uniqueKey field: uuid



Re: generate uuid/ id for table which do not have any primary key

2015-04-16 Thread Erick Erickson
This seems relevant:

http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid

Best,
Erick
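
In case it helps, the gist of the approach in that link (an untested sketch,
reusing the uuid field name from this thread) is to let an update processor
generate the value at index time via solrconfig.xml:

<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">uuid</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The DIH handler then needs to run that chain, e.g. <str
name="update.chain">uuid</str> in the /dataimport handler's defaults. Note
that a randomly generated UUID means re-running the import adds new documents
rather than replacing existing ones.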

On Thu, Apr 16, 2015 at 11:38 AM, Kaushik kaushika...@gmail.com wrote:
 You seem to have defined the field, but not populating it in the query. Use
 a combination of fields to come up with a unique id that can be assigned to
 uuid. Does that make sense?

 Kaushik

 On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop vishal@gmail.com
 wrote:

 How to generate uuid/ id (maybe in data-config.xml...) for table which do
 not have any primary key.

 Scenario :
 Using DIH I need to import data from database but table does not have any
 primary key
 I do have uuid defined in schema.xml and is
 <field name="uuid" type="uuid" indexed="true" stored="true" required="true"
 multiValued="false"/>
 <uniqueKey>uuid</uniqueKey>

 data-config.xml
 <?xml version="1.0" encoding="UTF-8" ?>
 <dataConfig>
 <dataSource
   batchSize="2000"
   name="test"
   type="JdbcDataSource"
   driver="oracle.jdbc.OracleDriver"
   url="jdbc:oracle:thin:@ldap:"
   user="myUser"
   password="pwd"/>
 <document>
 <entity name="test_entity"
   docRoot="true"
   dataSource="test"
   query="select name, age from test_user">
 </entity>
 </document>
 </dataConfig>

 Error : Document is missing mandatory uniqueKey field: uuid