Re: Indexing speed reduced significantly with OCR

2017-03-27 Thread Zheng Lin Edwin Yeo
Yes, the sample document sizes are not very big. Also, the sample set is a
mixture of documents that contain inline images and documents that are
already searchable (text extractable without OCR).

I suppose only those documents which require OCR will slow down the
indexing? That would explain why the total average slowdown is only about
10 times.
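Edwin's averaging argument can be made concrete with a quick weighted-mean sketch (the 9% / 100x figures below are illustrative assumptions, not numbers from the thread):

```python
def average_slowdown(ocr_fraction, ocr_cost):
    """Corpus-wide slowdown when only a fraction of documents pay the
    OCR cost and the rest index at normal speed (cost 1x)."""
    return (1 - ocr_fraction) * 1 + ocr_fraction * ocr_cost

# If ~9% of the documents need OCR and OCR is ~100x slower per document,
# the average over the whole corpus comes out near 10x:
print(average_slowdown(0.09, 100))
```

So a roughly 10x overall slowdown is consistent with Phil's 100x per-document estimate if only a modest share of the corpus actually needs OCR.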

Regards,
Edwin


On 28 March 2017 at 12:06, Phil Scadden  wrote:

> Only by 10? You must have quite small documents. OCR is an extremely
> expensive process; indexing is trivial by comparison. For the quite large
> documents I am working with, OCR can be 100 times slower than indexing a PDF
> that is searchable (text extractable without OCR).
>
> -Original Message-
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> Sent: Tuesday, 28 March 2017 4:13 p.m.
> To: solr-user@lucene.apache.org
> Subject: Indexing speed reduced significantly with OCR
>
> Hi,
>
> Does the indexing speed of Solr reduce significantly when we use
> Tesseract OCR to extract scanned inline images from PDFs?
>
> I found that after I implemented the solution to extract those scanned
> images from the PDFs, indexing is now roughly 10 times slower.
>
> I'm using Solr 6.4.2, and Tika App 1.1.4.
>
> Regards,
> Edwin
> Notice: This email and any attachments are confidential and may not be
> used, published or redistributed without the prior written consent of the
> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
> received in error please destroy and immediately notify GNS Science. Do not
> copy or disclose the contents.
>


RE: Indexing speed reduced significantly with OCR

2017-03-27 Thread Phil Scadden
Only by 10? You must have quite small documents. OCR is an extremely
expensive process; indexing is trivial by comparison. For the quite large
documents I am working with, OCR can be 100 times slower than indexing a PDF
that is searchable (text extractable without OCR).

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Tuesday, 28 March 2017 4:13 p.m.
To: solr-user@lucene.apache.org
Subject: Indexing speed reduced significantly with OCR

Hi,

Does the indexing speed of Solr reduce significantly when we use
Tesseract OCR to extract scanned inline images from PDFs?

I found that after I implemented the solution to extract those scanned
images from the PDFs, indexing is now roughly 10 times slower.

I'm using Solr 6.4.2, and Tika App 1.1.4.

Regards,
Edwin


Indexing speed reduced significantly with OCR

2017-03-27 Thread Zheng Lin Edwin Yeo
Hi,

Does the indexing speed of Solr reduce significantly when we use
Tesseract OCR to extract scanned inline images from PDFs?

I found that after I implemented the solution to extract those scanned
images from the PDFs, indexing is now roughly 10 times slower.

I'm using Solr 6.4.2, and Tika App 1.1.4.

Regards,
Edwin


Re: OCR not working occasionally

2017-03-27 Thread Zheng Lin Edwin Yeo
I have found this solution in Stackoverflow from Tim Allison to be working.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin

On 19 March 2017 at 19:47, Zheng Lin Edwin Yeo  wrote:

> This is my settings in the PDFParser.properties file
> under tika-parsers-1.13.jar
>
> enableAutoSpace true
> extractAnnotationText true
> sortByPosition false
> suppressDuplicateOverlappingText false
> extractAcroFormContent true
> extractInlineImages true
> extractUniqueInlineImagesOnly true
> checkExtractAccessPermission false
> allowExtractionForAccessibility true
> ifXFAExtractOnlyXFA false
> catchIntermediateIOExceptions true
>
> Regards,
> Edwin
>
>
> On 19 March 2017 at 09:08, Zheng Lin Edwin Yeo 
> wrote:
>
>> Hi Rick,
>>
>> Thanks for your reply.
>> I saw this error message for the file which has a failure.
>> Am I able to index such files in the same indexing threads as the other
>> files which store text as an image?
>>
>>
>> 2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
>> start commit{,optimize=false,openSearcher=true,waitSearcher=true,e
>> xpungeDeletes=false,softCommit=false,prepareCommit=false}
>> 2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrIndexWriter Calling
>> setCommitData with IW:org.apache.solr.update.SolrIndexWriter@2330f07c
>> 2017-03-19 01:02:26.610 ERROR (updateExecutor-2-thread-4-processing-n:
>> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
>> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
>> x:collection1_shard1_replica2] o.a.s.u.SolrCmdDistributor
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://192.168.99.1:8984/solr/
>> collection1_shard1_replica1: Expected mime type application/octet-stream
>> but got text/html. 
>> Error 404
>> HTTP ERROR: 404
>> Problem accessing /solr/collection1_shard1_replica1/update. Reason:
>> Not Found
>>
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMeth
>> od(HttpSolrClient.java:578)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(Htt
>> pSolrClient.java:279)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(Htt
>> pSolrClient.java:268)
>> at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient
>> .request(ConcurrentUpdateSolrClient.java:430)
>> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
>> at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdD
>> istributor.java:293)
>> at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(
>> SolrCmdDistributor.java:282)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at com.codahale.metrics.InstrumentedExecutorService$Instrumente
>> dRunnable.run(InstrumentedExecutorService.java:176)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolE
>> xecutor.lambda$execute$0(ExecutorUtil.java:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>>
>> 2017-03-19 01:02:26.657 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.s.SolrIndexSearcher
>> Opening [Searcher@77e108d5[collection1_shard1_replica2] main]
>> 2017-03-19 01:02:26.658 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
>> end_commit_flush
>> 2017-03-19 01:02:26.658 INFO  (searcherExecutor-16-thread-1-processing-n:
>> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
>> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
>> x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener
>> QuerySenderListener sending requests to 
>> Searcher@77e108d5[collection1_shard1_replica2]
>> main{ExitableDirectoryReader(UninvertingDirectoryReader(Unin
>> verting(_0(6.4.2):C3)))}
>> 2017-03-19 01:02:26.658 INFO  (searcherExecutor-16-thread-1-processing-n:
>> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
>> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
>> x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener
>> QuerySenderListener done.
>> 2017-03-19 01:02:26.659 INFO  (searcherExecutor-16-thread-1-processing-n:
>> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
>> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
>> x:collection1_shard1_replica2] o.a.s.c.SolrCore
>> [collection1_shard1_replica2] Registered new searcher Searcher@77e108d5
>> [collection1_shard1_replica2] 

Closed connection issue while doing dataimport

2017-03-27 Thread santosh sidnal
Hi All,

I am facing a closed-connection issue while running the dataimporter. Any
solution to this? The stack trace is below:


[3/27/17 8:54:41:399 CDT] 00b4 OracleDataSto >  findMappingClass for :
Entry
 java.sql.SQLRecoverableException: Closed
Connection
at
oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3640)
at
oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3680)
at
oracle.jdbc.OracleConnectionWrapper.commit(OracleConnectionWrapper.java:140)
at
com.ibm.ws.rsadapter.jdbc.WSJdbcConnection.commit(WSJdbcConnection.java:1113)
at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:432)
at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:421)
at
com.ibm.commerce.solr.handler.SchemaJdbcDataSource.close(SchemaJdbcDataSource.java:289)
at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:294)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:283)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)

[3/27/17 8:54:41:399 CDT] 00b4 OracleDataSto <  findMappingClass return
 Exit
 class
com.ibm.websphere.ce.cm.StaleConnectionException
[3/27/17 8:54:41:401 CDT] 00b4 StaleConnecti 3   The stack trace for
the staleConn is:
 java.sql.SQLRecoverableException: Closed
Connection
at
oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3640)
at
oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3680)
at
oracle.jdbc.OracleConnectionWrapper.commit(OracleConnectionWrapper.java:140)
at
com.ibm.ws.rsadapter.jdbc.WSJdbcConnection.commit(WSJdbcConnection.java:1113)
at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:432)
at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:421)
at
com.ibm.commerce.solr.handler.SchemaJdbcDataSource.close(SchemaJdbcDataSource.java:289)
at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:294)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:283)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)

[3/27/17 8:54:41:401 CDT] 00b4 GenericDataSt <  mapExceptionHelper:
Mapping was done returning: Exit

 com.ibm.websphere.ce.cm.StaleConnectionException: Closed Connection
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
at
com.ibm.websphere.rsadapter.GenericDataStoreHelper.mapExceptionHelper(GenericDataStoreHelper.java:620)
at
com.ibm.websphere.rsadapter.GenericDataStoreHelper.mapException(GenericDataStoreHelper.java:682)
at com.ibm.ws.rsadapter.AdapterUtil.mapException(AdapterUtil.java:2112)
at com.ibm.ws.rsadapter.jdbc.WSJdbcUtil.mapException(WSJdbcUtil.java:1047)
at
com.ibm.ws.rsadapter.jdbc.WSJdbcConnection.commit(WSJdbcConnection.java:1151)
at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:432)
at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:421)
at
com.ibm.commerce.solr.handler.SchemaJdbcDataSource.close(SchemaJdbcDataSource.java:289)
at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:294)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:283)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:464)
 Begin backtrace for Nested Throwables
java.sql.SQLRecoverableException: Closed Connection
at
oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3640)
at
oracle.jdbc.driver.PhysicalConnection.commit(PhysicalConnection.java:3680)
at
oracle.jdbc.OracleConnectionWrapper.commit(OracleConnectionWrapper.java:140)
at
com.ibm.ws.rsadapter.jdbc.WSJdbcConnection.commit(WSJdbcConnection.java:1113)
at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:432)
at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:421)
at
com.ibm.commerce.solr.handler.SchemaJdbcDataSource.close(SchemaJdbcDataSource.java:289)
at

Re: AW: Newbie in Solr

2017-03-27 Thread Shawn Heisey
On 3/27/2017 1:35 PM, Ercan Karadeniz wrote:
> is my understanding correct that when I use the "managed-schema" file for
> the Solr configuration, it is NOT running in schemaless mode?

Impossible to say from the info provided.

The managed schema is required for schemaless mode, but *all* example
configurations shipping with version 5.5 or newer are using the managed
schema, even those that are not schemaless.

If you're using the classic schema factory and a file named schema.xml,
then you can be sure that schemaless mode is either not present or that it
will not function properly.
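For readers who want to pin this down in config: a solrconfig.xml fragment along these lines (illustrative, not taken from the thread) selects the classic, hand-edited schema.xml and rules schemaless out:

```xml
<!-- solrconfig.xml: pin Solr to the classic, hand-edited schema.xml -->
<schemaFactory class="ClassicIndexSchemaFactory"/>

<!-- Schemaless mode requires the managed factory instead, e.g.: -->
<!--
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
-->
```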

Thanks,
Shawn



Re: Licensing issue advice for Solr.

2017-03-27 Thread Shawn Heisey
On 3/24/2017 11:53 AM, russell.lemas...@comcast.net wrote:
> I'm just getting started with Solr (6.4.2) and am trying to get
> approval for usage in my workplace. I know that the product in general
> is licensed as Apache 2.0, but unfortunately there are packages
> included in the build that are considered "non-permissive" by my
> company and as such, means that I am having trouble getting things
> approved. It appears that the vast majority of the licensing issues
> are within the contrib directory. I know these provide significant
> functionality for Solr, but I was wondering if there is an official
> build that contains just the Solr and Lucene server distribution
> (minus demos and contrib). Some of the packages are dual licensed so I
> am able to deal with that by selecting which we wish to use, but there
> are some that are either not licensed at all or are only
> non-permissive (ie: not Apache, BSD, MIT, etc.) like GPL, CDDL, etc. 

The big questions, which Hoss already mentioned: What modules in Solr do
you need where the license is unacceptable, where are you looking to
confirm that the unacceptable license applies, and why are those
particular licenses unacceptable?

If something is included with Solr, then it's almost guaranteed that one
of the licenses for it will be compatible with the Apache 2.0 license,
and perfectly acceptable to use in a commercial setting.  The Apache
Software Foundation takes licenses seriously, and the Lucene/Solr
project is no exception.

The GPL is an example of something that is not compatible with the
Apache license.  This means that if something is ONLY licensed under the
GPL, including it with Solr is not allowed, and we need to remove it. 
Some of the libraries used by Solr's dependencies are licensed under the
LGPL, which IS compatible with Apache 2.0.

The CDDL is mentioned in Apache's legal area as an acceptable license
for binary inclusion in an Apache project.  It will not conflict with
the Apache license.  This is particularly important with Solr, because
if you're not OK with the CDDL, you basically can't run Solr at all. 
The Java Servlet API is licensed CDDL.  This API is necessary in order
to create a servlet application, which is what Solr is.

https://www.apache.org/legal/resolved.html#category-b

Thanks,
Shawn



Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
This update seems suspicious: the adds with the same id look like a
closure issue in the retry.

---
solr1_1 | 2017-03-27 20:19:12.397 INFO (qtp575335780-17) [c:goseg s:shard24
r:core_node12 x:goseg_shard24_replica2] o.a.s.u.p.LogUpdateProcessorFactory
[goseg_shard24_replica2] webapp=/solr path=/update
params={update.distrib=FROMLEADER&distrib.from=http://172.17.0.10:8983/solr/goseg_shard24_replica1/&min_rf=3&wt=javabin&version=2}{add=[dev_list_segmentation_test_76661_recipients!batch4...@x.com
(1563055570139742208), dev_list_segmentation_test_76661_recipients!
batch4...@x.com (1563055570141839360),
dev_list_segmentation_test_76661_recipients!batch4...@x.com
(1563055570141839361), dev_list_segmentation_test_76661_recipients!
batch4...@x.com (1563055570142887936),
dev_list_segmentation_test_76661_recipients!batch4...@x.com
(1563055570143936512), dev_list_segmentation_test_76661_recipients!
batch4...@x.com (1563055570143936513),
dev_list_segmentation_test_76661_recipients!batch4...@x.com
(1563055570143936514), dev_list_segmentation_test_76661_recipients!
batch4...@x.com (1563055570143936515),
dev_list_segmentation_test_76661_recipients!batch4...@x.com
(1563055570144985088), dev_list_segmentation_test_76661_recipients!
batch4...@x.com (1563055570144985089)]} 0 23



On Mon, Mar 27, 2017 at 3:04 PM Shawn Feldman 
wrote:

> Here is the solr log of our test node restarting
>
> https://s3.amazonaws.com/uploads.hipchat.com/17705/1138911/fvKS3t5uAnoi0pP/solrlog.txt
>
>
>
> On Mon, Mar 27, 2017 at 2:10 PM Shawn Feldman 
> wrote:
>
> Ercan, I think you responded to the wrong thread
>
> On Mon, Mar 27, 2017 at 2:02 PM Ercan Karadeniz <
> ercan_karade...@hotmail.com> wrote:
>
> 6.4.2 (latest available) or shall I use another one for familiarization
> purposes?
>
>
> 
> From: Alexandre Rafalovitch
> Sent: Monday, 27 March 2017 21:28
> To: solr-user
> Subject: Re: losing records during solr updates
>
> What version of Solr is it?
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
>
>
>
> On 27 March 2017 at 15:25, Shawn Feldman  wrote:
> > When we restart solr on a leader node while we are doing updates, we've
> > noticed that some small percentage of data is lost.  maybe 9 records out
> of
> > 1k.  Updating using min_rf=3 or full quorum seems to resolve this since
> our
> > rf = 3.  Updates then seem to only succeed when all nodes are back up.
> Why
> > would we see record loss during a node restart?  I assumed the
> transaction
> > log would get replayed.  We have a 4 node cluster with 24 shards.
> >
> > -shawn
>
>


Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
Here is the solr log of our test node restarting
https://s3.amazonaws.com/uploads.hipchat.com/17705/1138911/fvKS3t5uAnoi0pP/solrlog.txt



On Mon, Mar 27, 2017 at 2:10 PM Shawn Feldman 
wrote:

> Ercan, I think you responded to the wrong thread
>
> On Mon, Mar 27, 2017 at 2:02 PM Ercan Karadeniz <
> ercan_karade...@hotmail.com> wrote:
>
> 6.4.2 (latest available) or shall I use another one for familiarization
> purposes?
>
>
> 
> From: Alexandre Rafalovitch
> Sent: Monday, 27 March 2017 21:28
> To: solr-user
> Subject: Re: losing records during solr updates
>
> What version of Solr is it?
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
>
>
>
> On 27 March 2017 at 15:25, Shawn Feldman  wrote:
> > When we restart solr on a leader node while we are doing updates, we've
> > noticed that some small percentage of data is lost.  maybe 9 records out
> of
> > 1k.  Updating using min_rf=3 or full quorum seems to resolve this since
> our
> > rf = 3.  Updates then seem to only succeed when all nodes are back up.
> Why
> > would we see record loss during a node restart?  I assumed the
> transaction
> > log would get replayed.  We have a 4 node cluster with 24 shards.
> >
> > -shawn
>
>


Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
Ercan, I think you responded to the wrong thread

On Mon, Mar 27, 2017 at 2:02 PM Ercan Karadeniz 
wrote:

> 6.4.2 (latest available) or shall I use another one for familiarization
> purposes?
>
>
> 
> From: Alexandre Rafalovitch
> Sent: Monday, 27 March 2017 21:28
> To: solr-user
> Subject: Re: losing records during solr updates
>
> What version of Solr is it?
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
>
>
>
> On 27 March 2017 at 15:25, Shawn Feldman  wrote:
> > When we restart solr on a leader node while we are doing updates, we've
> > noticed that some small percentage of data is lost.  maybe 9 records out
> of
> > 1k.  Updating using min_rf=3 or full quorum seems to resolve this since
> our
> > rf = 3.  Updates then seem to only succeed when all nodes are back up.
> Why
> > would we see record loss during a node restart?  I assumed the
> transaction
> > log would get replayed.  We have a 4 node cluster with 24 shards.
> >
> > -shawn
>


AW: losing records during solr updates

2017-03-27 Thread Ercan Karadeniz
6.4.2 (latest available) or shall I use another one for familiarization 
purposes?



From: Alexandre Rafalovitch
Sent: Monday, 27 March 2017 21:28
To: solr-user
Subject: Re: losing records during solr updates

What version of Solr is it?

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced





On 27 March 2017 at 15:25, Shawn Feldman  wrote:
> When we restart solr on a leader node while we are doing updates, we've
> noticed that some small percentage of data is lost.  maybe 9 records out of
> 1k.  Updating using min_rf=3 or full quorum seems to resolve this since our
> rf = 3.  Updates then seem to only succeed when all nodes are back up. Why
> would we see record loss during a node restart?  I assumed the transaction
> log would get replayed.  We have a 4 node cluster with 24 shards.
>
> -shawn


Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
6.4.2

On Mon, Mar 27, 2017 at 1:29 PM Alexandre Rafalovitch 
wrote:

> What version of Solr is it?
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 27 March 2017 at 15:25, Shawn Feldman  wrote:
> > When we restart solr on a leader node while we are doing updates, we've
> > noticed that some small percentage of data is lost.  maybe 9 records out
> of
> > 1k.  Updating using min_rf=3 or full quorum seems to resolve this since
> our
> > rf = 3.  Updates then seem to only succeed when all nodes are back up.
> Why
> > would we see record loss during a node restart?  I assumed the
> transaction
> > log would get replayed.  We have a 4 node cluster with 24 shards.
> >
> > -shawn
>


Unexplainable indexing i/o errors

2017-03-27 Thread simon
I'm seeing an odd error during indexing for which I can't find any reason.

The relevant solr log entry:

2017-03-24 19:09:35.363 ERROR (commitScheduler-30-thread-1) [
x:build0324] o.a.s.u.CommitTracker auto commit
error...:java.io.EOFException: read past EOF:
 MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
 at
org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
...
Suppressed: org.apache.lucene.index.CorruptIndexException: checksum
status indeterminate: remaining=0, please run checkindex for more details
(resource=
BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")))
 at
org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
 at
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:140)

followed within a few seconds by:

 2017-03-24 19:09:56.402 ERROR (commitScheduler-31-thread-1) [
x:build0324] o.a.s.u.CommitTracker auto commit
error...:org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
...
Caused by: java.io.EOFException: read past EOF:
MMapIndexInput(path="/indexes/solrindexes/build0324/index/_4ku.fdx")
at
org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)

This error is repeated a few times as the indexing continued and further
autocommits were triggered.

I stopped the indexing process, made a backup snapshot of the index,
restarted indexing at a checkpoint, and everything then completed without
further incident.

I ran CheckIndex on the saved snapshot and it reported no errors
whatsoever. Operations on the complete index (including an optimize and
several query scripts) have all been error-free.
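For anyone wanting to repeat that verification, CheckIndex can be run directly against such a snapshot; one way (the jar path and version are illustrative — use the lucene-core jar matching the index version):

```shell
# Read-only check of the saved snapshot; add -verbose for per-segment detail.
java -ea -cp lucene-core-6.3.0.jar \
  org.apache.lucene.index.CheckIndex /indexes/solrindexes/build0324.bad/index
```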

Some background:
 Solr information from the beginning of the checkindex output:
 ---
 Opening index @ /indexes/solrindexes/build0324.bad/index

Segments file=segments_9s numSegments=105 version=6.3.0
id=7m1ldieoje0m6sljp7xocbz9l userData={commitTimeMSec=1490400514324}
  1 of 105: name=_be maxDoc=1227144
version=6.3.0
id=7m1ldieoje0m6sljp7xocburb
codec=Lucene62
compound=false
numFiles=14
size (MB)=4,926.186
diagnostics = {os=Linux, java.vendor=Oracle Corporation,
java.version=1.8.0_45, java.vm.version=25.45-b02, lucene.version=6.3.0,
mergeMaxNumSegments=-1, os.arch=amd64, java.runtime.version=1.8.0_45-b13,
source=merge, mergeFactor=19, os.version=3.10.0-229.1.2.el7.x86_64,
timestamp=1490380905920}
no deletions
test: open reader.OK [took 0.176 sec]
test: check integrity.OK [took 37.399 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [49 fields] [took 0.000 sec]
test: field norms.OK [17 fields] [took 0.030 sec]
test: terms, freq, prox...OK [14568108 terms; 612537186 terms/docs
pairs; 801208966 tokens] [took 30.005 sec]
test: stored fields...OK [150164874 total field count; avg 122.4
fields per doc] [took 35.321 sec]
test: term vectorsOK [4804967 total term vector count; avg 3.9
term/freq vector fields per doc] [took 55.857 sec]
test: docvalues...OK [4 docvalues fields; 0 BINARY; 1 NUMERIC;
2 SORTED; 0 SORTED_NUMERIC; 1 SORTED_SET] [took 0.954 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]
  -

 The indexing process is a Python script (using the scorched Python
client) which spawns multiple instances of itself, in this case 6, so there
are definitely concurrent calls (to /update/json).

Solrconfig and the schema have not been changed for several months, during
which time many ingests have been done, and the documents which were being
indexed at the time of the error have been indexed before without problems,
so I don't think it's a data issue.

I saw the same error occur earlier in the day, and decided at that time to
delete the core and restart the Solr instance.

The server is an Amazon instance running CentOS 7. I checked the system
logs and didn't see any evidence of hardware errors

I'm puzzled as to why this would start happening out of the blue, and I
can't find any particularly relevant posts on this forum or Stack Exchange.
Anyone have an idea what's going on?

-Simon


AW: Newbie in Solr

2017-03-27 Thread Ercan Karadeniz
Hi Alexandre,


is my understanding correct that when I use the "managed-schema" file for the
Solr configuration, it is NOT running in schemaless mode?


Regards,

Ercan



From: Alexandre Rafalovitch
Sent: Friday, 24 March 2017 01:00
To: solr-user
Subject: Re: Newbie in Solr

Glad to hear you liked my site. You can find the truly minimal
(non-production) example at https://github.com/arafalov/simplest-solr-config.
It is not that scary.

If you are looking at the database import, you may also want to review my
work in progress on simplifying DIH DB example at:
https://issues.apache.org/jira/browse/SOLR-10312 (need to change
luceneMatchVersion in solrconfig.xml).

There is also a lot of eCommerce integration with Solr out there, but not
for this system. More for eZ Commerce, I think.

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced




On 23 March 2017 at 16:37, Ercan Karadeniz 
wrote:

> Hi All,
>
>
> I'm a newbie in Solr.
>
>
> I have the task to replace the built-in search functionality of an online
> shop system (xtcmodified commerce, a German online shop system =>
> https://www.modified-shop.org/) with Solr.
>
>
> Currently I'm trying to get familiarized with Solr 6 and understand the
> operations of Solr.
>
>
> I found the website solr-start.com from Alexandre; the video was very
> useful as an introduction. On the other hand, it scared me a little bit
> since Solr has so many configuration parameters.
>
>
> I have performed the installation and I'm currently analyzing the
> available examples.
>
>
> As a next steps I need to export the data from the mysql database to Solr.
>
>
> Does anyone here have experience with Solr and e-commerce integration?
>
>
> Any example or best practices which you can share with me?
>
>
> Any feedback is welcome. Thanks in advance!
>
>
> Best regards,
>
> Ercan
>
>
>
>
>
>
>


Re: losing records during solr updates

2017-03-27 Thread Shawn Feldman
We are also hard committing at 15 sec and soft committing at 30 sec. I've
found that if we change syncLevel to fsync, we don't lose any data.
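For reference, the cadence and sync level being described map onto solrconfig.xml roughly like this (the values mirror the ones mentioned in the thread; openSearcher=false is an assumption):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- FSYNC forces the transaction log to disk on every write; slower,
         but the log survives an abrupt process or machine death -->
    <str name="syncLevel">FSYNC</str>
  </updateLog>
  <autoCommit>
    <maxTime>15000</maxTime>        <!-- hard commit every 15 s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>30000</maxTime>        <!-- soft commit every 30 s -->
  </autoSoftCommit>
</updateHandler>
```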

On Mon, Mar 27, 2017 at 1:30 PM Shawn Feldman 
wrote:

> 6.4.2
>
> On Mon, Mar 27, 2017 at 1:29 PM Alexandre Rafalovitch 
> wrote:
>
> What version of Solr is it?
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 27 March 2017 at 15:25, Shawn Feldman  wrote:
> > When we restart solr on a leader node while we are doing updates, we've
> > noticed that some small percentage of data is lost.  maybe 9 records out
> of
> > 1k.  Updating using min_rf=3 or full quorum seems to resolve this since
> our
> > rf = 3.  Updates then seem to only succeed when all nodes are back up.
> Why
> > would we see record loss during a node restart?  I assumed the
> transaction
> > log would get replayed.  We have a 4 node cluster with 24 shards.
> >
> > -shawn
>
>


Re: losing records during solr updates

2017-03-27 Thread Alexandre Rafalovitch
What version of Solr is it?

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 27 March 2017 at 15:25, Shawn Feldman  wrote:
> When we restart solr on a leader node while we are doing updates, we've
> noticed that some small percentage of data is lost.  maybe 9 records out of
> 1k.  Updating using min_rf=3 or full quorum seems to resolve this since our
> rf = 3.  Updates then seem to only succeed when all nodes are back up. Why
> would we see record loss during a node restart?  I assumed the transaction
> log would get replayed.  We have a 4 node cluster with 24 shards.
>
> -shawn


losing records during solr updates

2017-03-27 Thread Shawn Feldman
When we restart solr on a leader node while we are doing updates, we've
noticed that some small percentage of data is lost.  maybe 9 records out of
1k.  Updating using min_rf=3 or full quorum seems to resolve this since our
rf = 3.  Updates then seem to only succeed when all nodes are back up. Why
would we see record loss during a node restart?  I assumed the transaction
log would get replayed.  We have a 4 node cluster with 24 shards.

-shawn
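As a sketch of the min_rf approach mentioned above (collection name and document are placeholders): min_rf makes Solr report the achieved replication factor for an update, so the client can detect, and retry, updates that reached fewer replicas than desired.

```python
import json
from urllib.parse import urlencode

# Sketch only: "mycollection" and the document are placeholders.
# min_rf=3 asks Solr to report the achieved replication factor ("rf" in
# the response header); the client retries if rf comes back below 3.
params = urlencode({"min_rf": "3"})
url = "http://localhost:8983/solr/mycollection/update?" + params
payload = json.dumps([{"id": "doc1", "title_s": "example"}])
print(url)
```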


Re: Solr Delete By Id Out of memory issue

2017-03-27 Thread Rohit Kanchan
Thanks Erick for replying back. I have deployed the changes to production;
we will soon find out whether it is still causing OOMs. For commits we are
doing auto commits after 10K docs or 30 secs.
If I get time I will run a local test to check whether we hit OOMs because
of the 1K map entries or not. I will update this thread with my findings. I
really appreciate your and Chris's responses.

Thanks
Rohit


On Mon, Mar 27, 2017 at 10:47 AM, Erick Erickson 
wrote:

> Rohit:
>
> Well, whenever I see something like "I have this custom component..."
> I immediately want the problem to be demonstrated without that custom
> component before trying to debug Solr.
>
> As Chris explained, we can't clear the 1K entries. It's hard to
> imagine why keeping the last 1,000 entries around would cause OOMs.
>
> You haven't demonstrated yet that after your latest change you still
> get OOMs, you've just assumed so. After running for a "long time" do
> you still see the problem after your changes?
>
> So before assuming it's a Solr bug, and after you demonstrate that
> your latest change didn't solve the problem, you should try two
> things:
>
> 1> as I suggested and Chris endorsed, try committing upon occasion
> from your custom component. Or set your autocommit settings
> appropriately if you haven't already.
>
> 2> run your deletes from the client as a test. You've created a custom
> URP component because you "didn't want to run the queries from the
> client". That's perfectly reasonable, it's just that to know where you
> should be looking deleting from the client would eliminate your custom
> code and tell us where to focus.
>
> Best,
> Erick
>
>
>
> On Sat, Mar 25, 2017 at 1:21 PM, Rohit Kanchan 
> wrote:
> > I think we figured out the issue. When we were converting delete-by-query
> > in a Solr handler we were not making a deep copy of BytesRef. We were
> > keeping a reference to the same object, which was causing the oldDeletes
> > map (a LinkedHashMap) to accumulate more than 1K entries.
> >
> > But I think it is still not clearing those 1K entries. It will eventually
> > throw an OOM, because UpdateLog is not a singleton: when there are many
> > delete-by-id requests and the server is not restarted for a very long
> > time, it will eventually throw an OOM. I think we should clear this map
> > when we are committing. I am not a committer; it would be great to get a
> > reply from a committer. What do you guys think?
> >
> > Thanks
> > Rohit
> >
> >
> > On Wed, Mar 22, 2017 at 1:36 PM, Rohit Kanchan 
> > wrote:
> >
> >> For commits we are relying on auto commits. We have defined the
> >> following in our configs:
> >>
> >> <autoCommit>
> >>   <maxDocs>10000</maxDocs>
> >>   <maxTime>30000</maxTime>
> >>   <openSearcher>false</openSearcher>
> >> </autoCommit>
> >>
> >> <autoSoftCommit>
> >>   <maxTime>15000</maxTime>
> >> </autoSoftCommit>
> >>
> >> One thing I would like to mention is that we are not calling deleteById
> >> directly from the client. We have created an update chain and added a
> >> processor there. In this processor we query first, collect the BytesRef
> >> entries from the BytesRefHash, and set each one as the indexedId. After
> >> collecting the indexedIds we use those ids to call deleteById. We do
> >> this because we do not want to query Solr from the client before
> >> deleting. It is possible that there is a bug in this code, but I am not
> >> sure, because when I run tests locally it does not show any issues. I am
> >> trying to remote debug now.
> >> trying to remote debug now.
> >>
> >> Thanks
> >> Rohit
> >>
> >>
> >> On Wed, Mar 22, 2017 at 9:57 AM, Chris Hostetter <
> hossman_luc...@fucit.org
> >> > wrote:
> >>
> >>>
> >>> : OK, The whole DBQ thing baffles the heck out of me so this may be
> >>> : totally off base. But would committing help here? Or at least be
> worth
> >>> : a test?
> >>>
> >>> this isn't DBQ -- the OP specifically said deleteById, and that the
> >>> oldDeletes map (only used for DBI) was the problem according to the heap
> >>> dumps they looked at.
> >>>
> >>> I suspect you are correct about the root cause of the OOMs ... perhaps
> the
> >>> OP isn't using hard/soft commits effectively enough and the uncommitted
> >>> data is what's causing the OOM ... hard to say w/o more details. or
> >>> confirmation of exactly what the OP was looking at in their claim below
> >>> about the heap dump
> >>>
> >>>
> >>> : > : Thanks for replying. We are using Solr 6.1 version. Even I saw
> that
> >>> it is
> >>> : > : bounded by 1K count, but after looking at heap dump I was amazed
> >>> how can it
> >>> : > : keep more than 1K entries. But Yes I see around 7M entries
> >>> according to
> >>> : > : heap dump and around 17G of memory occupied by BytesRef there.
> >>> : >
> >>> : > what exactly are you looking at when you say you see "7M entries" ?
> >>> : >
> >>> : > are you sure you aren't confusing the keys in oldDeletes with other
> >>> : > instances of BytesRef in the JVM?
> >>>
> >>>
> >>> -Hoss
> 

Re: Solr Delete By Id Out of memory issue

2017-03-27 Thread Erick Erickson
Rohit:

Well, whenever I see something like "I have this custom component..."
I immediately want the problem to be demonstrated without that custom
component before trying to debug Solr.

As Chris explained, we can't clear the 1K entries. It's hard to
imagine why keeping the last 1,000 entries around would cause OOMs.

You haven't demonstrated yet that after your latest change you still
get OOMs, you've just assumed so. After running for a "long time" do
you still see the problem after your changes?

So before assuming it's a Solr bug, and after you demonstrate that
your latest change didn't solve the problem, you should try two
things:

1> as I suggested and Chris endorsed, try committing upon occasion
from your custom component. Or set your autocommit settings
appropriately if you haven't already.

2> run your deletes from the client as a test. You've created a custom
URP component because you "didn't want to run the queries from the
client". That's perfectly reasonable, it's just that to know where you
should be looking deleting from the client would eliminate your custom
code and tell us where to focus.

Best,
Erick



On Sat, Mar 25, 2017 at 1:21 PM, Rohit Kanchan  wrote:
> I think we figured out the issue. When we were converting delete-by-query
> in a Solr handler we were not making a deep copy of BytesRef. We were
> keeping a reference to the same object, which was causing the oldDeletes
> map (a LinkedHashMap) to accumulate more than 1K entries.
>
> But I think it is still not clearing those 1K entries. It will eventually
> throw an OOM, because UpdateLog is not a singleton: when there are many
> delete-by-id requests and the server is not restarted for a very long
> time, it will eventually throw an OOM. I think we should clear this map
> when we are committing. I am not a committer; it would be great to get a
> reply from a committer. What do you guys think?
>
> Thanks
> Rohit
>
>
> On Wed, Mar 22, 2017 at 1:36 PM, Rohit Kanchan 
> wrote:
>
>> For commits we are relying on auto commits. We have defined the following
>> in our configs:
>>
>> <autoCommit>
>>   <maxDocs>10000</maxDocs>
>>   <maxTime>30000</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>> <autoSoftCommit>
>>   <maxTime>15000</maxTime>
>> </autoSoftCommit>
>>
>> One thing I would like to mention is that we are not calling deleteById
>> directly from the client. We have created an update chain and added a
>> processor there. In this processor we query first, collect the BytesRef
>> entries from the BytesRefHash, and set each one as the indexedId. After
>> collecting the indexedIds we use those ids to call deleteById. We do
>> this because we do not want to query Solr from the client before
>> deleting. It is possible that there is a bug in this code, but I am not
>> sure, because when I run tests locally it does not show any issues. I am
>> trying to remote debug now.
>>
>> Thanks
>> Rohit
>>
>>
>> On Wed, Mar 22, 2017 at 9:57 AM, Chris Hostetter > > wrote:
>>
>>>
>>> : OK, The whole DBQ thing baffles the heck out of me so this may be
>>> : totally off base. But would committing help here? Or at least be worth
>>> : a test?
>>>
>>> this isn't DBQ -- the OP specifically said deleteById, and that the
>>> oldDeletes map (only used for DBI) was the problem according to the heap
>>> dumps they looked at.
>>>
>>> I suspect you are correct about the root cause of the OOMs ... perhaps the
>>> OP isn't using hard/soft commits effectively enough and the uncommitted
>>> data is what's causing the OOM ... hard to say w/o more details. or
>>> confirmation of exactly what the OP was looking at in their claim below
>>> about the heap dump
>>>
>>>
>>> : > : Thanks for replying. We are using Solr 6.1 version. Even I saw that
>>> it is
>>> : > : bounded by 1K count, but after looking at heap dump I was amazed
>>> how can it
>>> : > : keep more than 1K entries. But Yes I see around 7M entries
>>> according to
>>> : > : heap dump and around 17G of memory occupied by BytesRef there.
>>> : >
>>> : > what exactly are you looking at when you say you see "7M entries" ?
>>> : >
>>> : > are you sure you aren't confusing the keys in oldDeletes with other
>>> : > instances of BytesRef in the JVM?
>>>
>>>
>>> -Hoss
>>> http://www.lucidworks.com/
>>>
>>
>>


RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
See also:

http://stackoverflow.com/a/39792337/6281268

This includes jai.

Most importantly: be aware of the licensing implications of using levigo and 
jai.  If they had been Apache 2.0 compatible, we would have included them.

Finally, there's a new option (coming out in Tika 1.15) that renders each PDF 
page as a single image before running OCR on it.  We found a couple of crazy 
PDFs that had 1000s of images where a single image was used to represent one 
line in a table (and I don't mean row, I mean a literal line in a table).

That "new" option is documented on our wiki:

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
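As a rough illustration of how those switches are wired up, here is a tika-config sketch. extractInlineImages is the documented PDFParser option; ocrStrategy is the page-rendering option described above, and its exact name and accepted values may differ by Tika version, so check the wiki page linked above:

```xml
<!-- Sketch of a tika-config for OCR'ing PDFs. extractInlineImages is the
     documented PDFParser switch; ocrStrategy (e.g. "ocr_only" to render
     each page as one image before OCR) is version-dependent. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
```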

Finally (I mean it this time), I've updated our wiki to mention the two 
optional dependencies.  Thank you.

Cheers,

  Tim

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Monday, March 27, 2017 11:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

I tried this solution from Tim Allison, and it works.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin

On 27 March 2017 at 20:07, Allison, Timothy B.  wrote:

> Please also see:
>
> https://wiki.apache.org/tika/TikaOCR
>
> and
>
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
>
> If you have any other questions about Apache Tika and OCR, please feel 
> free to ask on our users list as well: u...@tika.apache.org
>
> Cheers,
>
>Tim
>
> -Original Message-
> From: Arian Pasquali [mailto:arianpasqu...@gmail.com]
> Sent: Sunday, March 26, 2017 11:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Index scanned documents
>
Hi Waleed,
>
> I've never done that with solr, but you would probably need to use 
> some OCR preprocessing before indexing.
> The most popular library I know for the job is tesseract-ocr <
> https://github.com/tesseract-ocr>.
>
> If you want to do that inside solr I've found that Tika has some 
> support for that too.
> Take a look Vijay Mhaskar's post on how to do this using TikaOCR
>
> http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
>
> I hope that guides you
>
> Em dom, 26 de mar de 2017 às 16:09, Waleed Raza < 
> waleed.raza.parhi...@gmail.com> escreveu:
>
> > Hello
> > I want to ask how we can extract text in Solr from images
> > which are inside PDF and MS Office documents.
> > i found many websites but did not get a reply of it please guide me.
> >
> > On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza < 
> > waleed.raza.parhi...@gmail.com
> > > wrote:
> >
> > > Hello
> > > I want to ask how we can extract text in Solr from images
> > > which are inside PDF and MS Office documents.
> > > i found many websites but did not get a reply of it please guide me.
> > >
> > >
> >
> --
> [image: INESC TEC]
>
> *Arian Rodrigo Pasquali*
> Laboratório de Inteligência Artificial e Apoio à Decisão Laboratory of 
> Artificial Intelligence and Decision Support
>
> *INESC TEC*
> Campus da FEUP
> Rua Dr Roberto Frias
> 4200-465 Porto
> Portugal
>
> T +351 22 040 2963
> F +351 22 209 4050
> arian.r.pasqu...@inesctec.pt
> www.inesctec.pt
>


Re: Index scanned documents

2017-03-27 Thread Zheng Lin Edwin Yeo
I tried this solution from Tim Allison, and it works.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin

On 27 March 2017 at 20:07, Allison, Timothy B.  wrote:

> Please also see:
>
> https://wiki.apache.org/tika/TikaOCR
>
> and
>
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
>
> If you have any other questions about Apache Tika and OCR, please feel
> free to ask on our users list as well: u...@tika.apache.org
>
> Cheers,
>
>Tim
>
> -Original Message-
> From: Arian Pasquali [mailto:arianpasqu...@gmail.com]
> Sent: Sunday, March 26, 2017 11:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Index scanned documents
>
Hi Waleed,
>
> I've never done that with solr, but you would probably need to use some
> OCR preprocessing before indexing.
> The most popular library I know for the job is tesseract-ocr <
> https://github.com/tesseract-ocr>.
>
> If you want to do that inside solr I've found that Tika has some support
> for that too.
> Take a look Vijay Mhaskar's post on how to do this using TikaOCR
>
> http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
>
> I hope that guides you
>
> Em dom, 26 de mar de 2017 às 16:09, Waleed Raza <
> waleed.raza.parhi...@gmail.com> escreveu:
>
> > Hello
> > I want to ask how we can extract text in Solr from images
> > which are inside PDF and MS Office documents.
> > i found many websites but did not get a reply of it please guide me.
> >
> > On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza <
> > waleed.raza.parhi...@gmail.com
> > > wrote:
> >
> > > Hello
> > > I want to ask how we can extract text in Solr from images
> > > which are inside PDF and MS Office documents.
> > > i found many websites but did not get a reply of it please guide me.
> > >
> > >
> >
> --
> [image: INESC TEC]
>
> *Arian Rodrigo Pasquali*
> Laboratório de Inteligência Artificial e Apoio à Decisão Laboratory of
> Artificial Intelligence and Decision Support
>
> *INESC TEC*
> Campus da FEUP
> Rua Dr Roberto Frias
> 4200-465 Porto
> Portugal
>
> T +351 22 040 2963
> F +351 22 209 4050
> arian.r.pasqu...@inesctec.pt
> www.inesctec.pt
>


Re: Schema API: Modify Unique Key

2017-03-27 Thread Shawn Heisey
On 3/27/2017 7:05 AM, nabil Kouici wrote:
> We're going to use Solr in our organization (under test) and we want
> to set the primary key through schema API, which is not allowed today.
> Is this function planned to be implemented in Solr? If yes, do you
> have any idea in which version? 

Steve Rowe has been working on it, as he mentioned.  I have asked him a
question via the SOLR-7242 issue.

I can think of two reasons that this functionality has NOT been written yet:

1) In Cloud mode on a distributed index, it is unlikely that the
existing collection will have the documents in the correct shards.
A complete reindex is strongly recommended in these situations.

2) Before changing the uniqueKey, you must be absolutely certain that
the field is the appropriate type and that the field does not contain
the same value more than once.  If this is not the case, Solr will not
behave correctly.

Thanks,
Shawn



Re: Streaming expressions - Any plans to add one to many fetches to the fetch decorator?

2017-03-27 Thread Joel Bernstein
Yes, one to many fetches will be implemented.

At the moment there isn't a workaround that I can think of.

If you decide to work on a patch for fetch I'll review the patch.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 27, 2017 at 2:33 PM, adfel70  wrote:

> Any ideas how to workaround this with the current streaming capabilities?
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Streaming-expressions-Any-plans-to-add-one-to-many-fetches-to-the-fetch-decorator-tp4326989.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Multi word synonyms

2017-03-27 Thread Doug Turnbull
Fantastic!
On Mon, Mar 27, 2017 at 9:56 AM alessandro.benedetti 
wrote:

> In addition to what Doug has already pointed out, I would like to highlight
> this contribution in Solr 6.5.0.
> It may seem like a small, innocent patch, but it actually opens a new world
> for one of the most controversial aspects of Solr query parsing:
>
> http://issues.apache.org/jira/browse/SOLR-9185
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp4326863p4326998.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.175.190.72:8999/solr/product: Rollback is currently not supported in SolrCloud mode. (SOLR-4895

2017-03-27 Thread Shawn Heisey
On 3/27/2017 4:37 AM, Mikhail Ibraheem wrote:
> Any help please?
>
> -Original Message-
> From: Mikhail Ibraheem 
> Sent: 26 March 2017, 10:22 PM
> To: solr-user@lucene.apache.org
> Subject: 
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
> from server at http://10.175.190.72:8999/solr/product: Rollback is currently 
> not supported in SolrCloud mode. (SOLR-4895)
>
> Hi,
>
> When I try to rollback in solrCloud I get this exception :

That IS what the message says.

SolrCloud almost universally means that the user is running a
distributed or replicated index, quite possibly both.  Some of the
really smart people working on the software determined that rollback
cannot be supported under these conditions.  The message also references
a Solr issue where the message was created:

https://issues.apache.org/jira/browse/SOLR-4895

The message was added before the 5.0 release.

Thanks,
Shawn



Re: Multi word synonyms

2017-03-27 Thread alessandro.benedetti
In addition to what Doug has already pointed out, I would like to highlight
this contribution in Solr 6.5.0.
It may seem like a small, innocent patch, but it actually opens a new world
for one of the most controversial aspects of Solr query parsing:

http://issues.apache.org/jira/browse/SOLR-9185

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp4326863p4326998.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Schema API: Modify Unique Key

2017-03-27 Thread Steve Rowe
Hi Nabil,

There is an open JIRA issue to implement this functionality, but I haven’t had 
a chance to work on it recently: 
.  Consequently, I’m not sure 
which release will have it.

Patches welcome!

--
Steve
www.lucidworks.com

> On Mar 27, 2017, at 9:05 AM, nabil Kouici  wrote:
> 
> Hi All,
> 
> 
> 
> We're going to use Solr in our organization (under test) and we want to set 
> the primary key through schema API, which is not allowed today. Is this 
> function planned to be implemented in Solr? If yes, do you have any idea in 
> which version?
> Regards,
> Nabil.
> 



Re: Version upgrading approaches

2017-03-27 Thread alessandro.benedetti
Based on what I have noticed so far, the strongest driver for a migration is
an upcoming new feature or bugfix.
It's usually the only way to convince the business layer in small/mid-size
companies that are not tech oriented.

In general I would say it is quite important to avoid lagging too far behind
(keeping the difference in major versions < 2).
A last observation based on experience: never just migrate to the latest
version without a deep check of the changelists and community resources.
It is quite common that a version 6.x.0 is released, then some regression
or bug is discovered, and a 6.x.1 (sometimes a 6.x.2) is released.
So it is always good to investigate the community and the changelists to
find the minimum (safe) Solr version.

In relation to that, I actually have a concern, as there are some "famous"
Solr versions affected by bugs.
I don't know what happens in those cases, but I would like to see those
releases made impossible to download/install after the bug has been fixed (I
think a recent example was 6.4.0, which had a big regression).


Regards




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Version-upgrading-approaches-tp4326976p4326993.html
Sent from the Solr - User mailing list archive at Nabble.com.


Streaming expressions - Any plans to add one to many fetches to the fetch decorator?

2017-03-27 Thread adfel70
Any ideas how to workaround this with the current streaming capabilities?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Streaming-expressions-Any-plans-to-add-one-to-many-fetches-to-the-fetch-decorator-tp4326989.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Classify document using bag of words

2017-03-27 Thread alessandro.benedetti
Hi marotosg,
John's suggestion will definitely work (I recommend a copyField for
that analysis).

What happens in your use case if a word is shared by more than one bag
of words (if that is possible at all in your use case)?
Do you expect to get back all the classes, scored in some way?

In that case you may need a different approach, and Solr Document
Classification should help.
At the moment the only available integration is the indexing-time one
(which means you don't have control over human validation: Solr is going to
assign the class, or classes, and you just decide the output field).
The documentation was not very up to date; I just updated it [1].

In case you would like a different approach (including human validation),
there is a Jira issue for a request-handler approach that could be called by
your indexing application to ask for human feedback before the document is
sent to Solr; a contribution is welcome! [2]

Cheers

[1]  https://wiki.apache.org/solr/SolrClassification
[2]  https://issues.apache.org/jira/browse/SOLR-7738
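As a sketch of the indexing-time integration mentioned in [1], the update chain would look roughly like this in solrconfig.xml. The field names are placeholders, and the parameter names should be checked against the wiki page for your Solr version:

```xml
<!-- Sketch only: "title,content" and "category" are placeholder fields. -->
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title,content</str>
    <str name="classField">category</str>
    <str name="algorithm">knn</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```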



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Classify-document-using-bag-of-words-tp4326865p4326988.html
Sent from the Solr - User mailing list archive at Nabble.com.


Schema API: Modify Unique Key

2017-03-27 Thread nabil Kouici
Hi All,



We're going to use Solr in our organization (under test) and we want to set the 
primary key through schema API, which is not allowed today. Is this function 
planned to be implemented in Solr? If yes, do you have any idea in which 
version?
Regards,
Nabil.

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
Please also see: 

https://wiki.apache.org/tika/TikaOCR

and

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

If you have any other questions about Apache Tika and OCR, please feel free to 
ask on our users list as well: u...@tika.apache.org

Cheers,

   Tim

-Original Message-
From: Arian Pasquali [mailto:arianpasqu...@gmail.com] 
Sent: Sunday, March 26, 2017 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents

Hi Waleed,

I've never done that with solr, but you would probably need to use some OCR 
preprocessing before indexing.
The most popular library I know for the job is tesseract-ocr
<https://github.com/tesseract-ocr>.

If you want to do that inside solr I've found that Tika has some support for 
that too.
Take a look Vijay Mhaskar's post on how to do this using TikaOCR

http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/

I hope that guides you

Em dom, 26 de mar de 2017 às 16:09, Waleed Raza < 
waleed.raza.parhi...@gmail.com> escreveu:

> Hello
> I want to ask how we can extract text in Solr from images
> which are inside PDF and MS Office documents.
> i found many websites but did not get a reply of it please guide me.
>
> On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza < 
> waleed.raza.parhi...@gmail.com
> > wrote:
>
> > Hello
> > I want to ask how we can extract text in Solr from images
> > which are inside PDF and MS Office documents.
> > i found many websites but did not get a reply of it please guide me.
> >
> >
>
--
[image: INESC TEC]

*Arian Rodrigo Pasquali*
Laboratório de Inteligência Artificial e Apoio à Decisão Laboratory of 
Artificial Intelligence and Decision Support

*INESC TEC*
Campus da FEUP
Rua Dr Roberto Frias
4200-465 Porto
Portugal

T +351 22 040 2963
F +351 22 209 4050
arian.r.pasqu...@inesctec.pt
www.inesctec.pt


Version upgrading approaches

2017-03-27 Thread John Blythe
Hi all.

New versions of Solr come out in pretty regular fashion. We are
currently on 6.0. I'm curious what drives you / your team to run the
upgrades when you do. Particular features or patches you're eyeballing?
Only concerned with major releases? Some other protocol that is set internally?
-- 
-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713


Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-27 Thread Emir Arnautovic

It seems to me that you are looking for Solr's highlighting functionality:

https://cwiki.apache.org/confluence/display/solr/Highlighting

HTH,
Emir
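A minimal sketch of such a highlighting request, built as a plain URL; the collection and field names are assumptions, and the hl.* parameters are the standard ones from the page above:

```python
from urllib.parse import urlencode

# Sketch: "mycollection" and the "body" field are placeholders.
params = {
    "q": "body:keyword",
    "hl": "true",              # turn highlighting on
    "hl.fl": "body",           # stored field(s) to highlight
    "hl.snippets": "3",        # snippets returned per field
    "hl.simple.pre": "<em>",   # markup wrapped around each match
    "hl.simple.post": "</em>",
}
url = "http://localhost:8983/solr/mycollection/select?" + urlencode(params)
print(url)
```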


On 27.03.2017 09:09, forest_soup wrote:

We are going to implement a feature:
When opening a document whose body field is already indexed in Solr, if we
issued a keyword search before opening the doc, highlight the keyword in the
opened document.

That needs the position/offset info of the keyword in the doc's index, which
I think can be indexed or stored in Solr in some way. And we are searching
for ways to retrieve that info through any Solr API.

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-retrieve-the-a-term-s-position-offset-in-Solr-tp4326931.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



RE: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.175.190.72:8999/solr/product: Rollback is currently not supported in SolrCloud mode. (SOLR-4895

2017-03-27 Thread Mikhail Ibraheem
Any help please?

-Original Message-
From: Mikhail Ibraheem 
Sent: 26 March 2017, 10:22 PM
To: solr-user@lucene.apache.org
Subject: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: 
Error from server at http://10.175.190.72:8999/solr/product: Rollback is 
currently not supported in SolrCloud mode. (SOLR-4895)

Hi,

When I try to rollback in solrCloud I get this exception :

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://10.175.190.72:8999/solr/product: Rollback is currently
not supported in SolrCloud mode. (SOLR-4895)

 

Does that mean there is no rollback with solrCloud?

Please advise.

 

Thanks

Mikhail


[ANNOUNCE] Apache Solr 6.5.0 released

2017-03-27 Thread jim ferenczi
27 March 2017, Apache Solr 6.5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.5.0.

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

Solr 6.5.0 is available for immediate download at:

   - http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Please read CHANGES.txt for a full list of new features and changes:

   - https://lucene.apache.org/solr/6_5_0/changes/Changes.html

Highlights of this Solr release include:

   - PointFields (fixed-width multi-dimensional numeric & binary types
   enabling fast range search) are now supported
   - In-place updates to numeric docValues fields (single valued,
   non-stored, non-indexed) supported using atomic update syntax
   - A new LatLonPointSpatialField that uses points or doc values for query
   - It is now possible to declare a field as "large" in order to bypass
   the document cache
   - New sow=false request param (split-on-whitespace) for edismax &
   standard query parsers enables query-time multi-term synonyms
   - XML QueryParser (defType=xmlparser) now supports span queries
   - hl.maxAnalyzedChars now have consistent default across highlighters
   - UnifiedSolrHighlighter and PostingsSolrHighlighter now support
   CustomSeparatorBreakIterator
   - Scoring formula is adjusted for the scoreNodes function
   - Calcite Planner now applies constant Reduction Rules to optimize plans
   - A new significantTerms Streaming Expression that is able to extract
   the significant terms in an index
   - StreamHandler is now able to use runtimeLib jars
   - Arithmetic operations are added to the SelectStream
   - Added modernized self-documenting /v2 API
   - The .system collection is now created on first request if it does not
   exist
   - Admin UI: Added shard deletion button
   - Metrics API now supports non-numeric metrics (version, disk type,
   component state, system properties...)
   - The disk free and aggregated disk free metrics are now reported
   - The DirectUpdateHandler2 now implements MetricsProducer and exposes
   stats via the metrics api and configured reporters.
   - BlockCache is faster due to fewer failures when caching a new block
   - MMapDirectoryFactory now supports "preload" option to ask mapped pages
   to be loaded into physical memory on init
   - Security: BasicAuthPlugin now supports standalone mode
   - Arbitrary java system properties can be passed to zkcli
   - SolrHttpClientBuilder can be configured via java system property
   - Javadocs and Changes.html are no longer included in the binary
   distribution, but are hosted online

Further details of changes are available in the change log available at:
http://lucene.apache.org/solr/6_5_0/changes/Changes.html

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html).
Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also applies to Maven access.



Is there a way to retrieve a term's position/offset in Solr

2017-03-27 Thread forest_soup
We are going to implement a feature: 
when opening a document whose body field is already indexed in Solr, and a
keyword search was issued before opening the doc, highlight the keyword in
the opened document. 

That requires the position/offset info of the keyword in the doc's index,
which I think can be indexed or stored in Solr in some way. We are looking
for a way to retrieve it through any Solr API.
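
One built-in route to this data is the TermVectorComponent: if the field is
indexed with termVectors="true", termPositions="true" and termOffsets="true"
in the schema, positions and offsets can be requested per document. A sketch
in Python of such a request, assuming the component is registered at a /tvrh
handler as in the Solr sample configs (collection, document id and field
names are illustrative):

```python
from urllib.parse import urlencode

# TermVectorComponent request: returns per-term positions/offsets for a
# document, provided the field was indexed with term vectors enabled.
base = "http://localhost:8983/solr/mycollection/tvrh"  # hypothetical core
params = {
    "q": "id:DOC1",          # fetch term vectors for one known document
    "fl": "id",
    "tv.fl": "body",         # the field whose term vectors we want
    "tv.positions": "true",  # include term positions in the response
    "tv.offsets": "true",    # include character offsets in the response
    "wt": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

An HTTP GET on this URL would return, per term of the "body" field, its
positions and start/end offsets, which a client can use to highlight the
keyword in the displayed document.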

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-retrieve-the-a-term-s-position-offset-in-Solr-tp4326931.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Newbie in Solr

2017-03-27 Thread Ercan Karadeniz
Hi Alexandre,


thanks for your response.


I will check the provided URLs and will probably bother you with questions.



Cheers,

Ercan



From: Alexandre Rafalovitch 
Sent: Friday, 24 March 2017 01:00
To: solr-user
Subject: Re: Newbie in Solr

Glad to hear you liked my site. You can find the truly minimal
(non-production) example at
https://github.com/arafalov/simplest-solr-config. It is not that scary.

If you are looking at the database import, you may also want to review my
work in progress on simplifying DIH DB example at:
https://issues.apache.org/jira/browse/SOLR-10312 (need to change
luceneMatchVersion in solrconfig.xml).
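
For reference, once a DIH handler is configured, a full import can be
triggered over HTTP. A small sketch in Python, assuming a DataImportHandler
registered at /dataimport with a JDBC data source already set up in
data-config.xml (the collection name "shop" is illustrative):

```python
from urllib.parse import urlencode

# Kicking off a DataImportHandler full import over HTTP.
base = "http://localhost:8983/solr/shop/dataimport"
params = {
    "command": "full-import",  # or "delta-import" for incremental updates
    "clean": "true",           # wipe the index before importing
    "commit": "true",          # commit when the import finishes
}
url = base + "?" + urlencode(params)
print(url)
```

A GET on the same handler with command=status reports the progress of a
running import.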

There is also a lot of eCommerce integration with Solr out there, but not
for this system. More for eZ Commerce, I think.

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced



On 23 March 2017 at 16:37, Ercan Karadeniz 
wrote:

> Hi All,
>
>
> I'm a newbie in Solr.
>
>
> I have the task to replace the built-in search functionality of an online
> shop system (xtcmodified commerce, a German online shop system =>
> https://www.modified-shop.org/) with Solr.
>
>
> Currently I'm trying to familiarize myself with Solr 6 and understand how
> Solr operates.
>
>
> I found Alexandre's website solr-start.com; the video from him was very
> useful as an introduction. On the other hand, it scared me a little bit,
> since Solr has so many configuration parameters.
>
>
> I have performed the installation and I'm currently analyzing the
> available examples.
>
>
> As a next step I need to export the data from the mysql database to Solr.
>
>
> Does anyone here have experience with Solr and e-commerce integration?
>
>
> Any example or best practices which you can share with me?
>
>
> Any feedback is welcome. Thanks in advance!
>
>
> Best regards,
>
> Ercan