Solarium Extension

2014-08-07 Thread pushkar sawant
Hi All,
I have installed the Solarium Search extension on Magento 1.7, and my Solr 4.9
is also working in the background.
My base OS is Ubuntu 13.10, on which Solr 4.9 is running.
When I go to check the extension in the Magento admin it only shows Test
Connection.
Please find the attached image.

Note: when installing the extension in Magento through the Content Manager it shows
the installation as done, but then gives an error. Please find the attached image for the same.


Data Import handler and join select

2014-08-07 Thread Alejandro Marqués Rodríguez
Hi,

I have a problem when indexing with the Data Import Handler while doing a
join select. I have two tables, one with products and another one with
descriptions for each product in several languages.

So it would be:

Products: ID, NAME, BRAND, PRICE, ...
Descriptions: ID, LANGUAGE, DESCRIPTION

I would like to have every product indexed as a document with a multivalued
field "language" which contains every language that has an associated
description, and several dynamic fields description_*, one for each language.

So it would be for example:

Id: 1
Name: Product
Brand: Brand
Price: 10
Languages: [es,en]
Description_es: Descripción en español
Description_en: English description

Our first approach was using sub-entities for the data import handler and
after implementing some transformers we had everything indexed as we
wanted. The sub-entity process added the descriptions for each language to
the solr document and then indexed them.

The problem was performance. I've read that using sub-entities affects
performance greatly, so we changed our process to use a join
instead.

Performance was greatly improved this way, but now we have a problem. Each
time a row is processed a Solr document is generated and indexed into Solr,
but the data is not merged with any previously indexed data; it replaces it.

If we had the previous example the query resulting from the join would be:

Id - Name - Brand - Price - Language - Description
1 - Product - Brand - 10 - es - Descripción en español
1 - Product - Brand - 10 - en - English description

So when indexing, since both rows have the same id, the only information I
end up with is from the second row.

Is there any way for data import handler to manage this and allow the
documents to be indexed updating any previous data?

Thanks in advance



-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Re: Solarium Extension

2014-08-07 Thread Shawn Heisey
On 8/7/2014 12:34 AM, pushkar sawant wrote:
 I have installed the Solarium Search extension on Magento 1.7, and my Solr
 4.9 is also working in the background.
 My base OS is Ubuntu 13.10, on which Solr 4.9 is running.
 When I go to check the extension in the Magento admin it only shows Test
 Connection.
 Please find the attached image.
 
 Note: when installing the extension in Magento through the Content Manager it shows
 the installation as done, but then gives an error. Please find the attached image for the same.

The list will eat most attachments.  Yours did not make it.

Chances are that you won't be able to get much help with Solarium or
Magento here.  You'll need to find a mailing list or another support
venue for those programs.  They were not created by the Apache Solr project.

Thanks,
Shawn



Cannot finish recovery due to always met ReplicationHandler SnapPull failed: Unable to download xxx.fdt completely

2014-08-07 Thread forest_soup
I have 2 Solr nodes (solr1 and solr2) in a SolrCloud.
After some issue happened, solr2 went into the recovering state. The PeerSync
could not finish within about 15 minutes, so it fell back to SnapPull.
But while doing the snap pull, it always hits the issue below. Meanwhile,
there are still update requests being sent to the recovering node (solr2) and the
good node (solr1), and the index on the recovering node is deleted and
rebuilt again and again, so it takes a long time to finish.

Is this a bug, or is it by Solr's design?
And could anyone help me accelerate the recovery process?

Thanks! 

2014-07-17 5:12:50 PM ERROR   ReplicationHandler  SnapPull failed
:org.apache.solr.common.SolrException: Unable to download _vdq.fdt
completely. Downloaded 0!=182945 
SnapPull failed :org.apache.solr.common.SolrException: Unable to download
_vdq.fdt completely. Downloaded 0!=182945 
   at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1305)
 
   at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1185)
 
   at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:771) 
   at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:421) 
   at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:322) 
   at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155) 
   at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437) 
   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247) 


We have the following settings in solrconfig.xml:
 <autoCommit>
   <maxDocs>1000</maxDocs>
   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
   <openSearcher>true</openSearcher>
 </autoCommit>

 <autoSoftCommit>
   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
 </autoSoftCommit>

and <maxIndexingThreads>8</maxIndexingThreads> is left at the default.

My solrconfig.xml is attached: solrconfig.xml
<http://lucene.472066.n3.nabble.com/file/n4151611/solrconfig.xml>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Cannot-finish-recovery-due-to-always-met-ReplicationHandler-SnapPull-failed-Unable-to-download-xxx-fy-tp4151611.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Cannot finish recovery due to always met ReplicationHandler SnapPull failed: Unable to download xxx.fdt completely

2014-08-07 Thread Shalin Shekhar Mangar
Why does PeerSync take so much time? Are these two nodes in different data
centers or are they connected by a slow link?


On Thu, Aug 7, 2014 at 12:41 PM, forest_soup tanglin0...@gmail.com wrote:

 I have 2 Solr nodes (solr1 and solr2) in a SolrCloud.
 After some issue happened, solr2 went into the recovering state. The PeerSync
 could not finish within about 15 minutes, so it fell back to SnapPull.
 But while doing the snap pull, it always hits the issue below. Meanwhile,
 there are still update requests being sent to the recovering node (solr2) and the
 good node (solr1), and the index on the recovering node is deleted and
 rebuilt again and again, so it takes a long time to finish.

 Is this a bug, or is it by Solr's design?
 And could anyone help me accelerate the recovery process?

 Thanks!

 2014-07-17 5:12:50 PM ERROR   ReplicationHandler  SnapPull failed
 :org.apache.solr.common.SolrException: Unable to download _vdq.fdt
 completely. Downloaded 0!=182945
 SnapPull failed :org.apache.solr.common.SolrException: Unable to download
 _vdq.fdt completely. Downloaded 0!=182945
at

 org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1305)
at

 org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1185)
at
 org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:771)
at
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:421)
at

 org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:322)
at
 org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
at

 org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)


 We have the following settings in solrconfig.xml:
  <autoCommit>
    <maxDocs>1000</maxDocs>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
  </autoSoftCommit>

 and <maxIndexingThreads>8</maxIndexingThreads> is left at the default.

 my solrconfig.xml is as attached.  solrconfig.xml
 http://lucene.472066.n3.nabble.com/file/n4151611/solrconfig.xml



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Cannot-finish-recovery-due-to-always-met-ReplicationHandler-SnapPull-failed-Unable-to-download-xxx-fy-tp4151611.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Cannot finish recovery due to always met ReplicationHandler SnapPull failed: Unable to download xxx.fdt completely

2014-08-07 Thread forest_soup
Thanks.
My environment is 2 VMs with a good network connection, so I'm not sure why this happened.
We are trying to reproduce it. The PeerSync failure log is:
2014-07-25 6:30:48 AM
WARN
SnapPuller
Error in fetching packets
java.io.EOFException
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:154)
at
org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:146)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1211)
at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1174)
at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:771)
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:421)
at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:322)
at
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Cannot-finish-recovery-due-to-always-met-ReplicationHandler-SnapPull-failed-Unable-to-download-xxx-fy-tp4151611p4151621.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Cannot finish recovery due to always met ReplicationHandler SnapPull failed: Unable to download xxx.fdt completely

2014-08-07 Thread forest_soup
I have opened one JIRA for it:
https://issues.apache.org/jira/browse/SOLR-6333



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Cannot-finish-recovery-due-to-always-met-ReplicationHandler-SnapPull-failed-Unable-to-download-xxx-fy-tp4151611p4151631.html
Sent from the Solr - User mailing list archive at Nabble.com.


org.apache.solr.common.SolrException: no servers hosting shard

2014-08-07 Thread forest_soup
I have 2 Solr nodes (solr1 and solr2) in a SolrCloud.
After this issue happened, solr2 went into the recovering state. And after it takes a
long time to finish recovery, this issue appears again and it goes into
recovery again. It happens again and again.

ERROR - 2014-08-04 21:12:27.917; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: no servers hosting shard: 
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:148)
at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:118)
at java.util.concurrent.FutureTask.run(FutureTask.java:273)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:482)
at java.util.concurrent.FutureTask.run(FutureTask.java:273)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
at java.lang.Thread.run(Thread.java:804)

We have the following settings in solrconfig.xml that differ from the defaults:

<maxIndexingThreads>24</maxIndexingThreads>
<ramBufferSizeMB>200</ramBufferSizeMB>
<maxBufferedDocs>1</maxBufferedDocs>

 <autoCommit>
   <maxDocs>1000</maxDocs>
   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
   <openSearcher>true</openSearcher>
 </autoCommit>
 <autoSoftCommit>
   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
 </autoSoftCommit>


<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="16384"
             autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache"
             size="16384"
             initialSize="16384"
             autowarmCount="4096"/>
<documentCache class="solr.LRUCache"
             size="16384"
             initialSize="16384"
             autowarmCount="4096"/>
<fieldValueCache class="solr.FastLRUCache"
             size="16384"
             autowarmCount="1024"
             showItems="32"/>
<queryResultWindowSize>50</queryResultWindowSize>

The full solrconfig.xml is attached:
solrconfig_perf0804.xml
http://lucene.472066.n3.nabble.com/file/n4151637/solrconfig_perf0804.xml  



--
View this message in context: 
http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-no-servers-hosting-shard-tp4151637.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr over hdfs for accessing/ changing indexes outside solr

2014-08-07 Thread Ali Nazemian
Thank you very much. But why should we go for Solr distributed with Hadoop?
There is already SolrCloud, which is quite applicable in the case of a big
index. Is there any advantage to sending indexes over MapReduce that
SolrCloud cannot provide?
Regards.


On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson erickerick...@gmail.com
wrote:

 bq: Are you aware of Cloudera search? I know they provide an integrated
 Hadoop ecosystem.

 What Cloudera Search does via the MapReduceIndexerTool (MRIT) is create N
 sub-indexes for
 each shard in the M/R paradigm via EmbeddedSolrServer. Eventually, these
 sub-indexes for
 each shard are merged (perhaps through some number of levels) in the reduce
 phase and
 maybe merged into a live Solr instance (--go-live). You'll note that this
 tool requires the
 address of the ZK ensemble from which it can get the network topology,
 configuration files,
 all that rot. If you don't use the --go-live option, the output is still a
 Solr index, it's just that
 the index for each shard is left in a specific directory on HDFS. Being on
 HDFS allows
 this kind of M/R paradigm for massively parallel indexing operations, and
 perhaps massively
 complex analysis.

 Nowhere is there any low-level non-Solr manipulation of the indexes.

 The Flume fork just writes directly to the Solr nodes. It knows about the
 ZooKeeper
 ensemble and the collection too and communicates via SolrJ I'm pretty sure.

 As far as integrating with HDFS, you're right, HA is part of the package.
 As far as using
 the Solr indexes for analysis, well you can write anything you want to use
 the Solr indexes
 from anywhere in the M/R world and have them available from anywhere in the
 cluster. There's
 no real need to even have Solr running, you could use the output from MRIT
 and access the
 sub-shards with the EmbeddedSolrServer if you wanted, leaving out all the
 pesky servlet
 container stuff.

 bq: So why we go for HDFS in the case of analysis if we want to use SolrJ
 for this purpose?
 What is the point?

 Scale and data access in a nutshell. In the HDFS world, you can scale
 pretty linearly
 with the number of nodes you can rack together.

 Frankly though, if your data set is small enough to fit on a single machine
 _and_ you can get
 through your analysis in a reasonable time (reasonable here is up to you),
 then HDFS
 is probably not worth the hassle. But in the big data world where we're
 talking petabyte scale,
 having HDFS as the underpinning opens up possibilities for working on data
 that were
 difficult/impossible with Solr previously.

 Best,
 Erick



 On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian alinazem...@gmail.com
 wrote:

  Dear Erick,
  I remembered some times ago, somebody asked about what is the point of
  modify Solr to use HDFS for storing indexes. As far as I remember
 somebody
  told him integrating Solr with HDFS has two advantages. 1) having hadoop
  replication and HA. 2) using indexes and Solr documents for other
 purposes
  such as Analysis. So why we go for HDFS in the case of analysis if we
 want
  to use SolrJ for this purpose? What is the point?
  Regards.
 
 
  On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian alinazem...@gmail.com
  wrote:
 
   Dear Erick,
   Hi,
   Thank you for you reply. Yeah I am aware that SolrJ is my last option.
 I
   was thinking about raw I/O operation. So according to your reply
 probably
   it is not applicable somehow. What about the Lily project that Michael
   mentioned? Is that consider SolrJ too? Are you aware of Cloudera
 search?
  I
   know they provide an integrated Hadoop ecosystem. Do you know what is
  their
   suggestion?
   Best regards.
  
  
  
   On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson 
 erickerick...@gmail.com
  
   wrote:
  
   What you haven't told us is what you mean by modify the
   index outside Solr. SolrJ? Using raw Lucene? Trying to modify
   things by writing your own codec? Standard Java I/O operations?
   Other?
  
   You could use SolrJ to connect to an existing Solr server and
   both read and modify at will form your M/R jobs. But if you're
   thinking of trying to write/modify the segment files by raw I/O
   operations, good luck! I'm 99.99% certain that's going to cause
   you endless grief.
  
   Best,
   Erick
  
  
   On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian alinazem...@gmail.com
   wrote:
  
Actually I am going to do some analysis on the solr data using map
   reduce.
For this purpose it might be needed to change some part of data or
 add
   new
fields from outside solr.
   
   
On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey s...@elyograg.org
  wrote:
   
 On 8/5/2014 7:04 AM, Ali Nazemian wrote:
  I changed solr 4.9 to write index and data on hdfs. Now I am
 going
   to
  connect to those data from the outside of solr for changing some
  of
   the
  values. Could somebody please tell me how that is possible?
  Suppose
   I
am
  using Hbase over hdfs for do these changes.


solr in classic asp project

2014-08-07 Thread Sandeep Bohra
I am using a classic ASP 3.0 application and would like to implement Solr
in it. My database is SQL Server, and it also connects to AS/400 using
batch processing. Can someone suggest a starting point?



Regards,
Sandeep


Re: solr in classic asp project

2014-08-07 Thread parnab kumar
Can you elaborate on how you plan to use SOLR in your project?

Parnab..
CSE, IIT Kharagpur



On Thu, Aug 7, 2014 at 12:51 PM, Sandeep Bohra 
sandeep.bo...@3pillarglobal.com wrote:

 I am using a classic ASP 3.0 application and would like to implement Solr
 in it. My database is SQL Server, and it also connects to AS/400 using
 batch processing. Can someone suggest a starting point?



 Regards,
 Sandeep



Re: solr over hdfs for accessing/ changing indexes outside solr

2014-08-07 Thread Erick Erickson
If SolrCloud meets your needs, without Hadoop, then
there's no real reason to introduce the added complexity.

There are a bunch of problems that do _not_ work
well with SolrCloud over non-Hadoop file systems. For
those problems, the combination of SolrCloud and Hadoop
make tackling them possible.

Best,
Erick


On Thu, Aug 7, 2014 at 3:55 AM, Ali Nazemian alinazem...@gmail.com wrote:

 Thank you very much. But why should we go for Solr distributed with Hadoop?
 There is already SolrCloud, which is quite applicable in the case of a big
 index. Is there any advantage to sending indexes over MapReduce that
 SolrCloud cannot provide?
 Regards.


 On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  bq: Are you aware of Cloudera search? I know they provide an integrated
  Hadoop ecosystem.
 
  What Cloudera Search does via the MapReduceIndexerTool (MRIT) is create N
  sub-indexes for
  each shard in the M/R paradigm via EmbeddedSolrServer. Eventually, these
  sub-indexes for
  each shard are merged (perhaps through some number of levels) in the
 reduce
  phase and
  maybe merged into a live Solr instance (--go-live). You'll note that this
  tool requires the
  address of the ZK ensemble from which it can get the network topology,
  configuration files,
  all that rot. If you don't use the --go-live option, the output is still
 a
  Solr index, it's just that
  the index for each shard is left in a specific directory on HDFS. Being
 on
  HDFS allows
  this kind of M/R paradigm for massively parallel indexing operations, and
  perhaps massively
  complex analysis.
 
  Nowhere is there any low-level non-Solr manipulation of the indexes.
 
  The Flume fork just writes directly to the Solr nodes. It knows about the
  ZooKeeper
  ensemble and the collection too and communicates via SolrJ I'm pretty
 sure.
 
  As far as integrating with HDFS, you're right, HA is part of the package.
  As far as using
  the Solr indexes for analysis, well you can write anything you want to
 use
  the Solr indexes
  from anywhere in the M/R world and have them available from anywhere in
 the
  cluster. There's
  no real need to even have Solr running, you could use the output from
 MRIT
  and access the
  sub-shards with the EmbeddedSolrServer if you wanted, leaving out all the
  pesky servlet
  container stuff.
 
  bq: So why we go for HDFS in the case of analysis if we want to use SolrJ
  for this purpose?
  What is the point?
 
  Scale and data access in a nutshell. In the HDFS world, you can scale
  pretty linearly
  with the number of nodes you can rack together.
 
  Frankly though, if your data set is small enough to fit on a single
 machine
  _and_ you can get
  through your analysis in a reasonable time (reasonable here is up to
 you),
  then HDFS
  is probably not worth the hassle. But in the big data world where we're
  talking petabyte scale,
  having HDFS as the underpinning opens up possibilities for working on
 data
  that were
  difficult/impossible with Solr previously.
 
  Best,
  Erick
 
 
 
  On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian alinazem...@gmail.com
  wrote:
 
   Dear Erick,
   I remembered some times ago, somebody asked about what is the point of
   modify Solr to use HDFS for storing indexes. As far as I remember
  somebody
   told him integrating Solr with HDFS has two advantages. 1) having
 hadoop
   replication and HA. 2) using indexes and Solr documents for other
  purposes
   such as Analysis. So why we go for HDFS in the case of analysis if we
  want
   to use SolrJ for this purpose? What is the point?
   Regards.
  
  
   On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian alinazem...@gmail.com
   wrote:
  
Dear Erick,
Hi,
Thank you for you reply. Yeah I am aware that SolrJ is my last
 option.
  I
was thinking about raw I/O operation. So according to your reply
  probably
it is not applicable somehow. What about the Lily project that
 Michael
mentioned? Is that consider SolrJ too? Are you aware of Cloudera
  search?
   I
know they provide an integrated Hadoop ecosystem. Do you know what is
   their
suggestion?
Best regards.
   
   
   
On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson 
  erickerick...@gmail.com
   
wrote:
   
What you haven't told us is what you mean by modify the
index outside Solr. SolrJ? Using raw Lucene? Trying to modify
things by writing your own codec? Standard Java I/O operations?
Other?
   
You could use SolrJ to connect to an existing Solr server and
both read and modify at will form your M/R jobs. But if you're
thinking of trying to write/modify the segment files by raw I/O
operations, good luck! I'm 99.99% certain that's going to cause
you endless grief.
   
Best,
Erick
   
   
On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian alinazem...@gmail.com
 
wrote:
   
 Actually I am going to do some analysis on the solr data using map
reduce.
 For this 

RE: Data Import handler and join select

2014-08-07 Thread Dyer, James
Alejandro,

You can use a sub-entity with a cache using DIH.  This will solve the 
n+1-select problem and make it run quickly.  Unfortunately, the only built-in 
cache implementation is in-memory so it doesn't scale.  There is a fast, 
disk-backed cache using bdb-je, which I use in production.  See 
https://issues.apache.org/jira/browse/SOLR-2613 .  You will need to build this 
yourself and include it on the classpath, and obtain a copy of bdb-je from 
Oracle.  While bdb-je is open source, its license is incompatible with ASL so 
this will never officially be part of Solr.

Once you have a disk-backed cache, you can specify it on the child entity like 
this:
<entity name="parent" query="select id, ... from parent table">
  <entity
      name="child"
      query="select foreignKey, ... from child_table"
      cacheKey="foreignKey"
      cacheLookup="parent.id"
      processor="SqlEntityProcessor"
      transformer="..."
      cacheImpl="BerkleyBackedCache"
  />
</entity>

If you don't want to go down this path, you can achieve this all with one 
query, if you include an ORDER BY to sort by whatever field is used as Solr's 
uniqueKey, and add a dummy row at the end with a UNION:

SELECT p.uniqueKey, ..., 'A' as lastInd from PRODUCTS p 
INNER JOIN DESCRIPTIONS d ON p.uniqueKey = d.productKey
UNION SELECT 0 as uniqueKey, ... , 'B' as lastInd from dual 
ORDER BY uniqueKey, lastInd

Then your transformer would need to keep the last uniqueKey in an instance 
variable and keep a running map of everything it has seen for that key.  When the 
key changes, or on the last row, send that map as the document.  Otherwise, 
the transformer returns null.  This collects the data from each row seen onto 
one document.
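For illustration, here is a minimal sketch of such an aggregating transformer. The column and field names are made up for this example, and it assumes the query is ordered by uniqueKey with the dummy last row described above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Hypothetical aggregating transformer; column/field names are made up.
// Assumes rows arrive ordered by the uniqueKey column, with a dummy final row.
public class CollapsingTransformer extends Transformer {

  private Object lastKey;                  // uniqueKey of the group being collected
  private Map<String, Object> pending;     // document accumulated so far
  private List<Object> pendingLanguages;   // values for the multivalued "languages" field

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object key = row.get("uniqueKey");

    Map<String, Object> finished = null;
    if (pending != null && !key.equals(lastKey)) {
      finished = pending;                  // key changed: the previous document is complete
      pending = null;
    }

    if (pending == null) {                 // start a new document for this key
      pending = new HashMap<String, Object>(row);
      pendingLanguages = new ArrayList<Object>();
      pending.put("languages", pendingLanguages);
    }

    Object language = row.get("LANGUAGE"); // fold this row's per-language columns in
    if (language != null) {
      pendingLanguages.add(language);
      pending.put("description_" + language, row.get("DESCRIPTION"));
    }
    lastKey = key;

    // Emit the completed document when the key changes (the dummy last row from the
    // UNION flushes the final real document); returning null tells DIH to skip the row.
    return finished;
  }
}

Because transformRow only returns a value when the key changes, each product ends up as a single document with all of its languages folded in.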

Keep in mind also, that in a lot of cases like this, it might just be easiest 
to write a program that uses solrj to send your documents rather than trying to 
make DIH's features fit your use-case.  
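For completeness, the SolrJ route is only a few lines. A bare-bones sketch against the 4.x SolrJ API (the URL and field names here are assumptions for your schema) could look like:

import java.util.Arrays;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexProducts {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr URL; adjust to your core/collection.
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    doc.addField("name", "Product");
    doc.addField("brand", "Brand");
    doc.addField("price", 10);
    for (String lang : Arrays.asList("es", "en")) {  // multivalued languages field
      doc.addField("languages", lang);
    }
    doc.addField("description_es", "Descripción en español");
    doc.addField("description_en", "English description");

    solr.add(doc);   // build one document per product from your join results
    solr.commit();
    solr.shutdown();
  }
}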

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com] 
Sent: Thursday, August 07, 2014 1:43 AM
To: solr-user@lucene.apache.org
Subject: Data Import handler and join select

Hi,

I have a problem when indexing with the Data Import Handler while doing a
join select. I have two tables, one with products and another one with
descriptions for each product in several languages.

So it would be:

Products: ID, NAME, BRAND, PRICE, ...
Descriptions: ID, LANGUAGE, DESCRIPTION

I would like to have every product indexed as a document with a multivalued
field "language" which contains every language that has an associated
description, and several dynamic fields description_*, one for each language.

So it would be for example:

Id: 1
Name: Product
Brand: Brand
Price: 10
Languages: [es,en]
Description_es: Descripción en español
Description_en: English description

Our first approach was using sub-entities for the data import handler and
after implementing some transformers we had everything indexed as we
wanted. The sub-entity process added the descriptions for each language to
the solr document and then indexed them.

The problem was performance. I've read that using sub-entities affects
performance greatly, so we changed our process to use a join
instead.

Performance was greatly improved this way, but now we have a problem. Each
time a row is processed a Solr document is generated and indexed into Solr,
but the data is not merged with any previously indexed data; it replaces it.

If we had the previous example the query resulting from the join would be:

Id - Name - Brand - Price - Language - Description
1 - Product - Brand - 10 - es - Descripción en español
1 - Product - Brand - 10 - en - English description

So when indexing, since both rows have the same id, the only information I
end up with is from the second row.

Is there any way for data import handler to manage this and allow the
documents to be indexed updating any previous data?

Thanks in advance



-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Disabling transaction logs

2014-08-07 Thread KNitin
Hello

I am using Solr 4.6.1 with over 1000 collections and 8 nodes. Restarting
nodes takes a long time (especially if we have indexing running against them).
I want to see if disabling transaction logs can help with a more robust
restart. However, I can't find any docs about disabling transaction logs in SolrCloud.

Can anyone help with info on how to disable transaction logs?


Thanks
Nitin


Re: Character encoding problems

2014-08-07 Thread Chris Hostetter

It's not clear to me from any of the comments you've made in this thread 
whether you've ever confirmed *exactly* what you are getting back from 
Solr, ignoring the PHP completely. (ie: you refer to UTF-8 for all of the 
web pages, suggesting you are only looking at some web application which 
is consuming data from Solr.)

What do you see when you use something like curl to talk to solr directly 
and inspect the raw bytes (in both directions) ?

For example...

$ echo '[{"id":"HOSS","fr_s":"téléphone"}]' > french.json
$ # sanity check that my shell didn't bork the utf8
$ cat french.json | uniname -ap
character  byte   UTF-32   encoded as glyph   name
   23 23  E9   C3 A9  é  LATIN SMALL LETTER E WITH 
ACUTE
   25 26  E9   C3 A9  é  LATIN SMALL LETTER E WITH 
ACUTE
$ curl -sS -X POST 'http://localhost:8983/solr/collection1/update?commit=true' 
-H 'Content-Type: application/json' -d @french.json 
{"responseHeader":{"status":0,"QTime":445}}
$ curl -sS 
'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&omitHeader=true&indent=true'
{
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"HOSS",
        "fr_s":"téléphone",
        "_version_":1475795659384684544}]
  }}
$ curl -sS 
'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&omitHeader=true&indent=true'
 | uniname -ap
character  byte   UTF-32   encoded as glyph   name
   94 94  E9   C3 A9  é  LATIN SMALL LETTER E WITH 
ACUTE
   96 97  E9   C3 A9  é  LATIN SMALL LETTER E WITH 
ACUTE



One other cool diagnostic trick you can use, if the data coming back 
over the wire is definitely no longer UTF-8, is to leverage the python 
response writer, because it generates \uXXXX escape sequences for 
non-ASCII strings at the Solr level -- if those are correct, that helps 
you clearly identify that it's the HTTP layer where your values are 
getting corrupted...

$ curl -sS 
'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=python&omitHeader=true&indent=true'
{
  'response':{'numFound':1,'start':0,'docs':[
  {
'id':'HOSS',
'fr_s':u't\u00e9l\u00e9phone',
'_version_':1475795807492898816}]
  }}


-Hoss
http://www.lucidworks.com/

Re: Disabling transaction logs

2014-08-07 Thread Anshum Gupta
Hi Nitin,

To answer your question first: yes, you can disable the transaction log by
commenting out or removing the <updateLog> section of solrconfig.xml.

At the same time, I'd highly recommend not disabling transaction logs. They
are needed for NRT, peer sync, high availability/disaster recovery parts of
SolrCloud i.e. a lot of what makes SolrCloud depends on these logs. When
you say you want a robust restart, I think that is what you're getting
right now. If you mean to make the entire process faster, read the post
below and you should be in a much better position.

Here's a writeup by Erick Erickson on soft/hard commits and transaction logs
in Solr that will help you understand this better:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


On Thu, Aug 7, 2014 at 9:12 AM, KNitin nitin.t...@gmail.com wrote:

 Hello

 I am using Solr 4.6.1 with over 1000 collections and 8 nodes. Restarting
 nodes takes a long time (especially if we have indexing running against them).
 I want to see if disabling transaction logs can help with a more robust
 restart. However, I can't find any docs about disabling transaction logs in
 SolrCloud.

 Can anyone help with info on how to disable transaction logs?


 Thanks
 Nitin




-- 

Anshum Gupta
http://www.anshumgupta.net


Change order of spell checker suggestions issue

2014-08-07 Thread Corey Gerhardt
Solr Rev: 4.6 Lucidworks: 2.6.3

This is sort of a repeat question, sorry.

In the solrconfig.xml, will changing the value for the comparatorClass affect 
the sort of suggestions returned?

This is my spellcheck component:
<searchComponent class="com.lucid.spellchecking.LucidSpellCheckComponent" name="spellcheck">
  <lst name="defaults">
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">5</str>
  </lst>

  <str name="queryAnalyzerFieldType">textSpell</str>

  <lst name="spellchecker">
    <str name="classname">org.apache.solr.spelling.DirectSolrSpellChecker</str>
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <str name="comparatorClass">score</str>
    <float name="thresholdTokenFrequency">1</float>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>

Searching for "unie" produces the following suggestions, but they appear to me to be 
ordered by frequency (I've indicated the Levenshtein distance in []):

<lst>
  <str name="word">unity</str>   [3]
  <int name="freq">1200</int>
</lst>
<lst>
  <str name="word">unger</str>   [3]
  <int name="freq">119</int>
</lst>
<lst>
  <str name="word">unick</str>   [3]
  <int name="freq">16</int>
</lst>
<lst>
  <str name="word">united</str>  [4]
  <int name="freq">16</int>
</lst>
<lst>
  <str name="word">unique</str>  [4]
  <int name="freq">10</int>
</lst>
<lst>
  <str name="word">unity</str>   [3]
  <int name="freq">7</int>
</lst>
<lst>
  <str name="word">unser</str>   [3]
  <int name="freq">7</int>
</lst>
<lst>
  <str name="word">unyi</str>    [2]
  <int name="freq">7</int>
</lst>

Is something configured incorrectly or am I just needing more coffee?


Re: solr over hdfs for accessing/ changing indexes outside solr

2014-08-07 Thread Ali Nazemian
Dear Erick,
Could you please name those problems that SolrCloud cannot tackle on its own?
Maybe I need SolrCloud + Hadoop and I am not aware of it yet.
Regards.


On Thu, Aug 7, 2014 at 7:37 PM, Erick Erickson erickerick...@gmail.com
wrote:

 If SolrCloud meets your needs, without Hadoop, then
 there's no real reason to introduce the added complexity.

 There are a bunch of problems that do _not_ work
 well with SolrCloud over non-Hadoop file systems. For
 those problems, the combination of SolrCloud and Hadoop
 make tackling them possible.

 Best,
 Erick


 On Thu, Aug 7, 2014 at 3:55 AM, Ali Nazemian alinazem...@gmail.com
 wrote:

  Thank you very much. But why we should go for solr distributed with
 hadoop?
  There is already solrCloud which is pretty applicable in the case of big
  index. Is there any advantage for sending indexes over map reduce that
  solrCloud can not provide?
  Regards.
 
 
  On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   bq: Are you aware of Cloudera search? I know they provide an integrated
   Hadoop ecosystem.
  
   What Cloudera Search does via the MapReduceIndexerTool (MRIT) is
 create N
   sub-indexes for
   each shard in the M/R paradigm via EmbeddedSolrServer. Eventually,
 these
   sub-indexes for
   each shard are merged (perhaps through some number of levels) in the
  reduce
   phase and
   maybe merged into a live Solr instance (--go-live). You'll note that
 this
   tool requires the
   address of the ZK ensemble from which it can get the network topology,
   configuration files,
   all that rot. If you don't use the --go-live option, the output is
 still
  a
   Solr index, it's just that
   the index for each shard is left in a specific directory on HDFS. Being
  on
   HDFS allows
   this kind of M/R paradigm for massively parallel indexing operations,
 and
   perhaps massively
   complex analysis.
  
   Nowhere is there any low-level non-Solr manipulation of the indexes.
  
   The Flume fork just writes directly to the Solr nodes. It knows about
 the
   ZooKeeper
   ensemble and the collection too and communicates via SolrJ I'm pretty
  sure.
  
   As far as integrating with HDFS, you're right, HA is part of the
 package.
   As far as using
   the Solr indexes for analysis, well you can write anything you want to
  use
   the Solr indexes
   from anywhere in the M/R world and have them available from anywhere in
  the
   cluster. There's
   no real need to even have Solr running, you could use the output from
  MRIT
   and access the
   sub-shards with the EmbeddedSolrServer if you wanted, leaving out all
 the
   pesky servlet
   container stuff.
  
   bq: So why we go for HDFS in the case of analysis if we want to use
 SolrJ
   for this purpose?
   What is the point?
  
   Scale and data access in a nutshell. In the HDFS world, you can scale
   pretty linearly
   with the number of nodes you can rack together.
  
   Frankly though, if your data set is small enough to fit on a single
  machine
   _and_ you can get
   through your analysis in a reasonable time (reasonable here is up to
  you),
   then HDFS
   is probably not worth the hassle. But in the big data world where we're
   talking petabyte scale,
   having HDFS as the underpinning opens up possibilities for working on
  data
   that were
   difficult/impossible with Solr previously.
  
   Best,
   Erick
  
  
  
   On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian alinazem...@gmail.com
   wrote:
  
Dear Erick,
I remembered some times ago, somebody asked about what is the point
 of
modify Solr to use HDFS for storing indexes. As far as I remember
   somebody
told him integrating Solr with HDFS has two advantages. 1) having
  hadoop
replication and HA. 2) using indexes and Solr documents for other
   purposes
such as Analysis. So why we go for HDFS in the case of analysis if we
   want
to use SolrJ for this purpose? What is the point?
Regards.
   
   
On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian alinazem...@gmail.com
wrote:
   
 Dear Erick,
 Hi,
 Thank you for you reply. Yeah I am aware that SolrJ is my last
  option.
   I
 was thinking about raw I/O operation. So according to your reply
   probably
 it is not applicable somehow. What about the Lily project that
  Michael
 mentioned? Is that consider SolrJ too? Are you aware of Cloudera
   search?
I
 know they provide an integrated Hadoop ecosystem. Do you know what
 is
their
 suggestion?
 Best regards.



 On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson 
   erickerick...@gmail.com

 wrote:

 What you haven't told us is what you mean by modify the
 index outside Solr. SolrJ? Using raw Lucene? Trying to modify
 things by writing your own codec? Standard Java I/O operations?
 Other?

 You could use SolrJ to connect to an existing Solr server and
 both read and modify at will form your M/R 

RE: Change order of spell checker suggestions issue

2014-08-07 Thread Dyer, James
Corey,

Looking more carefully at your responses than I did last time I answered this 
question, it looks like every correction is 2 edits in this example.  

unie > unity  (e>t, insert y)
unie > unger  (i>g, insert r)
unie > unick  (e>c, insert k)
unie > united (insert t, insert d)
unie > unique (insert q, insert u)
unie > unity  (e>t, insert y)
unie > unser  (i>s, insert r)
unie > unyi   (i>y, e>i)

So both "score" and "freq" will give it to you by frequency.  Usually when I'm 
in doubt of something like this working like it should, I try to come up with 
more than one clear-cut example.
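If you want to sanity-check the raw edit distances yourself (independent of Solr's internal, length-normalized similarity), a plain dynamic-programming Levenshtein is enough; a quick sketch:

public class EditDistance {
  // Classic dynamic-programming Levenshtein distance (insert/delete/substitute = 1 edit each).
  static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
      }
    }
    return d[a.length()][b.length()];
  }

  public static void main(String[] args) {
    for (String s : new String[] {"unity", "unger", "unick", "united", "unique", "unser", "unyi"}) {
      System.out.println("unie -> " + s + " : " + levenshtein("unie", s));
    }
  }
}

Every suggestion above comes out at a raw distance of 2, which is the point: with the edit counts tied, frequency decides the order either way.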

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Corey Gerhardt [mailto:corey.gerha...@directwest.com] 
Sent: Thursday, August 07, 2014 11:31 AM
To: Solr User List
Subject: Change order of spell checker suggestions issue

Solr Rev: 4.6 Lucidworks: 2.6.3

This is sort of a repeat question, sorry.

In the solrconfig.xml, will changing the value for the comparatorClass affect 
the sort of suggestions returned?

This is my spellcheck component:
<searchComponent class="com.lucid.spellchecking.LucidSpellCheckComponent" name="spellcheck">
  <lst name="defaults">
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">5</str>
  </lst>

  <str name="queryAnalyzerFieldType">textSpell</str>

  <lst name="spellchecker">
    <str name="classname">org.apache.solr.spelling.DirectSolrSpellChecker</str>
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <str name="comparatorClass">score</str>
    <float name="thresholdTokenFrequency">1</float>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>

Searching for "unie" produces the following suggestions, but they appear to me to be 
ordered by frequency (I've indicated the Levenshtein distance in []):

<lst>
  <str name="word">unity</str>   [3]
  <int name="freq">1200</int>
</lst>
<lst>
  <str name="word">unger</str>   [3]
  <int name="freq">119</int>
</lst>
<lst>
  <str name="word">unick</str>   [3]
  <int name="freq">16</int>
</lst>
<lst>
  <str name="word">united</str>  [4]
  <int name="freq">16</int>
</lst>
<lst>
  <str name="word">unique</str>  [4]
  <int name="freq">10</int>
</lst>
<lst>
  <str name="word">unity</str>   [3]
  <int name="freq">7</int>
</lst>
<lst>
  <str name="word">unser</str>   [3]
  <int name="freq">7</int>
</lst>
<lst>
  <str name="word">unyi</str>    [2]
  <int name="freq">7</int>
</lst>

Is something configured incorrectly or am I just needing more coffee?



Wrong XSLT used in translation

2014-08-07 Thread Christopher Gross
Solr 4.1, in SolrCloud mode.  3 nodes configured, running in Tomcat 7 with
Java 7.

I have a few cores set up; let's just call them A, B, C and D.  They have
some uniquely named XSLT files, but they all have an rss.xsl file.

Sometimes, on just 1 of the nodes, if I do a query for something in A and
translate it with the rss.xsl, it will do the query just fine and give the
right number of results (solr logged the query and had it going to the
correct core), but it uses B or C's rss.xsl.  Since the schemas are
different, the xml is mostly empty.  A refresh will have it go back to
using the correct rss.xsl.

Has anyone run into a problem like this?  Is it a problem with the 4.1
Solr?  Will upgrading fix it?

Is it a better practice to uniquely name the xslt files for each core
(having a-rss.xsl, b-rss.xsl, etc)?

Any help/thoughts would be appreciated.

-- Chris


Re: Wrong XSLT used in translation

2014-08-07 Thread Shawn Heisey
On 8/7/2014 1:46 PM, Christopher Gross wrote:
 Solr 4.1, in SolrCloud mode.  3 nodes configured, Running in Tomcat 7 w/
 Java 7.

 I have a few cores set up, let's just call them A, B, C and D.   They have
 some uniquely named xslt files, but they all have a rss.xsl file.

 Sometimes, on just 1 of the nodes, if I do a query for something in A and
 translate it with the rss.xsl, it will do the query just fine and give the
 right number of results (solr logged the query and had it going to the
 correct core), but it uses B or C's rss.xsl.  Since the schemas are
 different, the xml is mostly empty.  A refresh will have it go back to
 using the correct rss.xsl.

 Has anyone run into a problem like this?  Is it a problem with the 4.1
 Solr?  Will upgrading fix it?

 Is it a better practice to uniquely name the xslt files for each core
 (having a-rss.xsl, b-rss.xsl, etc)?

I wonder if Solr might have a bug with XSLT caching, where the cache is
global and simply looks at the base filename, not the full path.  If it
works when you use xsl files with different names, then that is the most
likely problem.
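To illustrate the kind of keying mistake being hypothesized here (purely a sketch, not Solr's actual code): a cache keyed by the bare file name cannot tell two cores' rss.xsl apart, while a key that includes the core name can.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only -- shows why a cache keyed by the bare file name would hand
// core A's request the stylesheet that was compiled for core B.
class StylesheetCache {
  private final Map<String, String> cache = new ConcurrentHashMap<String, String>();

  String get(String coreName, String fileName) {
    String key = fileName;   // buggy keying: ignores which core asked
    // A per-core key such as coreName + "/" + fileName would keep the cores apart.
    String compiled = cache.get(key);
    if (compiled == null) {
      compiled = "stylesheet compiled from " + coreName + "/conf/xslt/" + fileName;
      cache.put(key, compiled);
    }
    return compiled;
  }
}

class StylesheetCacheDemo {
  public static void main(String[] args) {
    StylesheetCache cache = new StylesheetCache();
    System.out.println(cache.get("coreA", "rss.xsl")); // compiles and caches core A's copy
    System.out.println(cache.get("coreB", "rss.xsl")); // returns core A's copy -- wrong core
  }
}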

If you determine that the bug I mentioned is what's happening, before
filing a bug in Jira, we need to determine whether it's still a problem
in the latest version.  Version 4.1 came out in January 2013.  Upgrading
is definitely advised, if you can do it.

Thanks,
Shawn



Re: Anybody uses Solr JMX?

2014-08-07 Thread Otis Gospodnetic
Hi Paul,

There are lots of people/companies using SPM for Solr/SolrCloud, and I don't
recall anyone saying the SPM agent collecting metrics via JMX had a negative
impact on Solr performance.  That said, some people really dislike JMX, and
some open source projects choose to expose metrics via custom stats APIs or
even files.
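For anyone who wants to poke at those metrics by hand, a bare-bones client using only the standard javax.management API looks roughly like the sketch below; the port and the MBean domain pattern are assumptions that depend on your container's JMX flags and your core names.

import java.util.Set;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SolrJmxDump {
  public static void main(String[] args) throws Exception {
    // Hypothetical remote JMX port; it depends on the com.sun.management.jmxremote.*
    // flags your servlet container was started with.
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url, null);
    try {
      MBeanServerConnection mbeans = connector.getMBeanServerConnection();
      // Assumption: Solr registers its MBeans under domains that start with "solr".
      Set<ObjectName> names = mbeans.queryNames(new ObjectName("solr*:*"), null);
      for (ObjectName name : names) {
        System.out.println(name);   // one line per registered Solr MBean
      }
    } finally {
      connector.close();
    }
  }
}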

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Aug 6, 2014 at 11:18 PM, Paul Libbrecht p...@hoplahup.net wrote:

 Hello Otis,

 this looks like an excellent idea!
 I'm in need of that, erm… last week and probably this one too.

 Is there not a risk that reading certain JMX properties actually hogs the
 process? (or is it by design that MBeans are supposed to be read without
 any lock effect?).

 thanks for the hint.

 paul



 On 6 mai 2014, at 04:43, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

  Alexandre, you could use something like
  http://blog.sematext.com/2012/09/25/new-tool-jmxc-jmx-console/ to
 quickly
  dump everything out of JMX and see if there is anything there Solr Admin
 UI
  doesn't expose.  I think you'll find there is more in JMX than Solr Admin
  UI shows.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Mon, May 5, 2014 at 1:56 AM, Alexandre Rafalovitch 
 arafa...@gmail.comwrote:
 
  Thank you everybody for the links and explanations.
 
  I am still curious whether JMX exposes more details than the Admin UI?
  I am thinking of a troubleshooting context, rather than long-term
  monitoring one.
 
  Regards,
Alex.
  Personal website: http://www.outerthoughts.com/
  Current project: http://www.solr-start.com/ - Accelerating your Solr
  proficiency
 
 
  On Mon, May 5, 2014 at 12:21 PM, Gora Mohanty g...@mimirtech.com
 wrote:
  On May 5, 2014 7:09 AM, Alexandre Rafalovitch arafa...@gmail.com
  wrote:
 
  I have religiously kept jmx statement in my solrconfig.xml, thinking
  it was enabling the web interface statistics output.
 
  But looking at the server logs really closely, I can see that JMX is
  actually disabled without server present. And the Admin UI does not
  actually seem to care after a quick test.
 
  Does anybody have a real experience with Solr JMX? Does it expose more
  information than Admin UI's Plugins/Stats page? Is it good for
 
 
  Have not been using JMX lately, but we were using it in the past. It
 does
  allow monitoring many useful details. As others have commented, it also
  integrates well with other monitoring  tools as JMX is a standard.
 
  Regards,
  Gora
 




Re: Anybody uses Solr JMX?

2014-08-07 Thread rulinma
useful.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Anybody-uses-Solr-JMX-tp4134598p4151820.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to sync lib directory in SolrCloud?

2014-08-07 Thread rulinma
mark.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-sync-lib-directory-in-SolrCloud-tp4150405p4151821.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: why solr commit with serval docs

2014-08-07 Thread rulinma
code error by my colleague.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-solr-commit-with-serval-docs-tp4150583p4151822.html
Sent from the Solr - User mailing list archive at Nabble.com.