Query regarding Solr search

2012-10-29 Thread Leena Jawale
Hi,

I have created a Solr XML data source, and I am working on a less-than
query against it. I tried q=SerialNo:[* TO 500], but it is showing records
having SerialNo=1000. Could you help me with this?
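
For reference, a minimal SolrJ sketch of the query in question (host, core,
and field type are assumptions, not from the original mail). If SerialNo is
a string field, the range is compared lexicographically, so "1000" sorts
before "500" and falls inside [* TO 500]; declaring the field with a numeric
type (e.g. "int"/"tint") makes the range behave numerically.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RangeQueryCheck {
    public static void main(String[] args) throws Exception {
        // assumption: a Solr 4.x server on localhost with a default core
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        // a numeric range query; on a string field this is lexicographic
        SolrQuery q = new SolrQuery("SerialNo:[* TO 500]");
        QueryResponse rsp = server.query(q);
        System.out.println("numFound: " + rsp.getResults().getNumFound());
        server.shutdown();
    }
}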


Thanks,
Leena Jawale




Re: row.get() in script transformer adds square brackets [] to string value

2012-10-29 Thread Jack Krupansky

Sounds like it is multivalued - the square brackets indicate an array.
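
If that diagnosis is right, a tiny Java sketch of why this happens: for a
multivalued column, the DIH row map holds a java.util.List, and
List.toString() adds the brackets (the value below is the one from the
question):

import java.util.Arrays;
import java.util.List;

public class BracketDemo {
    public static void main(String[] args) {
        // what row.get("ProductName") returns for a multivalued column
        Object value = Arrays.asList("1969 Harley Davidson Ultimate Chopper");
        System.out.println(value); // [1969 Harley Davidson Ultimate Chopper]
        // take the first element instead of stringifying the whole list
        System.out.println(((List<?>) value).get(0)); // 1969 Harley Davidson Ultimate Chopper
    }
}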

-- Jack Krupansky

-Original Message- 
From: Radek Zajkowski

Sent: Monday, October 29, 2012 8:37 PM
To: solr-user@lucene.apache.org
Subject: row.get() in script transformer adds square brackets [] to string 
value


Hi all,

Would you know why I get (note the square brackets)

[1969 Harley Davidson Ultimate Chopper]

not

1969 Harley Davidson Ultimate Chopper

when calling

var description = row.get("ProductName").toString();

in a script transformer?

Thank you,

Radek. 



Re: Urgent Help Needed: Solr Data import problem

2012-10-29 Thread Amit Nithian
This looks like a MySQL permissions problem and not a Solr problem.
"Caused by: java.sql.SQLException: Access denied for user
'readonly'@'10.86.29.32'
(using password: NO)"

I'd advise reading your stack traces a bit more carefully. You should
check your permissions or if you don't own the DB, check with your DBA
to find out what user you should use to access your DB.
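
For what it's worth, a minimal JDBC sketch to reproduce the failure outside
Solr (the URL, driver, and user are taken from the trace below; the password
argument is hypothetical -- "(using password: NO)" suggests none was set in
data-config.xml):

import java.sql.Connection;
import java.sql.DriverManager;

public class JdbcCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection c = DriverManager.getConnection(
                "jdbc:mysql://172.16.37.160:3306/hpcms_db_new",
                "readonly", "yourPasswordHere"); // hypothetical password
        System.out.println("connected: " + !c.isClosed());
        c.close();
    }
}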

- Amit

On Mon, Oct 29, 2012 at 9:38 PM, kunal sachdeva
 wrote:
> Hi,
>
> I have tried using data-import on my local system and was able to execute
> it properly, but when I tried it on the Unix server I got the following
> error:
>
>
> INFO: Starting Full Import
> Oct 30, 2012 9:40:49 AM
> org.apache.solr.handler.dataimport.SimplePropertiesWriter
> readIndexerProperties
> WARNING: Unable to read: dataimport.properties
> Oct 30, 2012 9:40:49 AM org.apache.solr.update.DirectUpdateHandler2
> deleteAll
> INFO: [core0] REMOVING ALL DOCUMENTS FROM INDEX
> Oct 30, 2012 9:40:49 AM org.apache.solr.core.SolrDeletionPolicy onInit
> INFO: SolrDeletionPolicy.onInit: commits:num=1
>
> commit{dir=/opt/testsolr/multicore/core0/data/index,segFN=segments_1,version=1351490646879,generation=1,filenames=[segments_1]
> Oct 30, 2012 9:40:49 AM org.apache.solr.core.SolrDeletionPolicy
> updateCommits
> INFO: newest commit = 1351490646879
> Oct 30, 2012 9:40:49 AM org.apache.solr.handler.dataimport.JdbcDataSource$1
> call
> INFO: Creating a connection for entity destination with URL: jdbc:mysql://
> 172.16.37.160:3306/hpcms_db_new
> Oct 30, 2012 9:40:50 AM org.apache.solr.common.SolrException log
> SEVERE: Exception while processing: destination document :
> SolrInputDocument[{}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Unable to execute query: select name,id from hp_city Processing Document # 1
> at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
> at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
> at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
> at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
> Caused by: java.lang.RuntimeException:
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
> execute query: select name,id from hp_city Processing Document # 1
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
> at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
> at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
> ... 3 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
> Unable to execute query: select name,id from hp_city Processing Document # 1
> at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
> at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
> at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
> at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
> at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
> ... 5 more
> Caused by: java.sql.SQLException: Access denied for user
> 'readonly'@'10.86.29.32'
> (using password: NO)
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3491)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3423)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:910)
> at com.mysql.jdbc.MysqlIO.secureAuth411(MysqlIO.java:3923)
> at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1273)
> at
> com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2031)
> at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:718)
> at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:46)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccesso

Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Nagendra Nagarajayya

Thanks Michael for the feedback. Will take a look at this ...

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org


On 10/29/2012 9:17 AM, Michael Della Bitta wrote:

As an external observer, I think the main problem is your branding.
"Realtime Near Realtime" is definitely an oxymoron, and your ranking
algorithm is called "Ranking Algorithm," which is generic enough to
suggest that a. it's the only ranking algorithm available, and b. by
implication, that Solr doesn't have one built in.

I would suggest two improvements:

1. Come up with a top-level name for your overall efforts. The Apache
Foundation has 'Apache,' which gives automatic branding to every component
they build. Then your ranking algorithm could be called "Tgels Ranking
Algorithm for Apache Solr" (for example), which is totally legit. And
"Tgels Realtime Search for Apache Solr."

2. Maybe point out that you're building on top of the work of the
Apache Solr and Lucene projects a little more prominently.

I think with those two little tweaks, you'd actually very easily get
more people interested in your contributions.

Just my two cents,

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Mon, Oct 29, 2012 at 11:35 AM, Nagendra Nagarajayya
 wrote:

Jack:

I respect your hard work responding to user problems on the mailing list,
so it would be nicer to try out Realtime NRT before passing rogue comments
on whether a contribution is legit, spam, or a scam... I guess that only
illuminates one's own narrow-minded view ... The spirit of open source is
contributions not only from committers but from other developers; from the
Solr wiki: "A half-baked patch in Jira, with no documentation, no tests and
no backwards compatibility is better than no patch at all."

You would gain more respect if you actually downloaded realtime-nrt,
checked whether it does provide a view of a realtime index compared to a
point-in-time snapshot, saw if you can understand the code, and provided
clarity and feedback to the list if you do find problems with it.
realtime-nrt offers search capability, as opposed to realtime-get. Check
whether this is true ... I would really welcome your comments on the list
or through the JIRA here:

https://issues.apache.org/jira/browse/SOLR-3816


Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 10/29/2012 7:30 AM, Jack Krupansky wrote:

Could any of the committers here confirm whether this is a legitimate
effort? I mean, how could anything labeled "Apache ABC with XYZ" be an
"external project" and be sanctioned/licensed by Apache? In fact, the linked
web page doesn't even acknowledge the ownership of the Apache trademarks or
ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT
makes available a near realtime view". Equally nonsensical. Who knows, maybe
it is legit, but it sure comes across as a scam/spam.

-- Jack Krupansky

-Original Message- From: Nagendra Nagarajayya
Sent: Monday, October 29, 2012 10:06 AM
To: solr-user@lucene.apache.org
Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and
Realtime NRT available for download

Hi!

I am very excited to announce the availability of Apache Solr 4.0 with
RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high-performance
and more granular NRT implementation compared to soft commit. The update
performance is about 70,000 documents/sec* (almost 1.5-2x the performance
of soft commit). You can also scale up to 2 billion documents* in a single
core, and query a half-billion-document index in milliseconds**. Realtime
NRT is different from realtime-get: realtime-get does not have search
capability and is a lookup by id, while Realtime NRT allows full search;
see here for more info.

Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

RankingAlgorithm 1.4.4 supports the entire Lucene query syntax, ± and/or
boolean/dismax/boost queries, and is compatible with the new Lucene 4.0
API.

You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
and Realtime NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

* performance figures are from a real use case of Apache Solr with
RankingAlgorithm, as seen at a user installation
** performance seen when using the age feature



Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Nagendra Nagarajayya
"Realtime" comes from the tag used to enable the functionality in
solrconfig.xml. "NRT" is used as an acronym, as in radar/laser/JPEG/CD-ROM,
etc. NRT is so well known that I did not expect it to be expanded to its
full form, which does make it oxymoronic ...


Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 10/29/2012 7:44 AM, Darren Govoni wrote:
It certainly seems to be a rogue project, but I can't understand the
meaning of "realtime near realtime (NRT)" either. At best, it's oxymoronic.



On 10/29/2012 10:30 AM, Jack Krupansky wrote:
Could any of the committers here confirm whether this is a legitimate 
effort? I mean, how could anything labeled "Apache ABC with XYZ" be 
an "external project" and be sanctioned/licensed by Apache? In fact, 
the linked web page doesn't even acknowledge the ownership of the 
Apache trademarks or ASL. And the term "Realtime NRT" is nonsensical. 
Even worse: "Realtime NRT makes available a near realtime view". 
Equally nonsensical. Who knows, maybe it is legit, but it sure comes 
across as a scam/spam.


-- Jack Krupansky

-Original Message- From: Nagendra Nagarajayya
Sent: Monday, October 29, 2012 10:06 AM
To: solr-user@lucene.apache.org
Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and 
Realtime NRT available for download


Hi!

I am very excited to announce the availability of Apache Solr 4.0 with
RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high-performance
and more granular NRT implementation compared to soft commit. The update
performance is about 70,000 documents/sec* (almost 1.5-2x the performance
of soft commit). You can also scale up to 2 billion documents* in a single
core, and query a half-billion-document index in milliseconds**. Realtime
NRT is different from realtime-get: realtime-get does not have search
capability and is a lookup by id, while Realtime NRT allows full search;
see here for more info.

Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

RankingAlgorithm 1.4.4 supports the entire Lucene query syntax, ± and/or
boolean/dismax/boost queries, and is compatible with the new Lucene 4.0
API.


You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
and Realtime NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

* performance figures are from a real use case of Apache Solr with
RankingAlgorithm, as seen at a user installation
** performance seen when using the age feature



using ConcurrentUpdateSolrServer and CloudSolrServer to add documents

2012-10-29 Thread wshao
Hi there,

I am new to Solr and trying to use MapReduce to index on 4.0. Per online
suggestions, I tried both ConcurrentUpdateSolrServer and CloudSolrServer.

For ConcurrentUpdateSolrServer, I did this:
in setup:
int taskId = context.getTaskAttemptID().getTaskID().getId();
int serverId = taskId % 5 + 5; // rotate the shard ID using the MapReduce task ID
String url = "http://solr" + serverId + ":8983/solr/core0";
logger.info("using " + url);
server = new ConcurrentUpdateSolrServer(url, 1000, 1);

in reduce:
do add the documents

in cleanup:
server.commit();

I run 5 reducers on 4 million documents; the log shows each reducer
calling one Solr node, so there should be no race condition there. It took
about 10 minutes for the whole job to finish. However, I lost 20% of the
documents in the index and only got about 3.2 million documents overall.
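
A hedged guess at where such loss can hide: ConcurrentUpdateSolrServer
buffers adds and sends them on background threads, and its default
handleError() only logs, so failed batches can vanish silently. A sketch of
making the errors visible (assuming SolrJ 4.0):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

public class LoudConcurrentUpdateServer extends ConcurrentUpdateSolrServer {
    public LoudConcurrentUpdateServer(String url, int queueSize, int threads) {
        super(url, queueSize, threads);
    }

    @Override
    public void handleError(Throwable ex) {
        // surface failures from the background update threads instead of
        // only logging them, so the MapReduce job can fail or retry
        System.err.println("background update failed: " + ex);
    }
}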

For CloudSolrServer, I did this:
in setup:
try {
 server = new CloudSolrServer("solr:9983");
 server.setDefaultCollection("core0");
} catch (MalformedURLException e) {
 logger.error(e);
}

in reduce:
do add the documents

in cleanup:
server.commit();

With this one, it took 1 hour for 4 million documents, but I do get all
the documents in the index.

Time-wise, I much prefer ConcurrentUpdateSolrServer; however, I cannot
accept that so many documents are lost in the process. But as a Solr
newbie, I might be missing something obvious here and I don't know what it
is. Could someone tell me? Thanks!





Re: SOLR 4 / Tomcat Startup Error: java.lang.NoClassDefFoundError: org/apache/lucene/codecs/sep/IntStreamFactory

2012-10-29 Thread Shawn Heisey

On 10/29/2012 5:28 PM, vybe3142 wrote:

I could well be doing something wrong here, but so far I haven't figured it
out. I currently run Solr 4 BETA / multicore, and I was investigating
migrating to Solr 4.0 (on my workstation).

I've even backed out my custom schema and solrconfig so I'm running as close
to original as possible with no custom handlers etc.

Any idea which jar I'm missing? I should be referencing all the jars
provided in the 4.0 tar file.


That particular class is part of lucene-codecs.  This file is not part 
of the Solr 4.0 binary release, and is not generated by "ant dist" in 
the source tree under solr/.  The only way I know of to get it from the 
source is by doing "ant generate-maven-artifacts" in the root of the 
source tree.


You should not need it, however, unless you are attempting to change 
lucene codecs (postingsFormat) in your schema.  If this is what you are 
trying to do, you can ignore the following:


I am guessing that you still have leftover bits in your tomcat's 
deployment directory from the 4.0 beta .war file.  If that's the case, 
you can fix it by stopping tomcat, erasing everything in the deployment 
directory, and starting tomcat back up.  It will redeploy from the new 
.war file.


Thanks,
Shawn



row.get() in script transformer adds square brackets [] to string value

2012-10-29 Thread Radek Zajkowski
Hi all,

Would you know why I get (note the square brackets)

[1969 Harley Davidson Ultimate Chopper]

not

1969 Harley Davidson Ultimate Chopper

when calling

var description = row.get("ProductName").toString();

in a script transformer?

Thank you,

Radek.


SOLR 4 / Tomcat Startup Error: java.lang.NoClassDefFoundError: org/apache/lucene/codecs/sep/IntStreamFactory

2012-10-29 Thread vybe3142
I could well be doing something wrong here, but so far I haven't figured it
out. I currently run Solr 4 BETA / multicore, and I was investigating
migrating to Solr 4.0 (on my workstation).

I've even backed out my custom schema and solrconfig so I'm running as close
to original as possible with no custom handlers etc.

Any idea which jar I'm missing? I should be referencing all the jars
provided in the 4.0 tar file.

Thanks



Re: throttle segment merging

2012-10-29 Thread Radim Kolar

On 29.10.2012 12:18, Michael McCandless wrote:

With Lucene 4.0, FSDirectory now supports merge bytes/sec throttling
(FSDirectory.setMaxMergeWriteMBPerSec): it rate limits that max
bytes/sec load on the IO system due to merging.

Not sure if it's been exposed in Solr / ElasticSearch yet ...
It's not available in Solr. Also, Solr's class hierarchy for directory
providers is a bit different from Lucene's. In Solr, MMapDirectoryFactory
and NIOFSDirectoryFactory would need to be subclasses of
StandardDirectoryFactory; then the write-limit property could be added to
StandardDirectoryFactory and inherited by the others, as in Lucene.
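
For reference, a minimal sketch of the Lucene-side call Michael describes
above (assuming Lucene 4.0's FSDirectory; the index path is illustrative):

import java.io.File;
import org.apache.lucene.store.FSDirectory;

public class MergeThrottleDemo {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        // cap the IO load that background merges may put on the disk
        dir.setMaxMergeWriteMBPerSec(10.0);
        // ... hand dir to an IndexWriter as usual ...
    }
}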


Solr:
http://lucene.apache.org/solr/4_0_0/solr-core/org/apache/solr/core/CachingDirectoryFactory.html
Lucene:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/store/FSDirectory.html


Re: SolrCloud and distributed search

2012-10-29 Thread Bill Au
Do updates always start at the shard leader first?  If so, one can save
one internal request by sending updates only to the shard leader.  I am
assuming that when the shard leader is down, SolrJ's CloudSolrServer is
smart enough to use the newly elected shard leader after a failover has
occurred.  Am I correct?

Bill
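
For reference, a minimal SolrJ sketch of the CloudSolrServer setup under
discussion (the ZooKeeper host string and collection name are illustrative;
the client watches cluster state in ZooKeeper, which is how it finds the
newly elected leader after a failover):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudUpdateDemo {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}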

On Fri, Oct 26, 2012 at 11:42 AM, Tomás Fernández Löbbe <
tomasflo...@gmail.com> wrote:

> If you are going to use SolrJ, CloudSolrServer is even better than a
> round-robin load balancer for indexing, because it will send the documents
> straight to the shard leader (you save one internal request). If not,
> round-robin should be fine.
>
> Tomás
>
> On Fri, Oct 26, 2012 at 12:27 PM, Bill Au  wrote:
>
> > I am thinking of using a load balancer for both indexing and querying to
> > spread both the indexing and querying load across all the machines.
> >
> > Bill
> >
> > On Fri, Oct 26, 2012 at 10:48 AM, Tomás Fernández Löbbe <
> > tomasflo...@gmail.com> wrote:
> >
> > > You should still use some kind of load balancer for searches, unless
> you
> > > use the CloudSolrServer (SolrJ) which includes the load balancing.
> > > Tomás
> > >
> > > On Fri, Oct 26, 2012 at 11:46 AM, Erick Erickson <
> > erickerick...@gmail.com
> > > >wrote:
> > >
> > > > Yes, I think SolrCloud makes sense with a single shard for exactly
> > > > this reason, NRT and multiple replicas. I don't know how you'd get
> NRT
> > > > on multiple machines without it.
> > > >
> > > > But do be aware of: https://issues.apache.org/jira/browse/SOLR-3971
> > > > "A collection that is created with numShards=1 turns into a
> > > > numShards=2 collection after starting up a second core and not
> > > > specifying numShards."
> > > >
> > > > Erick
> > > >
> > > > On Fri, Oct 26, 2012 at 10:14 AM, Bill Au 
> wrote:
> > > > > I am currently using one master with multiple slaves so I do have
> > high
> > > > > availability for searching now.
> > > > >
> > > > > My index does fit on a single machine and a single query does not
> > take
> > > > too
> > > > > long to execute.  But I do want to take advantage of high
> > availability
> > > of
> > > > > indexing and real time replication.  So it looks like I can set up
> > > > > SolrCloud with only 1 shard (ie numShards=1).
> > > > >
> > > > > In this case is SolrCloud still using distributed search behind the
> > > > > screen?  Will MoreLikeThis work?
> > > > >
> > > > > Does using SolrCloud with only 1 shard make any sense at all?
> > > > >
> > > > > Bill
> > > > >
> > > > > On Thu, Oct 25, 2012 at 4:29 PM, Tomás Fernández Löbbe <
> > > > > tomasflo...@gmail.com> wrote:
> > > > >
> > > > >> It also provides high availability for indexing and searching.
> > > > >>
> > > > >> On Thu, Oct 25, 2012 at 4:43 PM, Bill Au 
> > wrote:
> > > > >>
> > > > >> > So I guess one would use SolrCloud for the same reasons as
> > > distributed
> > > > >> > search:
> > > > >> >
> > > > >> > When an index becomes too large to fit on a single system, or
> > when a
> > > > >> single
> > > > >> > query takes too long to execute.
> > > > >> >
> > > > >> > Bill
> > > > >> >
> > > > >> > On Thu, Oct 25, 2012 at 3:38 PM, Shawn Heisey <
> s...@elyograg.org>
> > > > wrote:
> > > > >> >
> > > > >> > > On 10/25/2012 1:29 PM, Bill Au wrote:
> > > > >> > >
> > > > >> > >> Is SolrCloud using distributed search behind the scene?  Does
> > it
> > > > have
> > > > >> > the
> > > > >> > >> same limitations (for example, doesn't support MoreLikeThis)
> > > > >> distributed
> > > > >> > >> search has?
> > > > >> > >>
> > > > >> > >
> > > > >> > > Yes and yes.
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>


Re: Solr4.0 / SolrCloud queries

2012-10-29 Thread Shawn Heisey

On 10/29/2012 3:26 PM, shreejay wrote:

I am trying to run two SolrClouds with 3 and 2 shards respectively (let's
say Cloud3shards and Cloud2shards). All servers are identical, with 18GB
RAM (16GB assigned to Java).


This bit right here sets off warning bells right away.  You're only 
leaving 2GB of RAM for the OS to cache your index, which you later say 
is 50GB.  It's impossible for me to give you a precise figure, but I 
would expect that with an index that size, you'd want to have at least 
20GB of free memory, and if you can have 50GB or more of free memory 
after the OS and Java take their chunk, Solr would have truly excellent 
performance.  As it is now, your performance will be terrible, which 
probably explains all your issues.


It seems highly unlikely that you would have queries complex enough that
you actually do need to allocate 16GB of RAM to Java.  Also, requesting
large numbers of documents (the 5000 and 20,000 numbers you mentioned) is
slow, and the cost is compounded in a cloud (distributed) index.  Solr is
optimized for a small number of results.


First recommendation for fixing things: get more memory.  32GB would be 
a good starting point, 64GB would be better, so that the entire index 
will be able to fit in OS cache memory.  If you expect your index to 
grow at all, plan accordingly.


Second recommendation, whether or not you get more actual memory: Lower 
the memory that Java is using, and configure some alternate memory 
management options for Java.  Solr does have caching capability, but it 
is highly specialized.  For general index caching, the OS does a far 
better job, and it needs free memory in order to accomplish it.  Here are
some command-line options for Java that I passed along to someone else on
this list:


-Xmx4096M -Xms4096M -XX:NewRatio=1 -XX:+UseParNewGC 
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled


http://www.petefreitag.com/articles/gctuning/



Re: Doc Transformer to remove document from the response

2012-10-29 Thread Chris Hostetter

: I did not look where pagination happens, but it looks like
: DocTransform gets applied at the very end (response writer), which in
: turn means pagination is not an issue , just some pages might get
: shorter due to this additional filtering, but that is quite ok for me.

it depends on what you mean by "not an issue" ... i would argue that if a 
client asks for the first 10 matches, and you return a numFound of 678 but 
only give back 8 matches (because you have "excluded" two from that first 
page), then that's a bug.

I think most people would agree that the "correct" way to exclude a 
document would be to tie into the logic of executing the 
main query (like QEC does, or via a filter query), so that if a user asks 
for the first 10 documents, you give them the first 10 documents - no 
matter how many are being excluded.


-Hoss


Solr4.0 / SolrCloud queries

2012-10-29 Thread shreejay
Hi All, 

I am trying to run two SolrClouds with 3 and 2 shards respectively (let's
say Cloud3shards and Cloud2shards). All servers are identical, with 18GB
RAM (16GB assigned to Java).

I am facing a few issues on both clouds and would be grateful if anyone
else has seen / solved these.

1) Every now and then, Solr takes one of the servers out (it either shows
as "recovering" (orange) or it's taken offline completely). The Logging
tab on the Admin page shows these errors for Cloud3shards:

Error while trying to
recover:org.apache.solr.client.solrj.SolrServerException: Timeout occured
while waiting response from server at: http://xxx:8983/solr/xxx

Error while trying to recover.
core=xxx:org.apache.solr.common.SolrException: I was asked to wait on state
recovering for xxx:8983_solr but I still do not see the request state. I see
state: recovering live:false

On the Cloud2shards I also see similar messages.

I have noticed it happens more while indexing documents, but I have also
seen it happen while only querying Solr.

Both SolrClouds are managed by the same Zookeeper ensemble (set of 3 ZK
servers). 

2) I am able to commit, but optimize never seems to work. Right now I have
an average of 30 segments on every Solr server. Has anyone else faced this
issue? I have tried optimizing from the admin page and as an HTTP POST
request; both of them fail. It's not because of hard disk space, since my
index size is less than 50GB and I have 500GB of space on each server.

3) If I try to query Solr with rows=5000 or more (for Cloud2; for Cloud1
it's around 20,000 documents), I get:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request:[http://ABC1:8983/solr/aaa,
http://ABC2:8983/solr/aaa].

4) I have also noticed that ZK switches leaders every now and then. I
attribute it to point 1 above: as soon as the leader is down, another
server takes its place. My concern is the frequency with which this switch
happens. I guess this is completely related to point 1, and if that is
solved, I will not have this issue either.



--Shreejay



Re: Doc Transformer to remove document from the response

2012-10-29 Thread eks dev
Thanks Hoss,
I probably did not formulate the question properly, but you gave me an answer.

I do it already in a SearchComponent; I just wanted to centralise this
control of the depth and width of the response in a single place in the
code [style={minimal, verbose, full...}].

It just sounds logical to me to have this possibility in a DocTransformer,
as a null document is a kind of "extremely" modified document.

Even better, it might actually work… (did not try it yet)

@Override
public void setContext(TransformContext context) {
    context.iterator = new FilteringIterator(context.iterator);
}

Simply by providing my own FilteringIterator that would skip documents I
do not need in the response? Does this sound right from a "legitimate API
usage" perspective?
I did not look at where pagination happens, but it looks like the
DocTransformer gets applied at the very end (response writer), which in
turn means pagination is not an issue; some pages might just get shorter
due to this additional filtering, but that is quite OK for me.



On Mon, Oct 29, 2012 at 7:59 PM, Chris Hostetter
 wrote:
>
> : Transformer is great to augment Documents before shipping to response,
> : but what would be a way to prevent document from being delivered?
>
> DocTransformers can only modify the documents -- not the Document List.
>
> what you are describing would have to be done as a SearchComponent (or in
> the QParser) -- take a look at QueryElevation component for an example of
> how to do something like this that plays nicely with pagination.
>
>
> -Hoss


Re: facet prefix with tokenized fields

2012-10-29 Thread Erick Erickson
In short, no. The problem is that faceting works by counting
documents with distinct tokens in the field. So in your example
I'd expect you to see facets for "toys", "for", "children". All it
has to work with are the tokens; the fact that the original input
was three words is completely lost at this point.

You could index these with KeywordTokenizer and facet on
_that_ field, which would work in this case. I don't know how
well that would fit into the rest of your app, though.

Best
Erick

On Mon, Oct 29, 2012 at 8:03 AM, Grzegorz Sobczyk
 wrote:
> Hi.
> Is there any solution to facet documents with a specified prefix on some
> tokenized field, but get the original value of the field in the result?
>
> e.g.:
> <field name="category_ac" ... stored="true" multiValued="true" />
>
> [fieldType/analyzer definition stripped by the mail archive]
>
>
> Indexed value: "toys for children"
> query:
> q=&start=0&rows=0&facet.limit=-1&facet.mincount=1&f.category_ac.facet.prefix=chi&facet.field=category_ac&facet=true
>
> I'd like to get exactly "toys for children", not "children"
>
> --
> Grzegorz Sobczyk


Re: Exception while getting Field info in Lucene

2012-10-29 Thread Erick Erickson
I suspect what's happening is that the index format changed between 3.x
and 4.x and somehow the Luke request handler is getting mixed up. I'm
further guessing that the -1 is just the default value; since it's clearly
bogus, it's a flag that the Luke request handler just didn't see what it
expected...

Optimizing should fix this up, since it'll re-write the entire index into
the 4.x format. Running the index upgrade tool might work too.

Best
Erick

On Mon, Oct 29, 2012 at 11:07 AM, adityab  wrote:
> Hi Erick,
>
> I have upgraded from 3.5 to 4.0. On the first index build I was able to
> see the distinct terms in the UI under the Schema Browser section. I had
> to add a new stored date field to the schema and re-index.
> After that, the UI shows the distinct term count for every field as
> "-1".
> Can you please advise if I am missing anything in the configuration?
>
> here is the luke response for field "_version_"
>
> http://.../solr/admin/luke?fl=_version_
>
>
> type: long
> schema flags: ITSOF-
> index flags: -TS---
> docs: 16392300
> distinct: -1
> [topTerms/histogram values stripped by the mail archive]
>
> It looks like this problem can be solved by performing an optimize, as
> posted in one of the archives:
> http://lucene.472066.n3.nabble.com/always-getting-distinct-count-of-1-in-luke-response-solr4-snapshot-td3985546.html
>
> Just curious what makes this value "-1".


Re: Any way to by pass the checking on QueryElevationComponent

2012-10-29 Thread Chris Hostetter

: We are currently working on having Solr files read from HDFS. We extended
: some of the classes so as to avoid modifying the original Solr code and
: make it compatible with the future release. So here comes the question, I
: found in QueryElevationComponent, there is a piece of code checking whether
: elevate.xml exists at local file system. I am wondering if there is a way
: to by pass this?

I haven't looked closely, but i suspect this code is just kind of old and 
could be cleaned up to play more nicely with SolrCloud (and thus your HDFS 
work as well)

The theory behind this code is that we want to support two distinct use 
cases: reading the elevation file either from config (once on SolrCore 
creation) or from data (on every IndexReader reload).

the use of getConfigDir() to check for the elevation file could probably 
be replaced by a straight call to openResource (with a null check).  the 
check for the elevation file in the data dir could ... maybe? ... be 
delegated to the DirectoryFactory? ... i'm not sure about that part, i 
haven't kept close tabs on some of the improvements miller has made with 
that abstraction lately.

(BTW: I think some of this was also already fixed in SOLR-3522, but i 
haven't looked at how exactly)


The simplest answer would be just to recognize that not every feature of 
solr can work in conjunction with every possible solr plugin -- perhaps if 
people want to use your "HdfsDirectoryFactory" they just have to accept 
that they can only use the "conf" style elevate.xml instead of the "data" 
option?


-Hoss


Re: improving score of result set

2012-10-29 Thread yunfei wu
I agree with Chris Hostetter that we might not be able to provide
suggestions for the use cases unless clear reasons are provided
("don't like the order" is a feeling, not a reason explaining how you want
to adjust the order).

- if you want to put some results on top based on some terms regardless of
the scores, you could consider "elevation";
- if you find the ordering cannot satisfy you, consider changing the search
query to make it more specific, so the docs you want can have higher
scores;
- if improving the search query is not enough for the use case (I think it
should be) and you know what factors you could use to improve the scoring,
you can customize the scoring algorithm to inject them into the score
calculation;
- if you want to show more results on a page (with correct hit counts, and
with similar results from the same site collapsed), consider using "field
collapsing".

Thanks,
Yunfei Wu



On Mon, Oct 29, 2012 at 12:28 PM, Alexander Aristov <
alexander.aris...@gmail.com> wrote:

> You absolutely follow my problem. I want to put Obama from espn on top
> just because this is an exceptional and probably interesting occurrence.
> And the score is low because the content is long or there are no matches
> in the title.
> On 29.10.2012 23:18, "Chris Hostetter" wrote:
>
> >
> > You haven't really explained things enough for us to help you...
> >
> > : First of all I don't have a site which I want to boost. All docs are
> > equal.
> > :
> > : Secondly I will explain what I have. I have 100 docs indexed. I do a
> > query
> > : which returns 10 found docs. 8 of them from one site and 2 from other
> > : different sites. I don't like the order. Technically scores are good. I
> > : understand why these 8 docs go first - because they have better
> matching.
> > : But I don't like it. I want articles from smaller collections to
> > : somehow compete with other docs. For other queries the situation can change
> > and
> > : another site can produce more results. In that case I would lower that
> > : site.
> >
> > *why* don't you like that order?  what is it that makes you think that
> > order is bad? you say you want the articles from the smaller collection
> > to "compete" with the other docs -- but they already have.  unless part
> of
> > your query included a clause that is biased in favor of one "collection"
> > then all of those documents got a "fair" score for the query you passed
> > in.
> >
> > It might help if you gave us a specific, concrete example of some *real*
> > queries and the *real* documents they return, and why you don't think
> those
> > scores are fair.
> >
> > Because if i'm following your reasoning, and thinking about a situation
> > where i might have an index full of webpages, and some of those web pages
> > are from "cnn.com" and some of those pages are from "espn.com" then a
> > query for "Obama" might match lots of pages from cnn.com, with "high"
> > scores, and there might be *one* match on espn.com with an extremely low
> > score, because Obama is mentioned one time in some quote or something in
> a
> > *very* long page ... in what situation would it make any sense to bias
> the
> > score of that one espn.com document to make it score higher then other
> > documents from cnn.com that legitimately score better because they
> mention
> > Obama in the title, or many times in the body of the page?
> >
> >
> > -Hoss
> >
>


Re: SOLR - To point multiple indexes in different folder

2012-10-29 Thread Erick Erickson
How did you get the 7 directories anyway? From your message,
they sound like they are _Solr_ indexes, in which case you
somehow created them with Solr. But I don't really understand
the setup in that case.

If these are Solr/Lucene indexes, you can use the "multicore"
features. This treats them as separate indexes, and you have
to address each one specifically, something like
...localhost/solr/collection2/select etc.

Sharding, on the other hand, _assumes_ that all the indexes
really make up one logical index and handles the distribution/collation
automatically.

If this makes no sense, could you explain your setup a little more?

Best
Erick


On Mon, Oct 29, 2012 at 7:34 AM, ravi.n  wrote:
> Hello Solr gurus,
>
> I am a newbie to the Solr application; below are my requirements:
>
> 1. We have 7 folders containing indexed files at which the Solr
> application is to be pointed. I understand the shards feature can be used
> for searching; is there any other alternative? Each folder has around 24
> million documents.
> 2. We should configure Solr to index new incoming data from a
> database/CSV file; what is the required configuration in Solr to achieve
> this?
>
> Any quick response on this will be appreciated.
> Thanks
>
> Regards,
> Ravi
>
>
>


Re: improving score of result set

2012-10-29 Thread Alexander Aristov
You absolutely follow my problem. I want to put Obama from espn on top
just because this is an exceptional and probably interesting occurrence.
And the score is low because the content is long or there are no matches
in the title.
On 29.10.2012 23:18, "Chris Hostetter" wrote:

>
> You haven't really explained things enough for us to help you...
>
> : First of all I don't have a site which I want to boost. All docs are
> equal.
> :
> : Secondly I will explain what I have. I have 100 docs indexed. I do a
> query
> : which returns 10 found docs. 8 of them from one site and 2 from other
> : different sites. I don't like the order. Technically scores are good. I
> : understand why these 8 docs go first - because they have better matching.
> : But I don't like it. I want articles from smaller collections to
> : somehow compete with other docs. For other queries the situation can change
> and
> : another site can produce more results. In that case I would lower that
> : site.
>
> *why* don't you like that order?  what is it that makes you think that
> order is bad? you say you want the articles from the smaller collection
> to "compete" with the other docs -- but they already have.  unless part of
> your query included a clause that is biased in favor of one "collection"
> then all of those documents got a "fair" score for the query you passed
> in.
>
> It might help if you gave us a specific, concrete example of some *real*
> queries and the *real* documents they return, and why you don't think those
> scores are fair.
>
> Because if i'm following your reasoning, and thinking about a situation
> where i might have an index full of webpages, and some of those web pages
> are from "cnn.com" and some of those pages are from "espn.com" then a
> query for "Obama" might match lots of pages from cnn.com, with "high"
> scores, and there might be *one* match on espn.com with an extremely low
> score, because Obama is mentioned one time in some quote or something in a
> *very* long page ... in what situation would it make any sense to bias the
> score of that one espn.com document to make it score higher then other
> documents from cnn.com that legitimately score better because they mention
> Obama in the title, or many times in the body of the page?
>
>
> -Hoss
>


Re: De-normalize one-to-many, cannot default multivalued field

2012-10-29 Thread Chris Hostetter
I don't know if there is an easy way to address the key crux of your 
question...

: only applied of there are 0 values for the field.  Is there a way when
: using the DIH to replace 'null' or missing values with a default, such
: that I can ensure that I always have the same number of values in each
: multivalued field?

...but one other idea i'd like to put out there for you to consider is to 
change how you think about your index, so that instead of having one 
document per "root" (in your sample data) you create one document per 
"style" -- this would let you query on individual styles very easily, and 
you could use Grouping when you need to find "root" products...

https://wiki.apache.org/solr/FieldCollapsing

...I don't know enough about your usecase to know if this approach would 
make your life easier or more complicated, but it's something to consider.

-Hoss


Re: improving score of result set

2012-10-29 Thread Chris Hostetter

You haven't really explained things enough for us to help you...

: First of all I don't have a site which I want to boost. All docs are equal.
: 
: Secondly I will explain what I have. I have 100 docs indexed. I do a query
: which returns 10 found docs. 8 of them from one site and 2 from other
: different sites. I don't like the order. Technically scores are good. I
: understand why these 8 docs go first - because they have better matching.
: But I don't like it. I want articles from smaller collections to
: somehow compete with other docs. For other queries the situation can change
: and another site can produce more results. In that case I would lower that
: site.

*why* don't you like that order?  what is it that makes you think that 
order is bad? you say you want the articles from the smaller collection 
to "compete" with the other docs -- but they already have.  unless part of 
your query included a clause that is biased in favor of one "collection" 
then all of those documents got a "fair" score for the query you passed 
in.

It might help if you gave us a specific, concrete example of some *real* 
queries and the *real* documents they return, and why you don't think those 
scores are fair.

Because if i'm following your reasoning, and thinking about a situation 
where i might have an index full of webpages, and some of those web pages 
are from "cnn.com" and some of those pages are from "espn.com" then a 
query for "Obama" might match lots of pages from cnn.com, with "high" 
scores, and there might be *one* match on espn.com with an extremely low 
score, because Obama is mentioned one time in some quote or something in a 
*very* long page ... in what situation would it make any sense to bias the 
score of that one espn.com document to make it score higher then other 
documents from cnn.com that legitimately score better because they mention 
Obama in the title, or many times in the body of the page?


-Hoss


Re: How to efficiently find documents that have a specific value for a field OR the field does not exist at all

2012-10-29 Thread Chris Hostetter

: > field:"value" OR (*:* AND NOT field:[* TO *])

: Instead of field:[* TO *], you can define a default value in schema.xml. 
: Or DefaultValueUpdateProcessorFactory in solrconfig.

right -- the most efficient way to query for this kind of "has value in 
fieldX" or "does not have a value in fieldX" is to index that specific 
piece of information in an easily queryable way.

either using a default value, or with a new boolean field "has_fieldX" 
that you can then query on
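
For illustration, a minimal SolrJ sketch of the boolean-field approach
("has_fieldX" is the hypothetical field named above, populated at index
time; the host and field names are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class HasValueQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        // a cheap term lookup instead of the NOT fieldX:[* TO *] range scan
        SolrQuery q = new SolrQuery("fieldX:\"value\" OR has_fieldX:false");
        System.out.println(server.query(q).getResults().getNumFound());
        server.shutdown();
    }
}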


-Hoss


Re: improving score of result set

2012-10-29 Thread Alexander Aristov
Perhaps this is an XY problem.

First of all I don't have a site which I want to boost. All docs are equal.

Secondly I will explain what I have. I have 100 docs indexed. I do a query
which returns 10 found docs. 8 of them from one site and 2 from other
different sites. I don't like the order. Technically scores are good. I
understand why these 8 docs go first - because they have better matching.
But I don't like it. I want articles from smaller collections to
somehow compete with other docs. For other queries the situation can change
and another site can produce more results. In that case I would lower that
site.

I've had a deep thought and think I can try grouping.

More insights into my problem: these 8 docs have similar text which matches
the query, and that's why they all get a similar and relatively high score.
For example, the docs have text:

1. Red apple felt from tree
2. Blue apple felt from tree
3. Green apple felt from tree
...
8. Orange pineapple felt from tree
9. A boy felt suddenly ill. A tree was green.
10. Two pieces felt apart and never collapse. Family tree was rich.

I query "felt tree". Docs 1-8 from one site.

I would like to make the score of docs 9 and 10 higher.

Grouping can help but maybe there are othe solutions.

Alexander
On 29.10.2012 22:11, "Chris Hostetter" wrote:

>
> You've mentioned that you want to "improve" the scores of these documents,
> but you haven't really given any specifics about when/how/why you want to
> improve the score in general -- ie: in this example you have a total of
> 10 docs, but how do you distinguish the 2 special docs from the 8 other
> docs?  is it because they are the only two docs with some specific
> field value, or is it just because they are in the smaller of two "sets"
> of documents if you partition on some field?  if you added 100 more docs
> that were all in the same set as those two, would you want the other 8
> documents to start getting boosted?
>
> Let's assume that what you are trying to ask is..
>
>   "I want to artificially boost the scores of documents when the 'site'
>field contains 'cnn.com'"
>
> A simple way to do that is just to add an optional clause to your query
> that matches on "site:cnn.com" so the scores of those documents will be
> increased, but make the "main" part of your query required...
>
>q=+(your main query) site:cnn.com
>
> Or if you use the dismax or edismax parsers there are special params (bq
> and/or boost) that help make this easy to split out...
>
>
> https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
>
>
>
> FWIW: this smells like an XY problem ... more details about your actual
> situation and end goal would be helpful...
>
> https://people.apache.org/~hossman/#xyproblem
> XY Problem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
>
>
> -Hoss
>


Re: Doc Transformer to remove document from the response

2012-10-29 Thread Chris Hostetter

: Transformer is great to augment Documents before shipping to response,
: but what would be a way to prevent document from being delivered?

DocTransformers can only modify the documents -- not the Document List.

what you are describing would have to be done as a SearchComponent (or in 
the QParser) -- take a look at QueryElevation component for an example of 
how to do something like this that plays nicely with pagination.


-Hoss


Re: Jetty / Solr memory consumption

2012-10-29 Thread Nicolai Scheer
Hi again!

On 29 October 2012 18:39, Nicolai Scheer  wrote:
> Hi!
>
> We're currently facing a strange memory issue we can't explain, so I'd
> like to kindly ask if anyone is able to shed some light on the behaviour
> we encounter.
>
> We use a Solr 3.5 instance on a Windows Server 2008 machine equipped
> with 16GB of ram.
> The index uses 8 cores, 10 million documents, disk size of 180 GB in total.
> The machine is only used for searches, text extraction is done on another box.
[...]

I should add which java version we're using:

java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)

Greetings

Nico


Re: improving score of result set

2012-10-29 Thread Chris Hostetter

You've mentioned that you want to "improve" the scores of these documents, 
but you haven't really given any specifics about when/how/why you want to 
improve the score in general -- ie: in this example you have a total of 
10 docs, but how do you distinguish the 2 special docs from the 8 other 
docs?  is it because they are the only two docs with some specific 
field value, or is it just because they are in the smaller of two "sets" 
of documents if you partition on some field?  if you added 100 more docs 
that were all in the same set as those two, would you want the other 8 
documents to start getting boosted?

Let's assume that what you are trying to ask is..

  "I want to artificially boost the scores of documents when the 'site' 
   field contains 'cnn.com'" 

A simple way to do that is just to add an optional clause to your query 
that matches on "site:cnn.com" so the scores of those documents will be 
increased, but make the "main" part of your query required...

   q=+(your main query) site:cnn.com

Or if you use the dismax or edismax parsers there are special params (bq 
and/or boost) that help make this easy to split out...

https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
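
For illustration, a hedged SolrJ sketch of the bq approach (the parser,
fields, and boost value are illustrative, not from the original mail):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BoostQueryDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("your main query");
        q.set("defType", "edismax");   // dismax works too
        q.set("qf", "title body");     // illustrative query fields
        q.set("bq", "site:cnn.com^2"); // optional clause that boosts matches
        System.out.println(server.query(q).getResults().getNumFound());
        server.shutdown();
    }
}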



FWIW: this smells like an XY problem ... more details about your actual 
situation and end goal would be helpful...

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341



-Hoss


Re: improving score of result set

2012-10-29 Thread Alexander Aristov
I think I get it the right way.

Referring back to my example.

I will get 3 groups:
a large group with 8 documents in it, and
two other groups with one document in each.

If I limit a group to 5 docs, then the 1st group will have only 5 docs and
the other two will still contain one doc.

And the order (based on score) won't be different. Each document in the
first group will have a higher score, won't it? Or is the document score in
each group calculated relatively, so that the top docs have similar scores?

So this approach just limits the number of similar documents. Instead, I
want to keep all documents in the results but shuffle them appropriately.
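
For reference, a minimal SolrJ sketch of the grouping setup Erick describes
in the quoted mail below (the "site" field and the limits are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupingDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("felt tree");
        q.set("group", true);         // enable result grouping
        q.set("group.field", "site"); // one group per site
        q.set("group.limit", 5);      // top 5 docs within each group
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getGroupResponse().getValues().size() + " grouped fields");
        server.shutdown();
    }
}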

Best Regards
Alexander Aristov


On 29 October 2012 15:55, Erick Erickson  wrote:

> I don't think you're reading the grouping right. When you use grouping,
> you get the top N groups, and within each group you get the top M
> scoring documents. So you can actually get _more_ documents back than in
> the non-grouping case and your app can then intelligently intersperse them
> however you want.
>
> Best
> Erick
>
> On Mon, Oct 29, 2012 at 5:02 AM, Alexander Aristov
>  wrote:
> > Interesting but not exactly what I want to get.
> >
> > If I group items then I will get small number of docs. I don't want
> this. I
> > need all of them.
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 29 October 2012 12:05, yunfei wu  wrote:
> >
> >> Besides changing the scoring algorithm, what about "Field Collapsing" -
> >> http://wiki.apache.org/solr/FieldCollapsing - to collapse the results
> from
> >> same website url?
> >>
> >> Yunfei
> >>
> >>
> >> On Mon, Oct 29, 2012 at 12:43 AM, Alexander Aristov <
> >> alexander.aris...@gmail.com> wrote:
> >>
> >> > Hi everybody,
> >> >
> >> > I have a question about scoring calculation algorithms and approaches.
> >> >
> >> > Let's say I have 10 documents. 8 of them come from one web site (I
> >> have
> >> > a field in schema with URL) and the other 2 from other different web
> >> sites.
> >> > So for this example I have 3 web sites.
> >> >
> >> > For some queries those 8 documents have better terms matching and they
> >> > appear at the top of results. It makes that 8 docs from one source
> come
> >> > first and the other two come next and the last.
> >> >
> >> > I want to maybe artificially improve score of those 2 docs and put
> them
> >> > atop. I don't want that they necessarily go first but if they come in
> the
> >> > middle of the result set it would be perfect.
> >> >
> >> > One of the ideas is to reduce score for docs in the result set from
> one
> >> > site so that if it contains too many docs from one source total
> scoring
> >> of
> >> > each those docs would be reduced proportionally.
> >> >
> >> > Important thing is that I don't want to reduce doc score permanently.
> >> Only
> >> > at query time. Maybe some functional queries can help me?
> >> >
> >> > How can I do this or maybe there are other ideas.
> >> >
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >>
>


Jetty / Solr memory consumption

2012-10-29 Thread Nicolai Scheer
Hi!

We're currently facing a strange memory issue we can't explain, so I'd
like to kindly ask if anyone is able to shed some light on the behaviour
we encounter.

We use a Solr 3.5 instance on a Windows Server 2008 machine equipped
with 16GB of ram.
The index uses 8 cores, 10 million documents, disk size of 180 GB in total.
The machine is only used for searches, text extraction is done on another box.

We run Solr on Jetty, installed as a service using procrun.
The only adjusted JVM parameter is

-XX:MaxPermSize=256M

When we connect to the Jetty / Solr instance with JVisualVM for
debugging purposes we see that MaxHeapSize is selected to 3.8 GB
automatically.

After approx. 48 hours after startup and frequent search activity, we
see the following:

The "JettyService.exe" process eats up nearly all available memory.

Windows Resource Monitor reports for this process:

Working Set (KB)
14.317.768

Shareable (KB)
9.928.032

Private (KB)
4.389.736

This results in machine totals of:

Hardware Reserved: 3MB
In Use: 16155 MB
Modified: 7 MB
Standby: 206 MB
Free: 13 MB

Available: 219 MB
Cached: 213 MB
Total: 16381 MB
Installed: 16384

Actually there's no free memory left for other processes. We are
currently quite unsure how to interpret these values.

To my mind the total memory limit of Solr should be

Max Heap Size + PermSpace + Stack + JVM stuff

with JVM stuff and Stack being a few hundred megabytes or so, but not gigabytes.

The "private" value, being around 4.3 GB is the size I expected for
memory consumption (i.e. 3.8 GB heap + few hundred megs on top for
other stuff). Usually I'd expect the huge amount of memory listed as
"shareable" to appear under "standby" as well, meaning that it denotes
cached memory that is used by java, but might be freed if requested.
The small amount of only 213 MB cached memory indicates that there's
no headroom left for file system cache etc.

My questions:

1) Is the equation Max Heap Size + PermSpace + Stack + JVM stuff
roughly correct or is there anything fundamental missing that can
consume that much memory?

2) Since the private memory seems to be OK, can anyone explain
why there is such a huge amount of shareable memory used that does not
seem to be available to other processes (i.e. is it locked in some way?)

3) Is there something Solr-specific that adds to the memory equation,
i.e. uses memory but not from the heap pool?

Any help is appreciated!

Thanks!

Greetings

Nico


Re: hot shard concept

2012-10-29 Thread Shawn Heisey

On 10/29/2012 7:55 AM, Dmitry Kan wrote:

Hi everyone,

at this year's Berlin Buzzwords conference someone (sematext?) has
described a technique of a hot shard. The idea is to have a slim shard to
maximize the update throughput during a day (when millions of docs need to
be posted) and make sure the indexed documents are immediately searchable.
In the end of the day the day's documents are moved to cold shards. If I'm
not mistaken, this was implemented for ElasticSearch. I'm currently
implementing something similar (but pretty tailored to our logical sharding
use case) for Solr (3.x). The feature set looks roughly like this:

1) front end solr (query router) is aware of the hot shard: it directs the
incoming queries to the hot and "cold" shards.
2) new incoming documents are directed first to the hot shard and then
periodically (like once a day or once a week) moved over to the closest in
time cold shard. And for that...
3) hot shard index is being partitioned low level using Lucene's
IndexReader / IndexWriter with the implementation based on [1], [2] and
customized to logical (time-based) sharding.


The question is: is doing index partitioning low-level a good way of
implementing the hot shard concept? That is, is there anything better
operationally-wise from the point of view of disaster recovery / search
cluster support? Am I missing some obvious SOLR-ish solution?
Doing instead the periodical hot shard cleaning and re-posting its source
documents to the closest cold shard is less modular and hence more
complicated operationally for us.

Please let me know, if you need more details or if the problem isn't clear
enough. Thanks.

[1]
http://blog.foofactory.fi/2008/01/regenerating-equally-sized-shards-from.html
[2] https://github.com/HON-Khresmoi/hash-based-index-splitter


This is exactly how I set up my indexing, been that way since early 2010 
when we first started using Solr 1.4.0.  Now we are on 3.5 and an 
upgrade to 4.1 (branch_4x) is in the works.  Coming from terminology 
used in our previous search product, we call the hot shard an 
"incremental" shard.  My SolrJ indexing application takes care of all 
management of which documents are in the incremental and which documents 
are in the large shards.  We call the large ones "static" shards because 
deletes and the occasional reinsert are the only updates that they 
receive, except for the daily distribute process.


We don't do anything to send queries to the hot shard "first" ... it is 
simply listed first in the shards parameter on what we call the broker 
core.  Average response time on the incremental is single digit, the 
other shards average at about 30 to 40 milliseconds. Median numbers 
(SOLR-1972 patch) are much better.
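
As a hedged illustration of that setup (host and core names are made up),
the broker core just carries a shards list with the hot shard first:

  shards=idx1:8983/solr/incremental,idx1:8983/solr/s0_live,idx2:8983/solr/s1_live

Distributed search merges the per-shard results, so the position in the list
does not affect ranking; listing the incremental shard first is only a
convention.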


Thanks,
Shawn



Re: Is it possible to use something like sum() in a solr-query?

2012-10-29 Thread Jack Krupansky
Maybe we do need to think more seriously about some of those higher-level 
SQL-like features.


Function queries, which can now be used as "pseudo-fields" in the "fl"
parameter, do have a "sum" function, but that merely adds an explicit list of
functions/field names for a single document.


Maybe what we need is a separate parameter for such aggregate functions.

-- Jack Krupansky

-Original Message- 
From: Markus.Mirsberger

Sent: Monday, October 29, 2012 11:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to use something like sum() in a solr-query?

I have for example an integer field and want to sum all these values for
all the matching documents.

Similar to this in sql:

SELECT SUM(/expression/ )
FROM tables
WHERE predicates;


Regards,
Markus

On 29.10.2012 22:25, Jack Krupansky wrote:

Unfortunately, neither the subject nor your message says it all. Be
specific - what exactly do you want to sum? All matching docs? Just
the returned docs? By group? Or... what?

You can of course develop your own search component that does whatever
it wants with the search results.

-- Jack Krupansky

-Original Message- From: Markus.Mirsberger
Sent: Monday, October 29, 2012 11:08 AM
To: solr-user@lucene.apache.org
Subject: Is it possible to use something like sum() in a solr-query?

Hi,

the subject says it all :)
Is there something like sum() available in a solr query to sum all
values of a field ?

Regards,
Markus Mirsberger




Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Glen Newton
+10

On Mon, Oct 29, 2012 at 12:17 PM, Michael Della Bitta
 wrote:
> As an external observer, I think the main problem is your branding.
> "Realtime Near Realtime" is definitely an oxymoron, and your ranking
> algorithm is called "Ranking Algorithm," which is generic enough to
> suggest that a. it's the only ranking algorithm available, and b. by
> implication, that Solr doesn't have one built in.
>
> I would suggest two improvements:
>
> 1. Come up with a top-level name for your overall efforts. Apache
> Foundation has 'Apache,' so automatic branding of every component they
> build. Then your ranking algorithm could be called "Tgels Ranking
> Algorithm for Apache Solr" (for example), which is totally legit. And
> "Tgels Realtime Search for Apache Solr."
>
> 2. Maybe point out that you're building on top of the work of the
> Apache Solr and Lucene projects a little more prominently.
>
> I think with those two little tweaks, you'd actually very easily get
> more people interested in your contributions.
>
> Just my two cents,
>
> Michael Della Bitta
>
> 
> Appinions
> 18 East 41st Street, 2nd Floor
> New York, NY 10017-6271
>
> www.appinions.com
>
> Where Influence Isn’t a Game
>
>
> On Mon, Oct 29, 2012 at 11:35 AM, Nagendra Nagarajayya
>  wrote:
>>
>> Jack:
>>
>> I respect your hard work responding to user problems on the mail list. So it
>> would be nicer to try out Realtime NRT than pass rogue comments about whether
>> a contribution is legit/spam or a scam... I guess it illuminates one's own
>> narrow-minded view ...  The spirit of open source is contributions from
>> not only committers but other developers; from the Solr wiki: "A half-baked
>> patch in Jira, with no documentation, no tests and no backwards
>> compatibility is better than no patch at all."
>>
>> You would gain more respect if you actually download realtime-nrt, check
>> whether it does provide a view of a realtime index compared to a point-in-time
>> snapshot, see if you can understand the code, and provide clarity and
>> feedback to the list if you do find problems with it. realtime-nrt offers
>> search capability, as opposed to realtime-get. Check whether this is true ... I would
>> really welcome your comments on the list or through the JIRA here:
>>
>> https://issues.apache.org/jira/browse/SOLR-3816
>>
>>
>> Regards,
>>
>> Nagendra Nagarajayya
>> http://solr-ra.tgels.org
>> http://rankingalgorithm.tgels.org
>>
>> On 10/29/2012 7:30 AM, Jack Krupansky wrote:
>>>
>>> Could any of the committers here confirm whether this is a legitimate
>>> effort? I mean, how could anything labeled "Apache ABC with XYZ" be an
>>> "external project" and be sanctioned/licensed by Apache? In fact, the linked
>>> web page doesn't even acknowledge the ownership of the Apache trademarks or
>>> ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT
>>> makes available a near realtime view". Equally nonsensical. Who knows, maybe
>>> it is legit, but it sure comes across as a scam/spam.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Nagendra Nagarajayya
>>> Sent: Monday, October 29, 2012 10:06 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and
>>> Realtime NRT available for download
>>>
>>> Hi!
>>>
>>> I am very excited to announce the availability of Apache Solr 4.0 with
>>> RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high
>>> performance and more granular NRT implementation as to soft commit. The
>>> update performance is about 70,000 documents / sec* (almost 1.5-2x
>>> performance improvement over soft-commit). You can also scale up to 2
>>> billion documents* in a single core, and query half a billion documents
>>> index in ms**. Realtime NRT is different from realtime-get. realtime-get
>>> does not have search capability and is a lookup by id. Realtime NRT
>>> allows full search, see here 
>>> for more info.
>>>
>>> Realtime NRT has been contributed back to Solr, see JIRA:
>>> https://issues.apache.org/jira/browse/SOLR-3816
>>>
>>> RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
>>> boolean/dismax/boost queries and is compatible with the new Lucene 4.0
>>> api.
>>>
>>> You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
>>> and Realtime NRT performance from here:
>>> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
>>>
>>> You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
>>> http://solr-ra.tgels.org
>>>
>>> Please download and give the new version a try.
>>>
>>> Note:
>>> 1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project
>>>
>>> Regards,
>>>
>>> Nagendra Nagarajayya
>>> http://solr-ra.tgels.org
>>> http://rankingalgorithm.tgels.org
>>>
>>> * performance is a real use case of Apache Solr with RankingAlgorithm as
>>> seen at a user installation
>>> ** performance seen when

Re: Any way to by pass the checking on QueryElevationComponent

2012-10-29 Thread James Ji
We want to put all files into a file system (HDFS). It is easier to maintain
if both the config and index files are in the same place. You are right, we can
put a dummy one there to bypass that. But I have another idea: we subclass the
Directory class to handle all file access. If QueryElevationComponent
used the Directory class to check file existence, it would not have this
problem. I am wondering if this is a bug.

Any thought will be appreciated.
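
As a rough sketch of the subclassing idea discussed below (the visibility and
exact signature of loadElevationMap vary by Solr version and are assumptions
here; if the method is private in your version you would have to copy the
component rather than subclass it):

  import java.io.IOException;
  import java.util.Map;
  import org.apache.solr.core.Config;
  import org.apache.solr.handler.component.QueryElevationComponent;

  public class HdfsElevationComponent extends QueryElevationComponent {
    @Override
    protected Map<String, ElevationObj> loadElevationMap(Config cfg)
        throws IOException {
      // Assumption: fetch elevate.xml from a custom source (e.g. HDFS),
      // wrap it in a Config, then reuse the parent's parsing logic.
      return super.loadElevationMap(cfg);
    }
  }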


On Mon, Oct 29, 2012 at 1:26 AM, Amit Nithian  wrote:

> Is the goal to have the elevation data read from somewhere else? In
> other words, why don't you want the elevate.xml to exist locally?
>
> If you want to read the data from somewhere else, could you put a
> dummy elevate.xml locally and subclass the QueryElevationComponent and
> override the loadElevationMap() to read this data from your own custom
> location?
>
> On Fri, Oct 26, 2012 at 6:47 PM, James Ji  wrote:
> > Hi there
> >
> > We are currently working on having Solr files read from HDFS. We extended
> > some of the classes so as to avoid modifying the original Solr code and
> > make it compatible with the future release. So here comes the question, I
> > found in QueryElevationComponent, there is a piece of code checking
> whether
> > elevate.xml exists on the local file system. I am wondering if there is a way
> > to bypass this?
> > QueryElevationComponent.inform() {
> >   ...
> >   File fC = new File(core.getResourceLoader().getConfigDir(), f);
> >   File fD = new File(core.getDataDir(), f);
> >   if (fC.exists() == fD.exists()) {
> >     throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
> >         "QueryElevationComponent missing config file: '" + f + "\n" +
> >         "either: " + fC.getAbsolutePath() + " or " + fD.getAbsolutePath() +
> >         " must exist, but not both.");
> >   }
> >   if (fC.exists()) {
> >     exists = true;
> >     log.info("Loading QueryElevation from: " + fC.getAbsolutePath());
> >     Config cfg = new Config(core.getResourceLoader(), f);
> >     elevationCache.put(null, loadElevationMap(cfg));
> >   }
> >   ...
> > }
> >
> > --
> > Jiayu (James) Ji,
> >
> > ***
> >
> > Cell: (312)823-7393
> > Website: https://sites.google.com/site/jiayuji/
> >
> > ***
>


Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Michael Della Bitta
As an external observer, I think the main problem is your branding.
"Realtime Near Realtime" is definitely an oxymoron, and your ranking
algorithm is called "Ranking Algorithm," which is generic enough to
suggest that a. it's the only ranking algorithm available, and b. by
implication, that Solr doesn't have one built in.

I would suggest two improvements:

1. Come up with a top-level name for your overall efforts. The Apache
Foundation has 'Apache,' so every component they build gets automatic
branding. Then your ranking algorithm could be called "Tgels Ranking
Algorithm for Apache Solr" (for example), which is totally legit. And
"Tgels Realtime Search for Apache Solr."

2. Maybe point out that you're building on top of the work of the
Apache Solr and Lucene projects a little more prominently.

I think with those two little tweaks, you'd actually very easily get
more people interested in your contributions.

Just my two cents,

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Mon, Oct 29, 2012 at 11:35 AM, Nagendra Nagarajayya
 wrote:
>
> Jack:
>
> I respect your hard work responding to user problems on the mail list. So it
> would be nicer to try out Realtime NRT than pass rogue comments about whether
> a contribution is legit/spam or a scam... I guess it illuminates one's own
> narrow-minded view ...  The spirit of open source is contributions from
> not only committers but other developers; from the Solr wiki: "A half-baked
> patch in Jira, with no documentation, no tests and no backwards
> compatibility is better than no patch at all."
>
> You would gain more respect if you actually download realtime-nrt, check
> whether it does provide a view of a realtime index compared to a point-in-time
> snapshot, see if you can understand the code, and provide clarity and
> feedback to the list if you do find problems with it. realtime-nrt offers
> search capability, as opposed to realtime-get. Check whether this is true ... I would
> really welcome your comments on the list or through the JIRA here:
>
> https://issues.apache.org/jira/browse/SOLR-3816
>
>
> Regards,
>
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> On 10/29/2012 7:30 AM, Jack Krupansky wrote:
>>
>> Could any of the committers here confirm whether this is a legitimate
>> effort? I mean, how could anything labeled "Apache ABC with XYZ" be an
>> "external project" and be sanctioned/licensed by Apache? In fact, the linked
>> web page doesn't even acknowledge the ownership of the Apache trademarks or
>> ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT
>> makes available a near realtime view". Equally nonsensical. Who knows, maybe
>> it is legit, but it sure comes across as a scam/spam.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Nagendra Nagarajayya
>> Sent: Monday, October 29, 2012 10:06 AM
>> To: solr-user@lucene.apache.org
>> Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and
>> Realtime NRT available for download
>>
>> Hi!
>>
>> I am very excited to announce the availability of Apache Solr 4.0 with
>> RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high
>> performance and more granular NRT implementation as to soft commit. The
>> update performance is about 70,000 documents / sec* (almost 1.5-2x
>> performance improvement over soft-commit). You can also scale up to 2
>> billion documents* in a single core, and query half a billion documents
>> index in ms**. Realtime NRT is different from realtime-get. realtime-get
>> does not have search capability and is a lookup by id. Realtime NRT
>> allows full search, see here 
>> for more info.
>>
>> Realtime NRT has been contributed back to Solr, see JIRA:
>> https://issues.apache.org/jira/browse/SOLR-3816
>>
>> RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
>> boolean/dismax/boost queries and is compatible with the new Lucene 4.0
>> api.
>>
>> You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
>> and Realtime NRT performance from here:
>> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
>>
>> You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
>> http://solr-ra.tgels.org
>>
>> Please download and give the new version a try.
>>
>> Note:
>> 1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project
>>
>> Regards,
>>
>> Nagendra Nagarajayya
>> http://solr-ra.tgels.org
>> http://rankingalgorithm.tgels.org
>>
>> * performance is a real use case of Apache Solr with RankingAlgorithm as
>> seen at a user installation
>> ** performance seen when using the age feature
>>
>>
>>
>>
>>
>>
>>
>>
>>
>


Re: Management of solr.xml on master server

2012-10-29 Thread Shawn Heisey

On 10/28/2012 11:34 PM, maneesha wrote:

For creating the full index for my core every quarter from scratch:
  - I create a new core e.g. "blahNov2012" using admin url with option
action=CREATE and I give it a new dataDir property e.g.
/home/blah/data/Nov2012.
- I do a full import on blahNov2012 to populate the new core "blah-Nov2012"
and test it.
- If all is good, I run the admin url with option action=SWAP to swap (blah
with blahNov2012).
- Since I have persistent="true" in the solr.xml, it updates the dataDir for
the  core "blah" to point to the new directory /home/blah/data/Nov2012.





My first question: is this the pattern most people use to create a fresh index
(e.g. create a new tmp core, test it, and swap)?

My second question: if I need to make further unrelated changes to the
solr.xml in future releases and must update the solr.xml on the production
system, I need to manually change "blah"'s data directory from
"/home/blah/data/defaultData" to "/home/blah/data/Nov2012". Is that what
needs to be done? Do people automate this step somehow in their production
releases?


Here's my solr.xml:



  
dataDir="../../data/ncmain"/>
dataDir="../../data/ncrss"/>
dataDir="../../data/inc_0"/>
dataDir="../../data/inc_1"/>
dataDir="../../data/s0_0"/>
dataDir="../../data/s0_1"/>
dataDir="../../data/s1_0"/>
dataDir="../../data/s1_1"/>
dataDir="../../data/s2_0"/>
dataDir="../../data/s2_1"/>
dataDir="../../data/s3_0"/>
dataDir="../../data/s3_1"/>
dataDir="../../data/s4_0"/>
dataDir="../../data/s4_1"/>
dataDir="../../data/s5_0"/>
dataDir="../../data/s5_1"/>

  


I do not worry about creating cores on the fly, and in normal operation 
I never have to manually touch the solr.xml file.  I have a live core 
and a build core for every shard.  When I have to completely rebuild an 
index, I clear the build core and do the build there.  When it's done, I 
swap the build core and the live core, then go back and redo any updates 
that were applied to the live core while the rebuild was happening.  The 
core names indicate live or build, but the directory names just have _0 
or _1, so there's never any need to rename anything.  At one time I did 
have a test core per shard and directories ending in _2, but over the 
course of a full year I never used it once, so I got rid of it.
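
For readers new to core swapping, the CoreAdmin call behind this pattern looks
like the following (host and core names are illustrative):

  http://localhost:8983/solr/admin/cores?action=SWAP&core=s0_build&other=s0_live

After the swap the two core names point at each other's former index
directories, which is why the dataDir values above never need to be renamed.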


Thanks,
Shawn



Re: SOLR 4.0 + ReversedWildcardFilterFactory + DefaultSolrHighlighter + multibyte chars => crash?

2012-10-29 Thread Tomas Zerolo
On Mon, Oct 29, 2012 at 08:55:27AM -0700, Ahmet Arslan wrote:
> Hi Tomas,
> 
> I think this is same case Marian reported before.
> 
> https://issues.apache.org/jira/browse/SOLR-3193
> https://issues.apache.org/jira/browse/SOLR-3901

Thanks, Ahmet. Yes, by the descriptions they look very similar. I'll
try to follow up on the bug reports.

Regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Re: facet prefix with tokenized fields

2012-10-29 Thread Grzegorz Sobczyk

I'd like to use faceting. I don't want a list of documents.

Using ngrams would give me a response which is useless for me.

Querying something like this:
fq=category_ngram:child&facet.field=category_exactly

would give me something like this (for multivalued category fields):
"toys for children"
"games"
"memory"
"tv"




On 29.10.2012 at 13:12, Rafał Kuć wrote:


Hello!

Do you have to use faceting for prefixing? Maybe it would be better to
use an ngram-based field and return the stored value?






--
Grzegorz Sobczyk



Re: SOLR 4.0 + ReversedWildcardFilterFactory + DefaultSolrHighlighter + multibyte chars => crash?

2012-10-29 Thread Ahmet Arslan
Hi Tomas,

I think this is same case Marian reported before.

https://issues.apache.org/jira/browse/SOLR-3193
https://issues.apache.org/jira/browse/SOLR-3901


--- On Mon, 10/29/12, Tomas Zerolo  wrote:

> From: Tomas Zerolo 
> Subject: SOLR 4.0 + ReversedWildcardFilterFactory + DefaultSolrHighlighter + 
> multibyte chars => crash?
> To: solr-user@lucene.apache.org
> Date: Monday, October 29, 2012, 5:23 PM
> Hi, SOLR gurus
> 
> we're experiencing a crash with SOLR 4.0 whenever the
> results contain
> multibyte characters (more precisely: German umlauts, utf-8
> encoded).
> 
> The crashes only occur when using
> ReversedWildcardFilterFactory (which
> is necessary in 4.0 to be able to have wildcards at the
> beginning of
> the search pattern, as far as I understand), *and* the
> highlighter is
> on. The stack trace (heavily snipped) looks like this:
> 
>  | 12.09.2012 13:08:12 org.apache.solr.common.SolrException
> log
>  | SCHWERWIEGEND: org.apache.solr.common.SolrException:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException:
> Token substantial exceeds length of provided text sized
> 5107
>  |         at
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:517)
>  |         at
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
>  |         at
> org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:136)
>  |         at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
>  | [...]
>  |         at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
>  |         at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
>  |         at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
>  |         at
> java.lang.Thread.run(Thread.java:662)
>  | Caused by:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException:
> Token substantial exceeds length of provided text sized
> 5107
>  |         at
> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
>  |         at
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:510)
>  |         ... 32 more
> 
> (excuse the German locale.) 
> 
> Poking around in the sources seems to point (to my untrained
> eye, that
> is) to:
> 
>   
> 
> Is this the issue biting us? Any known workarounds?
> Anything
> we might try to pin-point the problem resp. to fix the bug?
> 
> Thanks for any insights, regards
> -- 
> Tomás Zerolo
> Axel Springer AG
> Axel Springer media Systems
> BILD Produktionssysteme
> Axel-Springer-Straße 65
> 10888 Berlin
> Tel.: +49 (30) 2591-72875
> tomas.zer...@axelspringer.de
> www.axelspringer.de
> 
> Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg,
> HRB 4998
> Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
> Vorstand: Dr. Mathias Döpfner (Vorsitzender)
> Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele
>


Re: Is it possible to use something like sum() in a solr-query?

2012-10-29 Thread Rafał Kuć
Hello!

Take a look at StatsComponent - http://wiki.apache.org/solr/StatsComponent
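
A minimal example, assuming a numeric field named "price":

  q=*:*&stats=true&stats.field=price

The response then carries sum, min, max, mean, etc. for all documents matching
the query, so an SQL WHERE clause maps onto q/fq as usual.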

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> I have for example an integer field and want to sum all these values for
> all the matching documents.

> Similar to this in sql:

> SELECT SUM(/expression/ )
> FROM tables
> WHERE predicates;


> Regards,
> Markus

> On 29.10.2012 22:25, Jack Krupansky wrote:
>> Unfortunately, neither the subject nor your message says it all. Be 
>> specific - what exactly do you want to sum? All matching docs? Just 
>> the returned docs? By group? Or... what?
>>
>> You can of course develop your own search component that does whatever 
>> it wants with the search results.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Markus.Mirsberger
>> Sent: Monday, October 29, 2012 11:08 AM
>> To: solr-user@lucene.apache.org
>> Subject: Is it possible to use something like sum() in a solr-query?
>>
>> Hi,
>>
>> the subject says it all :)
>> Is there something like sum() available in a solr query to sum all
>> values of a field ?
>>
>> Regards,
>> Markus Mirsberger




Re: Is it possible to use something like sum() in a solr-query?

2012-10-29 Thread Markus.Mirsberger
I have for example an integer field and want to sum all these values for 
all the matching documents.


Similar to this in sql:

SELECT SUM(/expression/ )
FROM tables
WHERE predicates;


Regards,
Markus

On 29.10.2012 22:25, Jack Krupansky wrote:
Unfortunately, neither the subject nor your message says it all. Be 
specific - what exactly do you want to sum? All matching docs? Just 
the returned docs? By group? Or... what?


You can of course develop your own search component that does whatever 
it wants with the search results.


-- Jack Krupansky

-Original Message- From: Markus.Mirsberger
Sent: Monday, October 29, 2012 11:08 AM
To: solr-user@lucene.apache.org
Subject: Is it possible to use something like sum() in a solr-query?

Hi,

the subject says it all :)
Is there something like sum() available in a solr query to sum all
values of a field ?

Regards,
Markus Mirsberger




Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Nagendra Nagarajayya


Jack:

I respect your hard work responding to user problems on the mail list.
So it would be nicer to try out Realtime NRT than pass rogue comments
about whether a contribution is legit/spam or a scam... I guess it
illuminates one's own narrow-minded view ...  The spirit of open source is
contributions from not only committers but other developers; from the
Solr wiki: "A half-baked patch in Jira, with no documentation, no tests
and no backwards compatibility is better than no patch at all."


You would gain more respect if you actually download realtime-nrt, check
whether it does provide a view of a realtime index compared to a
point-in-time snapshot, see if you can understand the code, and provide
clarity and feedback to the list if you do find problems with it.
realtime-nrt offers search capability, as opposed to realtime-get. Check
whether this is true ... I would really welcome your comments on the list or
through the JIRA here:


https://issues.apache.org/jira/browse/SOLR-3816

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 10/29/2012 7:30 AM, Jack Krupansky wrote:
Could any of the committers here confirm whether this is a legitimate 
effort? I mean, how could anything labeled "Apache ABC with XYZ" be an 
"external project" and be sanctioned/licensed by Apache? In fact, the 
linked web page doesn't even acknowledge the ownership of the Apache 
trademarks or ASL. And the term "Realtime NRT" is nonsensical. Even 
worse: "Realtime NRT makes available a near realtime view". Equally 
nonsensical. Who knows, maybe it is legit, but it sure comes across as 
a scam/spam.


-- Jack Krupansky

-Original Message- From: Nagendra Nagarajayya
Sent: Monday, October 29, 2012 10:06 AM
To: solr-user@lucene.apache.org
Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and 
Realtime NRT available for download


Hi!

I am very excited to announce the availability of Apache Solr 4.0 with
RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high
performance and more granular NRT implementation as to soft commit. The
update performance is about 70,000 documents / sec* (almost 1.5-2x
performance improvement over soft-commit). You can also scale up to 2
billion documents* in a single core, and query half a billion documents
index in ms**. Realtime NRT is different from realtime-get. realtime-get
does not have search capability and is a lookup by id. Realtime NRT
allows full search, see here 
for more info.

Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
boolean/dismax/boost queries and is compatible with the new Lucene 4.0 
api.


You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
and Realtime NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

* performance is a real use case of Apache Solr with RankingAlgorithm as
seen at a user installation
** performance seen when using the age feature













Re: Is it possible to use something like sum() in a solr-query?

2012-10-29 Thread Jack Krupansky
Unfortunately, neither the subject nor your message says it all. Be 
specific - what exactly do you want to sum? All matching docs? Just the 
returned docs? By group? Or... what?


You can of course develop your own search component that does whatever it 
wants with the search results.


-- Jack Krupansky

-Original Message- 
From: Markus.Mirsberger

Sent: Monday, October 29, 2012 11:08 AM
To: solr-user@lucene.apache.org
Subject: Is it possible to use something like sum() in a solr-query?

Hi,

the subject says it all :)
Is there something like sum() available in a solr query to sum all
values of a field ?

Regards,
Markus Mirsberger 



SOLR 4.0 + ReversedWildcardFilterFactory + DefaultSolrHighlighter + multibyte chars => crash?

2012-10-29 Thread Tomas Zerolo
Hi, SOLR gurus

we're experiencing a crash with SOLR 4.0 whenever the results contain
multibyte characters (more precisely: German umlauts, utf-8 encoded).

The crashes only occur when using ReversedWildcardFilterFactory (which
is necessary in 4.0 to be able to have wildcards at the beginning of
the search pattern, as far as I understand), *and* the highlighter is
on. The stack trace (heavily snipped) looks like this:

 | 12.09.2012 13:08:12 org.apache.solr.common.SolrException log
 | SCHWERWIEGEND: org.apache.solr.common.SolrException: 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 
substantial exceeds length of provided text sized 5107
 | at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:517)
 | at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
 | at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:136)
 | at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
 | [...]
 | at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
 | at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
 | at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
 | at java.lang.Thread.run(Thread.java:662)
 | Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: 
Token substantial exceeds length of provided text sized 5107
 | at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
 | at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:510)
 | ... 32 more

(excuse the German locale.) 

Poking around in the sources seems to point (to my untrained eye, that
is) to:

  

Is this the issue biting us? Any known workarounds? Anything
we might try to pin-point the problem resp. to fix the bug?

Thanks for any insights, regards
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


Is it possible to use something like sum() in a solr-query?

2012-10-29 Thread Markus.Mirsberger

Hi,

the subject says it all :)
Is there something like sum() available in a solr query to sum all 
values of a field ?


Regards,
Markus Mirsberger


Re: Exception while getting Field info in Lucene

2012-10-29 Thread adityab
Hi Erick,

I have upgraded from 3.5 to 4.0. On the first index build I was able to see the
distinct terms in the UI under the Schema Browser section. I then had to add a
new stored date field to the schema and re-index.
Since then, for every field the UI shows the distinct term count as
"-1".
Can you please advise if I am missing anything in the configuration?

here is the luke response for field "_version_"

http://.../solr/admin/luke?fl=_version_



  type:     long
  schema:   ITSOF-
  index:    -TS---
  docs:     16392300
  distinct: -1
  [top-terms / histogram entries lost to XML stripping in the archive]



Looks like this problem can be solved by performing an optimize, as posted in
one of the archives:
http://lucene.472066.n3.nabble.com/always-getting-distinct-count-of-1-in-luke-response-solr4-snapshot-td3985546.html

Just curious what makes this value "-1"?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Exception-while-getting-Field-info-in-Lucene-tp4016448p4016707.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using solrj CloudSolrServer with Zookeeper ensemble

2012-10-29 Thread Timothy Potter
Hi Tobias,

You can pass a comma-delimited list of Zk addresses in your ensemble, such
as:

zk1:2181,zk2:2181,zk3:2181, etc.

Cheers,
Tim

On Mon, Oct 29, 2012 at 2:42 AM, Tobias Kraft wrote:

> Hi,
>
> when I need high availability for my Solr environment it is recommended to
> run a Zookeeper ensemble as described at the SolrCloud page:
>
> http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
>
> For accessing Solr we are using the solrj API which also contains the
> CloudSolrServer
> class for accessing Solr (
>
> http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/client/solrj/impl/CloudSolrServer.html
> ).
> Is there a way to use this class together with a Zookeeper ensemble?
>
> The constructor of the class accepts only one Zookeeper in the form
> HOST:PORT
> see also
> http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/client/solrj/impl/CloudSolrServer.html#CloudSolrServer(java.lang.String
> ,
> org.apache.solr.client.solrj.impl.LBHttpSolrServer)
>
> Is there another way to use solrj with a Zookeeper ensemble? Otherwise we
> would use an external load balancer and the solrj HttpSolrServer class.
>
> Many thanks,
> Tobias
>
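
A minimal SolrJ sketch of the comma-delimited zkHost approach (host names and
the collection name are illustrative):

  // The whole ZooKeeper ensemble goes into one zkHost string.
  CloudSolrServer server =
      new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
  server.setDefaultCollection("collection1");
  QueryResponse rsp = server.query(new SolrQuery("*:*"));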


Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Darren Govoni
It certainly seems to be a rogue project, but I can't understand the
meaning of "realtime near realtime (NRT)" either. At best, it's oxymoronic.



On 10/29/2012 10:30 AM, Jack Krupansky wrote:
Could any of the committers here confirm whether this is a legitimate 
effort? I mean, how could anything labeled "Apache ABC with XYZ" be an 
"external project" and be sanctioned/licensed by Apache? In fact, the 
linked web page doesn't even acknowledge the ownership of the Apache 
trademarks or ASL. And the term "Realtime NRT" is nonsensical. Even 
worse: "Realtime NRT makes available a near realtime view". Equally 
nonsensical. Who knows, maybe it is legit, but it sure comes across as 
a scam/spam.


-- Jack Krupansky

-Original Message- From: Nagendra Nagarajayya
Sent: Monday, October 29, 2012 10:06 AM
To: solr-user@lucene.apache.org
Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and 
Realtime NRT available for download


Hi!

I am very excited to announce the availability of Apache Solr 4.0 with
RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high
performance and more granular NRT implementation as to soft commit. The
update performance is about 70,000 documents / sec* (almost 1.5-2x
performance improvement over soft-commit). You can also scale up to 2
billion documents* in a single core, and query half a billion documents
index in ms**. Realtime NRT is different from realtime-get. realtime-get
does not have search capability and is a lookup by id. Realtime NRT
allows full search, see here 
for more info.

Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
boolean/dismax/boost queries and is compatible with the new Lucene 4.0 
api.


You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
and Realtime NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

* performance is a real use case of Apache Solr with RankingAlgorithm as
seen at a user installation
** performance seen when using the age feature











RE: DIH nested entities don't work

2012-10-29 Thread Dyer, James
If your subentities are large, the default DIH cache probably isn't going to
work because it stores all the data in memory. (This is
CachedSqlEntityProcessor for Solr 3.5 or earlier;
cacheImpl="SortedMapBackedCache" for 3.6 or later.)

DIH for Solr 3.6 and later supports pluggable caches (see 
https://issues.apache.org/jira/browse/SOLR-2382), so you have the option of 
caching to disk.  Unfortunately the only good disk-backed cache available here
uses Berkeley DB, which has an incompatible license and cannot be included
with an Apache project.  See https://issues.apache.org/jira/browse/SOLR-2613
for the code; you'll have to download bdb-je from Oracle yourself.  We also
converted from Endeca, and needed these cache options to replace the Forge
Cache feature, which we depended on heavily for joins.  It was a lot of work to
set this up with DIH and get everything to work correctly, but the end
result for us is actually a lot faster (and way more flexible) than Forge ever 
was.

By the way, there have been sporadic reports of unexpected behavior using
caching with 3.6.  You may want to try 4.0 if you're currently running 3.6.
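
For reference, a hedged sketch of a cached sub-entity in data-config.xml
(entity, table and column names are made up for illustration):

  <entity name="item" query="select id, name from item">
    <entity name="feature"
            query="select item_id, description from feature"
            cacheKey="item_id"
            cacheLookup="item.id"
            cacheImpl="SortedMapBackedCache"/>
  </entity>

Swapping cacheImpl for the SOLR-2613 Berkeley-DB-backed class is what moves
the cache off the heap and onto disk.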

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: mroosendaal [mailto:mroosend...@yahoo.com] 
Sent: Monday, October 29, 2012 5:06 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH nested entities don't work

Hi,

It seems to work without the cache option; the downside is it will take
ages for everything to be indexed, and my test set is 20 times smaller than
the production set.

Indexing just the root entity takes 3 minutes (>600K documents), but every
subentity adds more time, which is expected, though I would have hoped it
would at least be faster.

Our current search engine (Endeca) does the same thing but takes 'only'
1h20m.

How can I speed this up? The bottleneck is not CPU or memory, but simply
the database time.

Thanks,
Maarten



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4016618.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Jack Krupansky
Could any of the committers here confirm whether this is a legitimate 
effort? I mean, how could anything labeled "Apache ABC with XYZ" be an 
"external project" and be sanctioned/licensed by Apache? In fact, the linked 
web page doesn't even acknowledge the ownership of the Apache trademarks or 
ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT 
makes available a near realtime view". Equally nonsensical. Who knows, maybe 
it is legit, but it sure comes across as a scam/spam.


-- Jack Krupansky

-Original Message- 
From: Nagendra Nagarajayya

Sent: Monday, October 29, 2012 10:06 AM
To: solr-user@lucene.apache.org
Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime 
NRT available for download


Hi!

I am very excited to announce the availability of Apache Solr 4.0 with
RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high
performance and more granular NRT implementation as to soft commit. The
update performance is about 70,000 documents / sec* (almost 1.5-2x
performance improvement over soft-commit). You can also scale up to 2
billion documents* in a single core, and query half a billion documents
index in ms**. Realtime NRT is different from realtime-get. realtime-get
does not have search capability and is a lookup by id. Realtime NRT
allows full search, see here 
for more info.

Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
boolean/dismax/boost queries and is compatible with the new Lucene 4.0 api.

You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
and Realtime NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

* performance is a real use case of Apache Solr with RankingAlgorithm as
seen at a user installation
** performance seen when using the age feature









RE: Having problem runing apache-solr on my linux server

2012-10-29 Thread Markus Jelsma

Hi - Detach it from the terminal: java -jar start.jar &


 
 
-Original message-
> From:zakari mohammed 
> Sent: Mon 29-Oct-2012 15:22
> To: solr-user@lucene.apache.org
> Subject: Having problem runing apache-solr on my linux server
> 
> 
> hello dear,
> I am running apache-solr 3.6.1 on my linux host using java -jar start.jar,
> and everything runs fine,
> but java stops running as soon as the command terminal closes. I have
> searched for a means of making it run continuously but did not find one.
> 
> please, I need help making apache-solr run without stopping when the
> terminal closes.
> 
> thanks.
> Zakari M  
> 


[Announce] Realtime NRT integrated with Solr available for download

2012-10-29 Thread Nagendra Nagarajayya

Hi!

I am very excited to announce the availability of an integrated Apache 
Solr 4.0 with Realtime NRT download:


http://solr-ra.tgels.org/realtime-nrt.jsp

Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

Here is more info about Realtime NRT:

The Lucene/Solr search and commit architecture is designed to work off
"point-in-time snapshots" of the index. Any add/update/delete needs a
commit to be visible to searches (or at least a soft commit). A soft commit
re-opens the SolrIndexSearcher object and can be a performance
limitation if soft commits happen more than once per second; see this
blog: http://searchhub.org/dev/2011/09/07/realtime-get/. Realtime NRT
makes available a near realtime view of the index, so any changes made
to the index are immediately visible. Performance is not a limitation as
it does not close the SolrIndexSearcher object as with soft commit.


Realtime NRT is also different from realtime-get, which is a simple
lookup by id and needs the transaction log to be enabled. realtime-get
does not have search capability. Realtime NRT allows full search, so you
could search by id, text, location, etc. using boolean, dismax,
faceting, and range queries, i.e. no change to existing functionality. No new
request handlers need to be defined in solrconfig.xml, so all of your
existing queries work as is with no changes, except that the results
returned are in near real time. Realtime NRT also does not need the
transaction update log needed by realtime-get, so you can turn this off
for improved performance. The autoCommit frequency can also be increased to an
hour from the default of 15 secs for improved performance (remember,
commits can slow down your performance).



Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org



Re: Having problem runing apache-solr on my linux server

2012-10-29 Thread Rafał Kuć
Hello!

What you are experiencing is normal - when you run an application in
the foreground it will close when you close the terminal. You can use
command like screen or you can use nohup. However I would advise
installing a standalone Jetty or Tomcat and just deploy Solr as any
other web application.
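
For the quick fix, a typical nohup invocation looks like this (the log path is
illustrative):

  nohup java -jar start.jar > /var/log/solr.log 2>&1 &

nohup detaches the process from the terminal's hangup signal, so it keeps
running after you log out.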

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch


> hello dear,
> I am running apache-solr 3.6.1 on my linux host using java -jar
> start.jar, and everything runs fine,
> but java stops running as soon as the command terminal closes. I have
> searched for a means of making it run continuously but did not find one.

> please, I need help making apache-solr run without stopping when the
> terminal closes.

> thanks.
> Zakari M



Having problem runing apache-solr on my linux server

2012-10-29 Thread zakari mohammed

hello dear,
I am running apache-solr 3.6.1 on my linux host using java -jar start.jar,
and everything runs fine,
but java stops running as soon as the command terminal closes. I have searched
for a means of making it run continuously but did not find one.

please, I need help making apache-solr run without stopping when the
terminal closes.

thanks.
Zakari M  


Re: select one document with a given value for an attribute in a single page

2012-10-29 Thread Rafał Kuć
Hello!

Try the grouping feature of Solr -
http://wiki.apache.org/solr/FieldCollapsing . You can collapse
documents based on the field value, which would be a shop identifier
in your case.
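
A hedged example against the schema in the quoted message below (the field
name is taken from the question):

  q=*:*&group=true&group.field=related_shop_id&group.limit=1&rows=26

This returns 26 groups per page with the single top-scoring offer per shop;
note that the grouped field must be single-valued and indexed.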

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> Hi all,

> I have the following schema

> offer:
> - offer_id
> - offer_title
> - offer_description
> - related_shop_id
> - related shop_name
> - offer_price

> Each offer is related to a shop.
> In one shop, we have many offers

> I would like to show in one page (26 offers) only one offer per shop.

> I need help to implement this feature. 


> Thanks



> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/select-one-document-with-a-given-value-for-an-attribute-in-a-single-page-tp4016679.html
> Sent from the Solr - User mailing list archive at Nabble.com.



[Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Nagendra Nagarajayya

Hi!

I am very excited to announce the availability of Apache Solr 4.0 with 
RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high
performance and more granular NRT implementation compared to soft commit. The
update performance is about 70,000 documents / sec* (almost 1.5-2x
performance improvement over soft commit). You can also scale up to 2
billion documents* in a single core, and query a half-billion-document
index in ms**. Realtime NRT is different from realtime-get: realtime-get
does not have search capability and is a lookup by id. Realtime NRT
allows full search, see here
for more info.


Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
boolean/dismax/boost queries, and is compatible with the new Lucene 4.0 API.


You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4 
and Realtime NRT performance from here:

http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

* performance is a real use case of Apache Solr with RankingAlgorithm as 
seen at a user installation

** performance seen when using the age feature









select one document with a given value for an attribute in a single page

2012-10-29 Thread Jamel ESSOUSSI
Hi all,

I have the following schema

offer:
- offer_id
- offer_title
- offer_description
- related_shop_id
- related shop_name
- offer_price

Each offer is related to a shop.
One shop can have many offers.

I would like to show in one page (26 offers) only one offer per shop.

I need help to implement this feature. 


Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/select-one-document-with-a-given-value-for-an-attribute-in-a-single-page-tp4016679.html
Sent from the Solr - User mailing list archive at Nabble.com.


hot shard concept

2012-10-29 Thread Dmitry Kan
Hi everyone,

at this year's Berlin Buzz words conference someone (sematext?) have
described a technique of a hot shard. The idea is to have a slim shard to
maximize the update throughput during a day (when millions of docs need to
be posted) and make sure the indexed documents are immediately searchable.
In the end of the day the day's documents are moved to cold shards. If I'm
not mistaken, this was implemented for ElasticSearch. I'm currently
implementing something similar (but pretty tailored to our logical sharding
use case) for Solr (3.x). The feature set looks roughly like this:

1) front end solr (query router) is aware of the hot shard: it directs the
incoming queries to the hot and "cold" shards.
2) new incoming documents are directed first to the hot shard and then
periodically (like once a day or once a week) moved over to the closest in
time cold shard. And for that...
3) hot shard index is being partitioned low level using Lucene's
IndexReader / IndexWriter with the implementation based on [1], [2] and
customized to logical (time-based) sharding.


The question is: is doing index partitioning low-level a good way of
implementing the hot shard concept? That is, is there anything better
operationally-wise from the point of view of disaster recovery / search
cluster support? Am I missing some obvious SOLR-ish solution?
Doing instead the periodical hot shard cleaning and re-posting its source
documents to the closest cold shard is less modular and hence more
complicated operationally for us.

Please let me know, if you need more details or if the problem isn't clear
enough. Thanks.

[1]
http://blog.foofactory.fi/2008/01/regenerating-equally-sized-shards-from.html
[2] https://github.com/HON-Khresmoi/hash-based-index-splitter

-- 
Regards,

Dmitry Kan


Remove entries from search result, custom collector

2012-10-29 Thread Markus Jelsma
Hi,

We want to remove some results from the result set based on the result of some
algorithms run on fields of adjacent documents. For example, if doc2 resembles
doc1 we want to remove it. We cannot do this in a search component because
of problems with paging, maintaining rows=N results despite removal of some 
records etc. Instead i'd like to try to override the TopScoreDocCollector in 
SolrIndexSearcher and implement Collector.collect(int doc), however, the 
Javadoc states that it's not a good idea to use IndexSearcher or IndexReader to 
obtain the document and some fields.

Any hints to share to keep up performance in the collector? Any other ideas 
except implement as search component or use field collapsing?

Thanks,
Markus
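
One common trick for keeping per-hit field access cheap inside a collector is
to pull a FieldCache array per segment instead of reading stored fields. A
hedged sketch against the Lucene 4.0 Collector API (the int field "sourceHash"
and the exact FieldCache accessor signature are assumptions, and collect()
sees docs in docID order, not score order, so comparing score-adjacent
documents still needs a second pass):

  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.Scorer;

  public class SourceDedupCollector extends Collector {
    private final Collector delegate;        // e.g. a TopScoreDocCollector
    private final Set<Integer> seen = new HashSet<Integer>();
    private int[] hashes;                    // per-segment uninverted values

    public SourceDedupCollector(Collector delegate) { this.delegate = delegate; }

    @Override public void setScorer(Scorer scorer) throws IOException {
      delegate.setScorer(scorer);
    }

    @Override public void setNextReader(AtomicReaderContext ctx) throws IOException {
      // Uninverted access: no per-hit stored-field (disk) reads.
      hashes = FieldCache.DEFAULT.getInts(ctx.reader(), "sourceHash", false);
      delegate.setNextReader(ctx);
    }

    @Override public void collect(int doc) throws IOException {
      if (!seen.add(hashes[doc])) return;    // drop later docs from same source
      delegate.collect(doc);
    }

    @Override public boolean acceptsDocsOutOfOrder() { return false; }
  }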


Re: DIH nested entities don't work

2012-10-29 Thread Gora Mohanty
On 29 October 2012 15:36, mroosendaal  wrote:
> Hi,
>
> It seems to work without the cache option; the downside is it will take
> ages for everything to be indexed, and my test set is 20 times smaller than
> the production set.

Someone else will have to respond as to how to use CachedSqlEntity in
Solr 4.0. I have not tried it out yet.

> Indexing just the root item takes 3 minutes (>600K) but every subentity
> takes more time which is obvious but i would've hoped it would at least be
> faster.
[...]

This, on the other hand, does not sound right: 3min per record seems way too
high. Have you given Solr enough memory? The bottleneck could also be the
network or the database.

> How can i speed this up, the bottleneck is not the CPU or memory, but simply
> the databasetime.

What do you mean by the "database time"? It seems odd that the database
would take long to respond to a simple select. Are your tables indexed?
Have you checked CPU/RAM on the database server?

Regards,
Gora


Re: facet prefix with tokenized fields

2012-10-29 Thread Rafał Kuć
Hello!

Do you have to use faceting for prefixing? Maybe it would be better to use
an ngram-based field and return the stored value?
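
A hedged sketch of that approach (the type name and gram sizes are
illustrative):

  <fieldType name="text_prefix" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Matching the user's prefix against this field and returning the stored
(untokenized) value would give back the full "toys for children" string.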


-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

>  
> Hi.

> Is there any solution to facet documents with a specified prefix on
> some tokenized field, but get the original value of the field in the result?




> e.g.:
>
> <field name="category_ac" type="..." indexed="true" stored="true" multiValued="true" />
>
> <fieldType ...>
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory" />
>   </analyzer>
> </fieldType>







> Indexed value: "toys for children"

> query:
> q=&start=0&rows=0&facet.limit=-1&facet.mincount=1&f.category_ac.facet.prefix=chi&facet.field=category_ac&facet=true




> I'd like to get exactly "toys for children", not "children"




> --

> Grzegorz Sobczyk





facet prefix with tokenized fields

2012-10-29 Thread Grzegorz Sobczyk

Hi.

Is there any solution to facet documents with a specified prefix on some
tokenized field, but get the original value of the field in the result?

e.g.:

<field name="category_ac" type="..." indexed="true" stored="true" multiValued="true" />

<fieldType ...>
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
  </analyzer>
</fieldType>

Indexed value: "toys for children"

query:
q=&start=0&rows=0&facet.limit=-1&facet.mincount=1&f.category_ac.facet.prefix=chi&facet.field=category_ac&facet=true

I'd like to get exactly "toys for children", not "children"

--
Grzegorz Sobczyk

Re: improving score of result set

2012-10-29 Thread Erick Erickson
I don't think you're reading the grouping right. When you use grouping,
you get the top N groups, and within each group you get the top M
scoring documents. So you can actually get _more_ documents back than in
the non-grouping case and your app can then intelligently intersperse them
however you want.
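
For example (the field name is illustrative):

  q=foo&group=true&group.field=site_url&rows=10&group.limit=3

returns the top 10 groups (sites), each with up to its 3 top-scoring docs,
i.e. potentially 30 documents for the app to intersperse client-side.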

Best
Erick

On Mon, Oct 29, 2012 at 5:02 AM, Alexander Aristov
 wrote:
> Interesting but not exactly what I want to get.
>
> If I group items then I will get a small number of docs. I don't want this; I
> need all of them.
>
> Best Regards
> Alexander Aristov
>
>
> On 29 October 2012 12:05, yunfei wu  wrote:
>
>> Besides changing the scoring algorithm, what about "Field Collapsing" -
>> http://wiki.apache.org/solr/FieldCollapsing - to collapse the results from
>> same website url?
>>
>> Yunfei
>>
>>
>> On Mon, Oct 29, 2012 at 12:43 AM, Alexander Aristov <
>> alexander.aris...@gmail.com> wrote:
>>
>> > Hi everybody,
>> >
>> > I have a question about scoring calculation algorithms and approaches.
>> >
>> > Lets say I have 10 documents. 8 of the them come from one web site (I
>> have
>> > a field in schema with URL) and the other 2 from other different web
>> sites.
>> > So for this example I have 3 web sites.
>> >
>> > For some queries those 8 documents match the query terms better and they
>> > appear at the top of the results, so the 8 docs from one source come
>> > first and the other two come next and last.
>> >
>> > I want to maybe artificially improve score of those 2 docs and put them
>> > atop. I don't want that they necessarily go first but if they come in the
>> > middle of the result set it would be perfect.
>> >
>> > One of the ideas is to reduce score for docs in the result set from one
>> > site so that if it contains too many docs from one source total scoring
>> of
>> > each those docs would be reduced proportionally.
>> >
>> > Important thing is that I don't want to reduce doc score permanently.
>> Only
>> > at query time. Maybe some functional queries can help me?
>> >
>> > How can I do this or maybe there are other ideas.
>> >
>> > Best Regards
>> > Alexander Aristov
>> >
>>


SOLR - To point multiple indexes in different folder

2012-10-29 Thread ravi.n
Hello Solr Gurus,

I am a newbie to Solr; below are my requirements:

1. We have 7 folders containing index files at which the Solr application
should be pointed. I understand the shards feature can be used for
searching across them (see the example query below); is there any other
alternative? Each folder has around 24 million documents.
2. We need to configure Solr to index new incoming data from a
database/CSV file; what is the required configuration in Solr to achieve
this?
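For requirement 1, I imagine a distributed query would look something like
this (host and core names are placeholders):

http://localhost:8983/solr/core0/select?q=*:*&shards=localhost:8983/solr/core0,localhost:8983/solr/core1,localhost:8983/solr/core2

(and so on, listing all 7 cores in the shards parameter).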

Any quick response on this will be appreciated.
Thanks

Regards,
Ravi





Re: throttle segment merging

2012-10-29 Thread Michael McCandless
With Lucene 4.0, FSDirectory now supports merge bytes/sec throttling
(FSDirectory.setMaxMergeWriteMBPerSec): it limits the maximum bytes/sec
load that merging puts on the IO system.

Not sure if it's been exposed in Solr / ElasticSearch yet ...
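For the embedded-Lucene case it is just a Directory-level setting; a
minimal sketch (the index path is a placeholder, and I'm assuming the
setter takes MB/sec as a double):

import java.io.File;
import org.apache.lucene.store.FSDirectory;

FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
dir.setMaxMergeWriteMBPerSec(10.0); // cap merge writes at ~10 MB/sec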

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 29, 2012 at 7:07 AM, Tomás Fernández Löbbe
 wrote:
>>>> Is there way to set up logging to output something when segment merging
>>>> runs?
>>>
>>> I think segment merging is logged when you enable infoStream logging (you
>>> should see it commented in the solrconfig.xml)
>>
>> no, segment merging is not logged at INFO level. it needs customized log
>> config.
>
>
> INFO level is not the same as infoStream. See solrconfig.xml; there is a
> commented section that talks about it, and if you uncomment it, it will
> generate a file with low-level Lucene logging. This file will include
> segment information, including merging.
>
>
>>>> Can segment merges be throttled?
>>>
>>> You can change when and how segments are merged with the merge policy;
>>> maybe it's enough for you to change the initial settings (mergeFactor,
>>> for example)?
>>
>> I am now researching elasticsearch; it can do it. it's lucene 3.6 based
>>
>
>
> I don't know if this is what you are looking for, but the TieredMergePolicy
> (the default) allows you to set the maximum number of segments to be merged
> at once and the maximum size of segments created during normal merging.
> Another option is, as you said, to create a Jira for a new merge policy.
>
> Tomás


Re: throttle segment merging

2012-10-29 Thread Tomás Fernández Löbbe
>>> Is there way to set up logging to output something when segment merging
>>> runs?
>>
>> I think segment merging is logged when you enable infoStream logging (you
>> should see it commented in the solrconfig.xml)
>
> no, segment merging is not logged at INFO level. it needs customized log
> config.


INFO level is not the same as infoStream. See solrconfig.xml; there is a
commented section that talks about it, and if you uncomment it, it will
generate a file with low-level Lucene logging. This file will include
segment information, including merging.
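In the stock example solrconfig.xml the commented-out section looks
roughly like this (the file name is whatever you choose):

<infoStream file="INFOSTREAM.txt">true</infoStream>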

>>> Can segment merges be throttled?
>>
>> You can change when and how segments are merged with the merge policy;
>> maybe it's enough for you to change the initial settings (mergeFactor,
>> for example)?
>
> I am now researching elasticsearch; it can do it. it's lucene 3.6 based
>


I don't know if this is what you are looking for, but the TieredMergePolicy
(the default) allows you to set the maximum number of segments to be merged
at once and the maximum size of segments created during normal merging.
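For example, in solrconfig.xml (the values here are illustrative, not
recommendations):

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <double name="maxMergedSegmentMB">5000.0</double>
</mergePolicy>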
Another option is, as you said, to create a Jira for a new merge policy.

Tomás


Re: throttle segment merging

2012-10-29 Thread Radim Kolar
Is there a JIRA ticket dedicated to throttling segment merges? I could not
find any, but JIRA search kinda sucks.


It should be ported from ES because it's not much code.


Re: DIH nested entities don't work

2012-10-29 Thread mroosendaal
Hi,

It seems to work without the cache option; the downside is that it takes
ages for everything to be indexed, and my test set is 20 times smaller
than the production set.

Indexing just the root item takes 3 minutes (>600K), but every sub-entity
takes more time, which is obvious, but I would've hoped it would at least
be faster.

Our current search engine (Endeca) does the same thing but takes 'only'
1h20m.

How can I speed this up? The bottleneck is not the CPU or memory, but
simply the database time.

Thanks,
Maarten





Re: throttle segment merging

2012-10-29 Thread Radim Kolar

On 29.10.2012 at 0:09, Lance Norskog wrote:

> 1) Do you use compound files (CFS)? This adds a lot of overhead to merging.

I do not know. What's the solr configuration statement for turning them on/off?
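(For what it's worth, the toggle appears in the index settings section of
the stock example solrconfig.xml as something like:

<useCompoundFile>false</useCompoundFile>

so setting it to true/false should turn CFS on/off.)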

> 2) Does ES use the same merge policy code as Solr?

ES rate limiting:

http://www.elasticsearch.org/guide/reference/index-modules/store.html
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/indices/store/IndicesStore.java
http://rajish.github.com/api/elasticsearch/0.20.0.Beta1-SNAPSHOT/org/apache/lucene/store/StoreRateLimiting.html




Re: Occasional Solr performance issues

2012-10-29 Thread Dotan Cohen
On Mon, Oct 29, 2012 at 7:04 AM, Shawn Heisey  wrote:
> They are indeed Java options.  The first two control the maximum and
> starting heap sizes.  NewRatio controls the relative size of the young and
> old generations, making the young generation considerably larger than it is
> by default.  The others are garbage collector options.  This seems to be a
> good summary:
>
> http://www.petefreitag.com/articles/gctuning/
>
> Here's the official Sun (Oracle) documentation on GC tuning:
>
> http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
>

Thank you Shawn! Those are exactly the documents that I need. Google
should hire you to fill in the pages when someone searches for "java
garbage collection". Interestingly, I just checked, and bing.com does
list the Oracle page on the first page of results. I shudder to think
that I might have to switch search engines!
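For anyone who lands on this thread later, the options Shawn described
combine on the command line roughly like this (heap sizes and GC choice
are illustrative, not recommendations):

java -Xms2g -Xmx2g -XX:NewRatio=1 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -jar start.jar

(-Xms/-Xmx set the starting/maximum heap, a lower NewRatio enlarges the
young generation, and the last two flags select the concurrent collector.)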

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: improving score of result set

2012-10-29 Thread Alexander Aristov
Interesting, but not exactly what I want.

If I group items then I will get a small number of docs. I don't want
this. I need all of them.

Best Regards
Alexander Aristov


On 29 October 2012 12:05, yunfei wu  wrote:

> Besides changing the scoring algorithm, what about "Field Collapsing" -
> http://wiki.apache.org/solr/FieldCollapsing - to collapse the results
> from the same website URL?
>
> Yunfei
>
>
> On Mon, Oct 29, 2012 at 12:43 AM, Alexander Aristov <
> alexander.aris...@gmail.com> wrote:
>
> > Hi everybody,
> >
> > I have a question about scoring calculation algorithms and approaches.
> >
> > Let's say I have 10 documents. 8 of them come from one web site (I have
> > a field in the schema with the URL) and the other 2 from two other web
> > sites. So for this example I have 3 web sites.
> >
> > For some queries those 8 documents match the query terms better and
> > appear at the top of the results, so the 8 docs from one source come
> > first and the other two come last.
> >
> > I want to artificially boost the score of those 2 docs and move them
> > up. They don't necessarily have to go first; if they land in the middle
> > of the result set it would be perfect.
> >
> > One idea is to reduce the score of docs in the result set that come
> > from one site, so that if the set contains too many docs from one
> > source, the total score of each of those docs is reduced proportionally.
> >
> > The important thing is that I don't want to reduce doc scores
> > permanently, only at query time. Maybe function queries can help me?
> >
> > How can I do this, or are there other ideas?
> >
> > Best Regards
> > Alexander Aristov
> >
>


Re: improving score of result set

2012-10-29 Thread yunfei wu
Besides changing the scoring algorithm, what about "Field Collapsing" -
http://wiki.apache.org/solr/FieldCollapsing - to collapse the results from
the same website URL?
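For example, with a field holding the site (field name assumed here):

q=some+query&group=true&group.field=site&group.limit=1&group.main=true

group.main=true flattens the groups back into a normal result list, so you
would get just the top document per site, interleaved by score.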

Yunfei


On Mon, Oct 29, 2012 at 12:43 AM, Alexander Aristov <
alexander.aris...@gmail.com> wrote:

> Hi everybody,
>
> I have a question about scoring calculation algorithms and approaches.
>
> Let's say I have 10 documents. 8 of them come from one web site (I have
> a field in the schema with the URL) and the other 2 from two other web
> sites. So for this example I have 3 web sites.
>
> For some queries those 8 documents match the query terms better and
> appear at the top of the results, so the 8 docs from one source come
> first and the other two come last.
>
> I want to artificially boost the score of those 2 docs and move them up.
> They don't necessarily have to go first; if they land in the middle of
> the result set it would be perfect.
>
> One idea is to reduce the score of docs in the result set that come from
> one site, so that if the set contains too many docs from one source, the
> total score of each of those docs is reduced proportionally.
>
> The important thing is that I don't want to reduce doc scores
> permanently, only at query time. Maybe function queries can help me?
>
> How can I do this, or are there other ideas?
>
> Best Regards
> Alexander Aristov
>