Append children documents for nested document
Hi,

I have these nested documents in Solr:

    curl http://localhost:8983/solr/update/json?commit=true -H 'Content-type:application/json' -d '
    [
      {
        "id": "chapter1",
        "content_type": "chapter",
        "_childDocuments_": [
          { "id": "1-1", "text": "xxx" },
          { "id": "1-2", "text": "yyy" }
        ]
      }
    ]'

Then I would like to use atomic updates to add one more child document under the parent document id:chapter1, like:

    curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
    [
      {
        "id": "chapter1",
        "_childDocuments_": {
          "add": { "id": "1-3", "text": "zzz" }
        }
      }
    ]'

It doesn't work, and Solr returns:

    {"responseHeader":{"status":400,"QTime":0},"error":{"msg":"Expected: ARRAY_START but got OBJECT_START at [58]","code":400}}

How can I add child documents to a specific parent document?

Thanks,
Brad

--
View this message in context: http://lucene.472066.n3.nabble.com/Append-children-documents-for-nested-document-tp4157087.html
Sent from the Solr - User mailing list archive at Nabble.com.
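A note on the error above: the JSON update handler expects `_childDocuments_` to be an array (hence "Expected: ARRAY_START but got OBJECT_START"), so the atomic-update-style object is rejected. Moreover, in Solr 4.x atomic updates do not appear to extend to block-joined child documents at all; as far as we know, the usual workaround is to reindex the whole block with the new child included, along these lines (field values taken from the question):

```json
[
  {
    "id": "chapter1",
    "content_type": "chapter",
    "_childDocuments_": [
      { "id": "1-1", "text": "xxx" },
      { "id": "1-2", "text": "yyy" },
      { "id": "1-3", "text": "zzz" }
    ]
  }
]
```

This payload would be posted to the same /update/json?commit=true endpoint used for the original insert, replacing the existing block.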
Re: Edismax mm and efficiency
Indeed: https://issues.apache.org/jira/browse/LUCENE-4571. My feeling is that it gives a significant gain with high mm values.

On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> Are there any speed advantages to using "mm"? I can imagine pruning the set of matching documents early, which could help, but is that (or something else) done?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: indexing unique keys
Hello,

You are asking without giving context. What's the size of the sets, the desired TPS, the key length, and even the values? It's hard to answer definitively. This is not the primary use case for Lucene, and it adds some unnecessary overhead. However, the community has collected a few workarounds for this kind of problem. On the other hand, as far as I know, executing queries like WHERE x IN (1, ..., 2324) is not a piece of cake for SQL servers either.

You can follow the link at https://plus.google.com/u/0/+MichaelMcCandless/posts/8VNydNi3wvK to find a relevant benchmark. It might help you get at least rough estimates for the Lucene solution.

On Thu, Sep 4, 2014 at 5:53 PM, Mark , N <nipen.m...@gmail.com> wrote:
> I have a use-case where we want to store unique keys (hashes) which would be used to compare against another set of keys (hashes). For example:
>
> Index set = { h1, h2, h3, h4 }
> Comparison set = { h1, h2 }
> Result set = h1, h2
>
> Would it be an advantage to store the index set in Solr instead of in a traditional database?
>
> Thanks in advance,
> *Nipen Mark*

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
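The hash comparison described in the question is, at its core, a set intersection; a minimal Python sketch of the expected semantics (the hash values are the placeholders from the example):

```python
# Index set: the hashes already indexed (whether in Solr or a database).
index_set = {"h1", "h2", "h3", "h4"}

# Comparison set: the incoming hashes to match against the index.
comparison_set = {"h1", "h2"}

# The result is the intersection of the two sets.
result = index_set & comparison_set
print(sorted(result))  # ['h1', 'h2']
```

In Solr terms this would translate to a boolean OR query over the key field, one clause per comparison hash, which is where query-size limits and per-clause overhead enter the trade-off.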
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Hi Shawn,

Thanks for your reply.

The memory setting of my Solr box is 12G physical memory, 4G for Java (-Xmx4096m). The index size is around 4G in Solr 4.9; I think it was over 6G in Solr 4.0.

I do think the Java RAM size is one of the reasons for this slowness. I'm doing one big commit, and when the ingestion process is 50% finished I can see the Solr server already using over 90% of full memory. I'll try to assign more RAM to the Solr JVM. But from your experience, does 4G sound like a good number for the Java heap size in my scenario? Is there any way to reduce memory usage during index time? (One thing I know of is doing a few commits instead of one big commit.) My concern is that, given I have 12G in total, if I assign too much to the Solr server I may not have enough left for the OS to cache the Solr index files.

I had a look at the Solr config file but couldn't find anything obviously wrong. Just wondering which parts of that config file would impact the index time?

Thanks,
Ryan

> One possible source of problems with that particular upgrade is the fact that stored field compression was added in 4.1, and termvector compression was added in 4.2. They are on by default and cannot be turned off. The compression is typically fast, but with very large documents like yours, it might result in pretty major computational overhead. It can also require additional java heap, which ties into what follows:
>
> Another problem might be RAM-related. If your java heap is very large, or just a little bit too small, there can be major performance issues from garbage collection. Based on the fact that the earlier version performed well, a too-small heap is more likely than a very large heap.
>
> If your index size is such that it can't be effectively cached by the amount of total RAM on the machine (minus the java heap assigned to Solr), that can cause performance problems. Your index size is likely to be several gigabytes, and might even reach double-digit gigabytes. Can you relate those numbers -- index size, java heap size, and total system RAM? If you can, it would also be a good idea to share your solrconfig.xml.
>
> Here's a wiki page that goes into more detail about possible performance issues. It doesn't mention the possible compression problem:
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
RE: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Hi Erick,

As Ryan Ernst noticed, those big fields (e.g. majorTextSignalStem) are not stored. There are a few stored fields in my schema, but they are very small fields, basically the name or id of that document. I tried turning them off (only storing the id field) and that didn't make any difference.

Thanks,
Ryan

> Ryan:
>
> As it happens, there's a discussion on the dev list about this. If at all possible, could you try a brief experiment? Turn off all the storage, i.e. set stored=false on all fields. It's a lot to ask, but it'd help the discussion. Or join the discussion at https://issues.apache.org/jira/browse/LUCENE-5914.
>
> Best,
> Erick

> From: Li, Ryan
> Sent: Friday, September 05, 2014 3:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
>
> [earlier message quoted in full, trimmed]
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Hi guys,

Just an update: I've tried Solr 4.10 (same code as for Solr 4.9), and it has the same indexing speed as 4.0. The only problem left now is that Solr 4.10 takes more memory than 4.0, so I'm trying to figure out the best number for the Java heap size.

I think that proves there is a performance issue in Solr 4.9 when indexing big documents (even ones just over 1 MB).

Thanks,
Ryan
FAST-like document vector data structures in Solr?
Hello all,

as the migration from FAST to Solr is a relevant topic for several of our customers, there is one issue that does not seem to be addressed by Lucene/Solr: document vectors, FAST-style.

These document vectors are used to form metrics of similarity, i.e., they may be used as a semantic fingerprint of documents to define similarity relations. I can think of several ways of approximating a mapping of this mechanism to Solr, but there are always drawbacks, mostly performance-wise.

Has anybody else encountered and possibly approached this challenge so far? Is there anything in the roadmap of Solr that has not revealed itself to me, addressing this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen
SolrJ 4.10.0 errors
Hi,

I have upgraded from Solr 4.9 to 4.10 and the server side seems fine, but the client is reporting the following exception:

    org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: solr_host.somedomain
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:562)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at ... (company's related packages)
    Caused by: org.apache.http.NoHttpResponseException: solr_host.somedomain failed to respond
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:161)
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:153)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 9 more

To test, I downgraded the client to 4.9 and the error is gone.

Best regards,
Guido.
Re: FAST-like document vector data structures in Solr?
Hi,

Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component

And just to show some impressive search functionality of the wiki ;) :
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim

2014-09-05 9:44 GMT+02:00 Jürgen Wagner (DVT) <juergen.wag...@devoteam.com>:
> [original question quoted in full, trimmed]
Re: SolrJ 4.10.0 errors
Sorry, I didn't give enough information, so I'm adding to it. The SolrJ client is in our webapp and the documents are getting indexed properly into Solr. The only problem we are seeing is that with SolrJ 4.10, once the Solr server's response comes back, the SolrJ client doesn't seem to know what to do with that response and reports the exception I mentioned. I then downgraded the SolrJ client to 4.9 and the exception is now gone. I'm using the following relevant libraries:

Java 7u67 64-bit, on both the webapp client side and Jetty's side
HTTP client/mime 4.3.5
HTTP core 4.3.2

Here is a listing of my modified Solr war lib folder; I usually don't stay with the standard jars because I believe most of them are out of date if you are running JDK 7u55+:

antlr-runtime-3.5.jar
asm-4.2.jar
asm-commons-4.2.jar
commons-cli-1.2.jar
commons-codec-1.9.jar
commons-configuration-1.9.jar
commons-fileupload-1.3.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
concurrentlinkedhashmap-lru-1.4.jar
dom4j-1.6.1.jar
guava-18.0.jar
hadoop-annotations-2.2.0.jar
hadoop-auth-2.2.0.jar
hadoop-common-2.2.0.jar
hadoop-hdfs-2.2.0.jar
hppc-0.5.2.jar
httpclient-4.3.5.jar
httpcore-4.3.2.jar
httpmime-4.3.5.jar
joda-time-2.2.jar
lucene-analyzers-common-4.10.0.jar
lucene-analyzers-kuromoji-4.10.0.jar
lucene-analyzers-phonetic-4.10.0.jar
lucene-codecs-4.10.0.jar
lucene-core-4.10.0.jar
lucene-expressions-4.10.0.jar
lucene-grouping-4.10.0.jar
lucene-highlighter-4.10.0.jar
lucene-join-4.10.0.jar
lucene-memory-4.10.0.jar
lucene-misc-4.10.0.jar
lucene-queries-4.10.0.jar
lucene-queryparser-4.10.0.jar
lucene-spatial-4.10.0.jar
lucene-suggest-4.10.0.jar
noggit-0.5.jar
org.restlet-2.1.1.jar
org.restlet.ext.servlet-2.1.1.jar
protobuf-java-2.6.0.jar
solr-core-4.10.0.jar
solr-solrj-4.10.0.jar
spatial4j-0.4.1.jar
wstx-asl-3.2.7.jar
zookeeper-3.4.6.jar

Best regards,
Guido.
On 05/09/14 09:42, Guido Medina wrote:
> [original message with full stack trace quoted in full, trimmed]
Re: FAST-like document vector data structures in Solr?
Hello Jim,

yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently mapping docvectors to these mechanisms and creating term vectors myself from third-party text mining components. However, it's not quite like the FAST docvectors. Particularly, the performance of MoreLikeThis queries based on TermVectors is suboptimal on large document sets, so more efficient support for such retrievals in the Lucene kernel would be preferred.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:
> [previous reply and original question quoted in full, trimmed]

--
Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center Intelligence, Senior Cloud Consultant
Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Why do one big commit? You could do hard commits along the way but keep the searcher open and not see the changes until the end. Obviously a separate issue from the memory consumption discussion, but I thought I'd add it anyway.

Regards,
Alex

On 05/09/2014 3:30 am, "Li, Ryan" <ryan...@sensis.com.au> wrote:
> [earlier message quoted in full, trimmed]
statuscode list
Hi,

If I'm correct, you will get statuscode=0 in the response if you use XML messages for updating the Solr index. Is there a list of the other possible status codes you can receive in case anything fails, and of what these error codes mean?

THNX,
Jan.
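For what it's worth, a non-zero status in the update response generally matches the HTTP error code Solr returns (e.g. 400 for a bad request, 500 for a server error), with the details in the accompanying error message rather than in an enumerated status list. A small Python sketch of extracting the status from an XML update response (the sample response below is made up):

```python
import xml.etree.ElementTree as ET

# A typical XML update response from Solr (sample, abbreviated).
response = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">12</int>
  </lst>
</response>"""

root = ET.fromstring(response)
# status 0 means success; on failure Solr reports the HTTP error code here
# and includes a human-readable message elsewhere in the response.
status = int(root.find("./lst[@name='responseHeader']/int[@name='status']").text)
print(status)  # 0
```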
Re: Solr API for getting shard's leader/replica status
Thanks for the comments!! I found out the solution for getting the replica's state. Here's the piece of code:

    while (iter.hasNext()) {
        Slice slice = iter.next();
        for (Replica replica : slice.getReplicas()) {
            System.out.println("replica state for " + replica.getStr("core") + " : " + replica.getStr("state"));
            System.out.println(slice.getName());
            System.out.println(slice.getState());
        }
    }

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-API-for-getting-shard-s-leader-replica-status-tp4156902p4157108.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
On Fri, Sep 5, 2014 at 3:22 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Why do one big commit? You could do hard commits along the way but keep the searcher open and not see the changes until the end.

Alexandre, I don't think that can happen; the next search picks up the new searcher.

Ryan, the commit schedule is usually dictated by application requirements, i.e. when updates need to become visible. Memory consumption is governed by ramBufferSizeMB and maxIndexingThreads. Exceeding the buffer causes a flush to disk, but does not trigger a commit.

> Obviously a separate issue from the memory consumption discussion, but I thought I'd add it anyway.
>
> Regards,
> Alex
>
> On 05/09/2014 3:30 am, "Li, Ryan" <ryan...@sensis.com.au> wrote:
> > [earlier message quoted in full, trimmed]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: FAST-like document vector data structures in Solr?
For reference:

"Item Similarity Vector Reference

This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. The value is a string formatted according to the following format:

    [string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of term,weight expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight."

See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky

From: Jürgen Wagner (DVT)
Sent: Friday, September 5, 2014 7:03 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

> [previous reply and original question quoted in full, trimmed]
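Given the format quoted above, a docvector string is straightforward to handle outside the engine. A hedged Python sketch that parses the [term,weight] pairs and computes cosine similarity between two items (the terms and weights are invented, and terms containing commas or brackets are not handled):

```python
import math
import re

def parse_docvector(s):
    """Parse a FAST-style docvector string '[term,weight][term,weight]...'
    into a dict mapping term -> float weight."""
    return {term: float(weight)
            for term, weight in re.findall(r"\[([^,\]]+),([^\]]+)\]", s)}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values())) *
            math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

doc1 = parse_docvector("[solr,0.9][search,0.7][lucene,0.5]")
doc2 = parse_docvector("[solr,0.8][indexing,0.6][lucene,0.4]")
print(round(cosine_similarity(doc1, doc2), 3))  # 0.686
```

Doing this client-side over a result page is cheap; the hard part, as discussed in this thread, is making the engine itself retrieve by such a similarity reference efficiently.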
Re: FAST-like document vector data structures in Solr?
Jürgen,
I can't quite get it. Can you tell us more about this feature, or point to the docs? Thanks.

On Fri, Sep 5, 2014 at 11:44 AM, Jürgen Wagner (DVT) <juergen.wag...@devoteam.com> wrote:
> [original question quoted in full, trimmed]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
How to implement multilingual word components fields schema?
Hello.

We have documents with multilingual words consisting of parts in different languages, and search queries of the same complexity. It is a worldwide-used online application, so users generate content in all possible world languages. For example:

言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How do we implement a schema with universal stemming/lemmatization, which would probably utilize the ICU-generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the Basistech schema of their commercial plugins, and it defines the tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.
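For concreteness, the script-aware universal chain mentioned above (ICUTokenizer plus ICUFoldingFilter, both shipped in Solr's analysis-extras module) would be declared roughly like this in schema.xml; the stemming/lemmatization step is deliberately absent, which is exactly the open question:

```xml
<fieldType name="text_universal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization across scripts (requires the analysis-extras ICU jars) -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Case folding plus accent/diacritic normalization for all languages -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

The field type name here is invented; any per-script stemming filter would have to be inserted after the folding filter.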
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
On Fri, Sep 5, 2014 at 9:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Why do one big commit? You could do hard commits along the way but keep searcher open and not see the changes until the end. Alexandre, I don't think that can happen; the next search picks up the new searcher. Why not? Isn't that what the Solr example configuration is doing at: https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/solr/collection1/conf/solrconfig.xml#L386 ? Hard commit does not reopen the searcher. The soft commit does (further down), but that can be disabled to get the effect I am proposing. What am I missing? Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
Re: How to implement multilingual word components fields schema?
It comes down to how you personally want to value compromises between conflicting requirements, such as relative weighting of false positives and false negatives. Provide a few use cases that illustrate the boundary cases that you care most about. For example field values that have snippets in one language embedded within larger values in a different language. And, whether your fields are always long or sometimes short - the former can work well for language detection, but not the latter, unless all fields of a given document are always in the same language. Otherwise simply index the same source text in multiple fields, one for each language. You can then do a dismax query on that set of fields. -- Jack Krupansky -Original Message- From: Ilia Sretenskii Sent: Friday, September 5, 2014 10:06 AM To: solr-user@lucene.apache.org Subject: How to implement multilingual word components fields schema? Hello. We have documents with multilingual words which consist of different languages parts and seach queries of the same complexity, and it is a worldwide used online application, so users generate content in all the possible world languages. For example: 言語-aware Løgismose-alike ຄໍາຮ້ອງສະຫມັກ-dependent So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How to implement a schema with universal stemming/lemmatization which would probably utilize the ICU generated token script attribute? http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html By the way, I have already examined the Basistech schema of their commercial plugins and it defines tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts. Please advise how to address this task. Sincerely, Ilia Sretenskii.
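Jack's multiple-fields approach might be sketched like this in schema.xml (the field names, source field, and per-language analyzer types are illustrative, not from the thread): copy the same source text into one field per language, then query across all of them with dismax/edismax:

```xml
<!-- Sketch: the same source text analyzed once per language -->
<field name="text_src" type="string"  indexed="false" stored="true"/>
<field name="text_en"  type="text_en" indexed="true"  stored="false"/>
<field name="text_ja"  type="text_ja" indexed="true"  stored="false"/>
<copyField source="text_src" dest="text_en"/>
<copyField source="text_src" dest="text_ja"/>
```

A matching query would then be something along the lines of `q=...&defType=edismax&qf=text_en text_ja`, letting the best-scoring per-language analysis win.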
Is there any sentence tokenizers in sold 4.9.0?
Hi, I was looking out the options for sentence tokenizers default in solr but could not find it. Does any one used? Integrated from any other language tokenizers to solr. Example python etc.. Please let me know. Thanks and regards, Sandeep
Re: FAST-like document vector data structures in Solr?
Thanks for posting this. I was just about to send off a message of similar content :-) Important to add: - In FAST ESP, you could have more than one such docvector associated with a document, in order to reflect different metrics. - Term weights in docvectors are document-relative, not absolute. - Processing is done in the search processor (close to the index), not in the QR server (providing transformations on the result list). This docvector could be used for unsupervised clustering, related-to/similarity search, tag clouds or more weird stuff like identifying experts on topics contained in a particular document. With Solr, it seems I have to handcraft the term vectors to reflect the right weights, to approximate the effect of FAST docvectors, e.g., by normalizing them to [0...1). Processing performance would still be different from the classical FAST docvectors. The space consumption may become ugly for a 200+ GB range shard; however, FAST has also been quite generous with disk space anyway. So, the interesting question is whether there is a more canonical way of handling this in Solr/Lucene, or if something like it is planned for 5.0+. Best regards, --Jürgen On 05.09.2014 16:02, Jack Krupansky wrote: For reference: “Item Similarity Vector Reference This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. The value is a string formatted according to the following format: [string1,weight1][string2,weight2]...[stringN,weightN] When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of term,weight expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). 
Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.” See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx -- Jack Krupansky
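To get a feel for what approximating this in application code involves, here is a small sketch (plain Python, not a Solr or FAST API) that parses the `[term,weight][term,weight]...` docvector string format quoted above and compares two vectors by cosine similarity:

```python
import math
import re

def parse_docvector(s):
    """Parse a FAST-style docvector string '[t1,w1][t2,w2]...' into a term->weight dict."""
    return {term: float(weight)
            for term, weight in re.findall(r"\[([^,\]]+),([^\]]+)\]", s)}

def cosine(a, b):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc1 = parse_docvector("[airline,0.9][crash,0.7][blackbox,0.4]")
doc2 = parse_docvector("[airline,0.8][crash,0.6]")
print(round(cosine(doc1, doc2), 3))
```

This only illustrates the data model; doing the same thing index-side at scale (the "search processor" placement Jürgen describes) is precisely the part Solr does not provide out of the box.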
Re: FAST-like document vector data structures in Solr?
Sounds like a great feature to add to Solr, especially if it would facilitate more automatic relevancy enhancement. LucidWorks Search has a feature called unsupervised feedback that does that but something like a docvector might make it a more realistic default. -- Jack Krupansky -Original Message- From: Jürgen Wagner (DVT) Sent: Friday, September 5, 2014 10:29 AM To: solr-user@lucene.apache.org Subject: Re: FAST-like document vector data structures in Solr? Thanks for posting this. I was just about to send off a message of similar content :-) Important to add: - In FAST ESP, you could have more than one such docvector associated with a document, in order to reflect different metrics. - Term weights in docvectors are document-relative, not absolute. - Processing is done in the search processor (close to the index), not in the QR server (providing transformations on the result list). This docvector could be used for unsupervised clustering, related-to/similarity search, tag clouds or more weird stuff like identifying experts on topics contained in a particular document. With Solr, it seems I have to handcraft the term vectors to reflect the right weights, to approximate the effect of FAST docvectors, e.g., by normalizing them to [0...1). Processing performance would still be different from the classical FAST docvectors. The space consumption may become ugly for a 200+ GB range shard; however, FAST has also been quite generous with disk space anyway. So, the interesting question is whether there is a more canonical way of handling this in Solr/Lucene, or if something like it is planned for 5.0+. Best regards, --Jürgen On 05.09.2014 16:02, Jack Krupansky wrote: For reference: “Item Similarity Vector Reference This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. 
The value is a string formatted according to the following format: [string1,weight1][string2,weight2]...[stringN,weightN] When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of term,weight expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.” See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx -- Jack Krupansky
Re: Is there any sentence tokenizers in sold 4.9.0?
Sorry for typo it is solr 4.9.0 instead of sold 4.9.0 On Sep 5, 2014 7:48 PM, Sandeep B A belgavi.sand...@gmail.com wrote: Hi, I was looking out the options for sentence tokenizers default in solr but could not find it. Does any one used? Integrated from any other language tokenizers to solr. Example python etc.. Please let me know. Thanks and regards, Sandeep
Re: Query ReRanking question
Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date beforehand then the relevancy is lost. So I want to get the Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask why would I want to do that? Let's take an example about the Malaysian airline crash. Several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the recent developments on top, i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time. Hope I am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foo&sort=date+desc&rq={!rerank reRankDocs=1000 reRankQuery=$myquery}&myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
RE: Query ReRanking question
Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Alexandre: It Depends (tm) of course. It all hinges on the setting in autocommit, whether openSearcher is true or false. In the former case, you, well, open a new searcher. In the latter you don't. I agree, though, this is all tangential to the memory consumption issue since the RAM buffer will be flushed regardless of these settings. FWIW, Erick On Fri, Sep 5, 2014 at 7:11 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: On Fri, Sep 5, 2014 at 9:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Why do one big commit? You could do hard commits along the way but keep searcher open and not see the changes until the end. Alexandre, I don't think it's can happen in solr-user list, next search pickups the new searcher. Why not? Isn't that what the Solr example configuration doing at: https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/solr/collection1/conf/solrconfig.xml#L386 ? Hard commit does not reopen the searcher. The soft commit does (further down), but that can be disabled to get the effect I am proposing. What am I missing? Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
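The setting Erick is referring to is the openSearcher flag on autoCommit; a solrconfig.xml sketch (the interval values are illustrative):

```xml
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher> <!-- flush the RAM buffer to disk, keep the current searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>-1</maxTime> <!-- disabled: no new searcher until an explicit commit at the end -->
</autoSoftCommit>
```

With openSearcher=false, hard commits bound memory use and transaction-log size during a long indexing run without making the in-flight changes visible.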
Re: Query ReRanking question
OK, why can't you switch the clauses from Joel's suggestion? Something like: q=Malaysia plane crash&rq={!rerank reRankDocs=1000 reRankQuery=$myquery}&myquery=*:*&sort=date+desc (haven't tried this yet, but you get the idea). Best, Erick On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. 
Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
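Spelled out with explicit parameter separators (the archive tends to swallow the ampersands) and with the parameter name as documented (reRankDocs, not "reRandDocs" as typed in the quoted examples), the suggestion is a request along these lines — collection name and reRankWeight are illustrative, and per Erick's own caveat this is untested:

```
http://localhost:8983/solr/collection1/select
  ?q=malaysia airline crash blackbox
  &rq={!rerank reRankQuery=$myquery reRankDocs=1000 reRankWeight=3}
  &myquery=*:*
  &sort=date desc
```

Whether this actually keeps relevance ahead of the date ordering is exactly the point debated in the rest of the thread.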
Re: Query ReRanking question
Boosting on recency is probably a better approach. A fixed re-ranking horizon will always be a compromise, a guess at the precision of the query. It will give poor results for queries that are more or less specific than the assumption. Think of the recency boost as a tie-breaker. When documents are similar in relevance, show the most recent. This can work over a wide range of queries. For “malaysian airlines crash”, there are two sets of relevant documents, one set on MH 370 starting six months ago, and one set on MH 17, two months ago. But four hours ago, The Guardian published a “six months on” article on MH 370. A recency boost will handle that complexity. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Sep 5, 2014, at 10:23 AM, Erick Erickson erickerick...@gmail.com wrote: OK, why can't you switch the clauses from Joel's suggestion? Something like: q=Malaysia plane crashrq={!rerank reRankDocs=1000 reRankQuery=$myquery}myquery=*:*sort=date+desc (haven't tried this yet, but you get the idea). Best, Erick On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. 
several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
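The recency boost Walter and Markus describe is commonly implemented with a reciprocal function of document age; a request sketch (the field name `publish_date` and the constants are illustrative — 3.16e-11 is roughly 1/(one year in milliseconds)):

```
q=malaysia airline crash blackbox
&defType=edismax
&boost=recip(ms(NOW/HOUR,publish_date),3.16e-11,1,1)
```

Since recip(x,m,a,b) computes a/(m*x+b), a document published now gets a boost near 1 and a year-old document about 0.5, so recency acts as a smooth tie-breaker rather than a hard sort.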
Re: Edismax mm and efficiency
Great! We have some very long queries, where students paste entire homework problems. One of them was 1051 words. Many of them are over 100 words. This could help. In the Jira discussion, I saw some comments about handling the most sparse lists first. We did something like that in the Infoseek Ultra engine about twenty years ago. Short termlists (documents matching a term) were processed first, which kept the in-memory lists of matching docs small. It also allowed early short-circuiting for no-hits queries. What would be a high mm value, 75%? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Sep 4, 2014, at 11:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: indeed https://issues.apache.org/jira/browse/LUCENE-4571 my feeling is it gives a significant gain in mm high values. On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood wun...@wunderwood.org wrote: Are there any speed advantages to using “mm”? I can imagine pruning the set of matching documents early, which could help, but is that (or something else) done? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: SolrJ 4.10.0 errors
On 9/5/2014 3:50 AM, Guido Medina wrote: Sorry I didn't give enough information, so I'm adding to it: the SolrJ client is on our webapp and the documents are getting indexed properly into Solr. The only problem we are seeing is that with SolrJ 4.10, once the Solr server response comes back, the SolrJ client doesn't seem to know what to do with the response and reports the exception I mentioned. I then downgraded the SolrJ client to 4.9 and the exception is now gone. I'm using the following relevant libraries: Java 7u67 64 bits on both the webapp client side and Jetty's side, HTTP client/mime 4.3.5, HTTP core 4.3.2. Here is a list of my Solr war modified lib folder; I usually don't stay with the standard jars because I believe most of them are out of date if you are running a JDK 7u55+: You're in uncharted territory if you're going to modify the jars included with Solr itself. We do upgrade these from time to time, and usually it's completely harmless, but we also run all the tests when we do it, to make sure that nothing will get broken. Some of the components are on specific versions because upgrading them isn't as simple as simply changing the jar. What happens if you return Solr to what's in the release war? Thanks, Shawn
RE: How to implement multilingual word components fields schema?
Agree with the approach Jack suggested to use same source text in multiple fields for each language and then doing a dismax query. Would love to hear if it works for you? Thanks, Susheel -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, September 05, 2014 10:21 AM To: solr-user@lucene.apache.org Subject: Re: How to implement multilingual word components fields schema? It comes down to how you personally want to value compromises between conflicting requirements, such as relative weighting of false positives and false negatives. Provide a few use cases that illustrate the boundary cases that you care most about. For example field values that have snippets in one language embedded within larger values in a different language. And, whether your fields are always long or sometimes short - the former can work well for language detection, but not the latter, unless all fields of a given document are always in the same language. Otherwise simply index the same source text in multiple fields, one for each language. You can then do a dismax query on that set of fields. -- Jack Krupansky -Original Message- From: Ilia Sretenskii Sent: Friday, September 5, 2014 10:06 AM To: solr-user@lucene.apache.org Subject: How to implement multilingual word components fields schema? Hello. We have documents with multilingual words which consist of different languages parts and seach queries of the same complexity, and it is a worldwide used online application, so users generate content in all the possible world languages. For example: 言語-aware Løgismose-alike ຄໍາຮ້ອງສະຫມັກ-dependent So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How to implement a schema with universal stemming/lemmatization which would probably utilize the ICU generated token script attribute? 
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html By the way, I have already examined the Basistech schema of their commercial plugins and it defines tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts. Please advise how to address this task. Sincerely, Ilia Sretenskii. This e-mail message may contain confidential or legally privileged information and is intended only for the use of the intended recipient(s). Any unauthorized disclosure, dissemination, distribution, copying or the taking of any action in reliance on the information herein is prohibited. E-mails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, or contain viruses. Anyone who communicates with us by e-mail is deemed to have accepted these risks. The Digital Group is not responsible for errors or omissions in this message and denies any responsibility for any damage arising from the use of e-mail. Any opinion defamatory or deemed to be defamatory or any material which could be reasonably branded to be a species of plagiarism and other statements contained in this message and any attachment are solely those of the author and do not necessarily represent those of the company.
Re: How to implement multilingual word components fields schema?
Hi Ilia, I don't know if it would be helpful but below I've listed some academic papers on this issue of how best to deal with mixed language/mixed script queries and documents. They are probably taking a more complex approach than you will want to use, but perhaps they will help to think about the various ways of approaching the problem. We haven't tackled this problem yet. We have over 200 languages. Currently we are using the ICUTokenizer and ICUFolding filter but don't do any stemming due to a concern with overstemming (we have very high recall, so don't want to hurt precision by stemming) and the difficulty of correct language identification of short queries. If you have languages where there is only one language per script however, you might be able to do much more. I'm not sure if I'm remembering correctly but I believe some of the stemmers such as the Greek stemmer will pass through any strings that don't contain characters in the Greek script. So it might be possible to at least do stemming on some of your languages/scripts. I'll be very interested to learn what approach you end up using. Tom -- Some papers: Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and weighting of multilingual and mixed documents. In *Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA, 161-170. DOI=10.1145/2072221.2072240 http://doi.acm.org/10.1145/2072221.2072240 That paper and some others are here: http://www.husseinsspace.com/research/students/mohammedmustafaali.html There is also some code from this article: Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso. 2014. Query expansion for mixed-script information retrieval. In *Proceedings of the 37th international ACM SIGIR conference on Research development in information retrieval* (SIGIR '14). 
ACM, New York, NY, USA, 677-686. DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622 Code: http://users.dsic.upv.es/~pgupta/mixed-script-ir.html Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search On Fri, Sep 5, 2014 at 10:06 AM, Ilia Sretenskii sreten...@multivi.ru wrote: Hello. We have documents with multilingual words which consist of different languages parts and seach queries of the same complexity, and it is a worldwide used online application, so users generate content in all the possible world languages. For example: 言語-aware Løgismose-alike ຄໍາຮ້ອງສະຫມັກ-dependent So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How to implement a schema with universal stemming/lemmatization which would probably utilize the ICU generated token script attribute? http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html By the way, I have already examined the Basistech schema of their commercial plugins and it defines tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts. Please advise how to address this task. Sincerely, Ilia Sretenskii.
RE: Is there any sentence tokenizers in sold 4.9.0?
There is SmartChineseSentenceTokenizerFactory or SentenceTokenizer, which is being deprecated and replaced with HMMChineseTokenizer. I am not aware of other tokenizers, but you may either build your own similar to SentenceTokenizer, or employ an external sentence detector/recognizer and build a Solr tokenizer on top of it. I don't know how complex your use case is, but I would suggest looking at SentenceTokenizer and creating a similar tokenizer. Thanks, Susheel -Original Message- From: Sandeep B A [mailto:belgavi.sand...@gmail.com] Sent: Friday, September 05, 2014 10:40 AM To: solr-user@lucene.apache.org Subject: Re: Is there any sentence tokenizers in sold 4.9.0? Sorry for typo it is solr 4.9.0 instead of sold 4.9.0 On Sep 5, 2014 7:48 PM, Sandeep B A belgavi.sand...@gmail.com wrote: Hi, I was looking out the options for sentence tokenizers default in solr but could not find it. Does any one used? Integrated from any other language tokenizers to solr. Example python etc.. Please let me know. Thanks and regards, Sandeep
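Since the original question mentions doing this from Python, another pragmatic option is to split sentences outside Solr before indexing, e.g. one Solr document per sentence. A minimal sketch (a naive regex splitter, not a linguistic one — a trained model such as NLTK's punkt would handle abbreviations far better):

```python
import re

def split_sentences(text):
    """Naive split: break after . ! or ? when followed by whitespace and a capital letter."""
    return [s.strip()
            for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
            if s.strip()]

doc = "Solr is a search server. It builds on Lucene! Does it split sentences? Not by default."
for sentence in split_sentences(doc):
    print(sentence)
```

Each resulting sentence can then be posted to Solr as its own document or multivalued field entry, sidestepping the need for a custom Lucene tokenizer.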
Re: Query ReRanking question
Erick, I believe when you apply sort this way it runs the query and sort first and then tries to rerank, so it has basically already lost the true relevancy because the sort takes precedence. Am I making sense? Ravi Kiran Bhaskar On Fri, Sep 5, 2014 at 1:23 PM, Erick Erickson erickerick...@gmail.com wrote: OK, why can't you switch the clauses from Joel's suggestion? Something like: q=Malaysia plane crashrq={!rerank reRankDocs=1000 reRankQuery=$myquery}myquery=*:*sort=date+desc (haven't tried this yet, but you get the idea). Best, Erick On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. 
Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
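The parameter strings in this thread lose their separators in the archive; written out with explicit & separators and proper URL encoding, the rerank request looks like the sketch below. The host, collection name, and the relevance query are illustrative assumptions, not from the thread.

```python
from urllib.parse import urlencode

# Build the rerank request discussed above: a main query sorted by
# date, with the top 1000 docs rescored by a relevance query passed
# via parameter substitution ($myquery).
params = {
    "q": "foo",
    "sort": "date desc",
    "rq": "{!rerank reRankDocs=1000 reRankQuery=$myquery}",
    "myquery": "malaysia airline crash blackbox",
    "wt": "json",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

urlencode handles the characters ({, !, $, =) that make these queries awkward to paste into curl by hand.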
Re: Query ReRanking question
Walter, thank you for the valuable insight. The problem I am facing is that between the term frequencies, mm, date boost, and stemming, the results can become very inconsistent. Look at the following examples.

Here the chronology is all over the place because of what I mentioned above:
http://www.washingtonpost.com/pb/newssearch/?query=malaysian+airline+crash

Now take the instance of an old topic/news item that was covered a while ago for a period of time but not actively updated recently. In this case the date boosting predominantly takes over because of common terms, and we get a rash of irrelevant content:
http://www.washingtonpost.com/pb/newssearch/?query=faces+of+the+fallen

This has become such a balancing act, and hence I was looking to see if reranking might help.

Thanks
Ravi Kiran Bhaskar

On Fri, Sep 5, 2014 at 1:32 PM, Walter Underwood wun...@wunderwood.org wrote:

Boosting on recency is probably a better approach. A fixed re-ranking horizon will always be a compromise, a guess at the precision of the query. It will give poor results for queries that are more or less specific than the assumption.

Think of the recency boost as a tie-breaker. When documents are similar in relevance, show the most recent. This can work over a wide range of queries.

For "malaysian airlines crash", there are two sets of relevant documents, one set on MH 370 starting six months ago, and one set on MH 17, two months ago. But four hours ago, The Guardian published a "six months on" article on MH 370. A recency boost will handle that complexity.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Sep 5, 2014, at 10:23 AM, Erick Erickson erickerick...@gmail.com wrote:

OK, why can't you switch the clauses from Joel's suggestion? Something like:

q=Malaysia plane crash&rq={!rerank reRankDocs=1000 reRankQuery=$myquery}&myquery=*:*&sort=date+desc

(haven't tried this yet, but you get the idea).

Best,
Erick
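The recency boost Walter and Markus recommend is commonly expressed with edismax's multiplicative boost and Solr's recip/ms function queries. A minimal sketch, assuming illustrative field names (publish_date, title, body); the constant 3.16e-11 is roughly 1 divided by the milliseconds in a year, so a year-old document's boost is about half that of a brand-new one.

```python
from urllib.parse import urlencode

# Multiplicative recency boost: recip(x,m,a,b) = a / (m*x + b),
# where x = ms(NOW, publish_date) is document age in milliseconds.
# With m = 3.16e-11, the boost decays to ~0.5 after one year,
# acting as the relevance "tie-breaker" described above.
params = {
    "defType": "edismax",
    "q": "malaysia airline crash blackbox",
    "qf": "title body",
    "boost": "recip(ms(NOW,publish_date),3.16e-11,1,1)",
    "wt": "json",
}
query_string = urlencode(params)
print(query_string)
```

Because the boost is multiplicative rather than a hard sort, relevance still dominates; recency only reorders documents of similar score.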
How to solve?
We have a core where each document is a person. We want to boost based on sweater color, but if the person has sweaters in their closet from the same manufacturer, we want to boost even more by adding those values together.

Peter Smit - Sweater: Blue = 1 : Nike, Sweater: Red = 2 : Nike, Sweater: Blue = 1 : Polo
Tony S - Sweater: Red = 2 : Nike
Bill O - Sweater: Red = 2 : Polo, Blue = 1 : Polo

Scores:
Peter Smit - 1 + 2 = 3
Tony S - 2
Bill O - 2 + 1

I thought about using payloads:

sweaters_payload
Blue: Nike: 1
Red: Nike: 2
Blue: Polo: 1

How do I query this?

http://localhost:8983/solr/persons?q=*:*&sort=??

Ideas?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076