Re: Caused by: org.noggit.JSONParser$ParseException: Expected ',' or '}': char=",position=312 BEFORE='ssions"

2017-04-25 Thread Fuad Efendi
Yes, absolutely correct: the comma is missing at the end of line 10.

All key-value pairs inside the same JSON object should be comma-separated, except
the last one.



From: Shawn Heisey  
Reply: solr-user@lucene.apache.org 

Date: April 25, 2017 at 2:29:03 PM
To: solr-user@lucene.apache.org 

Subject:  Re: Caused by: org.noggit.JSONParser$ParseException: Expected ','
or '}': char=",position=312 BEFORE='ssions"

On 4/25/2017 12:10 PM, bay chae wrote:
>
https://stackoverflow.com/questions/43618000/solr-standalone-basicauth-org-noggit-jsonparserparseexception

>
> Hi I am following guides on security.json in
https://cwiki.apache.org/confluence/display/solr/Rule-Based+Authorization+Plugin.

>
> But when solr starts up I am getting:
>
> Caused by: org.noggit.JSONParser$ParseException: Expected ',' or '}':
char=",position=312 BEFORE='ssions":[{"name":"security-edit",
"role":"admin"}] "' AFTER='user-role":{"solr":"admin"} }}

Looks like the JSON on that documentation page is incorrect, and has
been wrong for a very long time. It doesn't validate when run through a
JSON validator. If I add a comma at the end of line 10 (just before
"user-role"), then it validates. I do not know whether this is the
correct fix, but I think it probably is.
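For reference, with that comma added the authorization section would presumably
look like the snippet below (reconstructed from the BEFORE/AFTER fragments in the
exception above; the authentication section of security.json is omitted here):

  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [{"name": "security-edit", "role": "admin"}],
    "user-role": {"solr": "admin"}
  }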

Before I update the documentation, I would like somebody who's familiar
with this file to tell me whether I've got the right fix.

Thanks,
Shawn


Re: CPU Intensive Scoring Alternatives

2017-02-21 Thread Fuad Efendi
Walter, I use BM25, which is the default for Solr 6.3, and in the Solr logs I can
clearly see a correlation between the number of hits and the response times; it
is almost linear, even with an underloaded system.

With “solrmeter” at 10 requests per second, CPU goes to 400% on a
12-core hyperthreaded machine, and at 20 requests per second it goes to 1100%.
No issues with GC. Java 8 update 121 from Oracle, 64-bit. Only 20 requests per second
to Solr 6 - kidding? I never expected that for the simplest queries.

Doug, I have never been able to make the “mm” parameter work for me; I cannot
understand how it works. I use eDisMax with a few “text_general” fields, the
default Solr operator “OR”, and the default “mm” (which should be “1” for
“OR”).
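(For reference, the parameter is just set on the request; a sketch with
illustrative field names, not from this setup:

  q=Michael Jackson&defType=edismax&qf=name_t description_t&q.op=OR&mm=2<75%

meaning: all terms are required for queries of up to two terms, and 75% of the
terms are required on longer queries.)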




From: Walter Underwood <wun...@wunderwood.org> <wun...@wunderwood.org>
Reply: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>
Date: February 21, 2017 at 5:24:23 PM
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>
Subject:  Re: CPU Intensive Scoring Alternatives

300 ms seems pretty good for 200 million documents. Is that average?
Median? 95th percentile?

Why are you sure it is because the huge number of hits? That would be
unusual. The size of the posting lists is a more common cause.

Why do you think it is caused by tf.idf? That should be faster than BM25.

Does host have enough RAM to hold most or all of the index in file buffers?

What are the hit rates on your caches?

Are you using fuzzy matches? N-gram prefix matching? Phrase matching?
Shingles?

What version of Java are you running? What garbage collector?

wunder
Walter Underwood
wun...@wunderwood.org <mailto:wun...@wunderwood.org>
http://observer.wunderwood.org/ (my blog)


> On Feb 21, 2017, at 10:42 AM, Doug Turnbull <
dturnb...@opensourceconnections.com > wrote:
>
> With that many documents, why not start with an AND search and reissue an
> OR query if there's no results? My strategy is to prefer an AND for large
> collections (or a higher mm than 1) and prefer closer to an OR for
smaller
> collections.
>
> -Doug
>
> On Tue, Feb 21, 2017 at 1:39 PM Fuad Efendi <f...@efendi.ca > wrote:
>
>> Thank you Ahmet, I will try it; sounds reasonable
>>
>>
>> From: Ahmet Arslan <iori...@yahoo.com.invalid > <iori...@yahoo.com.invalid >
>> Reply: solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org> <
solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org>>
>> <solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org>>,
Ahmet Arslan <iori...@yahoo.com <mailto:iori...@yahoo.com>>
>> <iori...@yahoo.com <mailto:iori...@yahoo.com>>
>> Date: February 21, 2017 at 3:02:11 AM
>> To: solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org> <
solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org>>
>> <solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org>>
>> Subject: Re: CPU Intensive Scoring Alternatives
>>
>> Hi,
>>
>> New default similarity is BM25.
>> Maybe explicitly set the similarity to tf-idf and see how it goes?
>>
>> Ahmet
>>
>>
>> On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi <f...@efendi.ca
<mailto:f...@efendi.ca>> wrote:
>> Hello,
>>
>>
>> Default TF-IDF performs poorly with 200 million documents indexed.
>> The query "Michael Jackson" may run in 300 ms, and "Michael The Jackson" takes
>> over 3 seconds. We use eDisMax. Because of the default operator "OR" and the
>> stopword "The" we get 50-70 million documents as a query result, and scoring
>> is CPU intensive. What can we do? Our typical queries return over a million
>> documents, and response times for simple queries range from 50 milliseconds
>> to 5-10 seconds depending on the result set.
>>
>> This was just an exaggerated example with the stopword “the”, but even the
>> simplest query “Michael Jackson” runs in 300 ms instead of 3 ms, just because
>> of the huge number of hits and TF-IDF calculations. Solr 6.3.
>>
>>
>> Thanks,
>>
>> --
>>
>> Fuad Efendi
>>
>> (416) 993-2060
>>
>> http://www.tokenizer.ca <http://www.tokenizer.ca/>
>> Search Relevancy, Recommender Systems
>>
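A minimal SolrJ sketch of the AND-first / OR-fallback strategy Doug describes
above (the endpoint and field name are illustrative assumptions, not from this
thread; uses the SolrJ 6.x client API):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class AndThenOrSearch {
      public static void main(String[] args) throws Exception {
          HttpSolrClient client =
              new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
          SolrQuery q = new SolrQuery("michael jackson");
          q.set("defType", "edismax");
          q.set("qf", "name_t");      // hypothetical field
          q.set("q.op", "AND");       // strict first pass
          QueryResponse rsp = client.query(q);
          if (rsp.getResults().getNumFound() == 0) {
              q.set("q.op", "OR");    // relaxed second pass only when nothing matched
              rsp = client.query(q);
          }
          System.out.println("hits=" + rsp.getResults().getNumFound());
          client.close();
      }
  }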


Re: CPU Intensive Scoring Alternatives

2017-02-21 Thread Fuad Efendi
Thank you Ahmet, I will try it; sounds reasonable


From: Ahmet Arslan <iori...@yahoo.com.invalid> <iori...@yahoo.com.invalid>
Reply: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>, Ahmet Arslan <iori...@yahoo.com>
<iori...@yahoo.com>
Date: February 21, 2017 at 3:02:11 AM
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>
Subject:  Re: CPU Intensive Scoring Alternatives

Hi,

The new default similarity is BM25.
Maybe explicitly set the similarity to tf-idf and see how it goes?

Ahmet
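For reference, switching the whole index back to tf-idf scoring can be done with
the classic similarity factory in the schema, roughly like this (a sketch; it can
also be set per field type):

  <similarity class="solr.ClassicSimilarityFactory"/>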


On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi <f...@efendi.ca> wrote:
Hello,


Default TF-IDF performs poorly with 200 million documents indexed.
The query "Michael Jackson" may run in 300 ms, and "Michael The Jackson" takes over 3
seconds. We use eDisMax. Because of the default operator "OR" and the stopword "The" we get
50-70 million documents as a query result, and scoring is CPU intensive.
What can we do? Our typical queries return over a million documents, and response
times for simple queries range from 50 milliseconds to 5-10 seconds
depending on the result set.

This was just an exaggerated example with the stopword “the”, but even the simplest
query “Michael Jackson” runs in 300 ms instead of 3 ms, just because of the huge number
of hits and TF-IDF calculations. Solr 6.3.


Thanks,

-- 

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems


CPU Intensive Scoring Alternatives

2017-02-20 Thread Fuad Efendi
Hello,


Default TF-IDF performs poorly with 200 million documents indexed.
The query "Michael Jackson" may run in 300 ms, and "Michael The Jackson" takes over 3
seconds. We use eDisMax. Because of the default operator "OR" and the stopword "The" we get
50-70 million documents as a query result, and scoring is CPU intensive.
What can we do? Our typical queries return over a million documents, and response
times for simple queries range from 50 milliseconds to 5-10 seconds
depending on the result set.

This was just an exaggerated example with the stopword “the”, but even the simplest
query “Michael Jackson” runs in 300 ms instead of 3 ms, just because of the huge number
of hits and TF-IDF calculations. Solr 6.3.


Thanks,

--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems


Re: Solr 5.5.0 MSSQL Datasource Example

2017-02-07 Thread Fuad Efendi
Perhaps this answers your question:


http://stackoverflow.com/questions/27418875/microsoft-sqlserver-driver-datasource-have-password-empty


Try a different one as per the Eclipse docs,

http://www.eclipse.org/jetty/documentation/9.4.x/jndi-datasource-examples.html




 

<New id="DSTest" class="org.eclipse.jetty.plus.jndi.Resource">
  <Arg></Arg>
  <Arg>jdbc/DSTest</Arg>
  <Arg>
    <New class="com.microsoft.sqlserver.jdbc.SQLServerDataSource">
      <Set name="User">user</Set>
      <Set name="Password">pass</Set>
      <Set name="DatabaseName">dbname</Set>
      <Set name="ServerName">localhost</Set>
      <Set name="PortNumber">1433</Set>
    </New>
  </Arg>
</New>
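To avoid configuring the database in every core (as asked below), the DIH data
source can then reference the JNDI name instead of credentials; a sketch (the
exact JNDI path depends on how the resource is scoped in Jetty):

  <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/DSTest" />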



 






--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems


From: Per Newgro <per.new...@gmx.ch> <per.new...@gmx.ch>
Reply: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>
Date: February 7, 2017 at 10:15:42 AM
To: solr-user-group <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>
Subject:  Solr 5.5.0 MSSQL Datasource Example

Hello,

has someone a working example for MSSQL Datasource with 'Standard Microsoft
SQL Driver'.

My environment:
debian
Java 8
Solr 5.5.0 Standard (download and installed as service)

server/lib/ext
sqljdbc4-4.0.jar

Global JNDI resource defined
server/etc/jetty.xml


<New id="mydb" class="org.eclipse.jetty.plus.jndi.Resource">
  <Arg></Arg>
  <Arg>java:comp/env/jdbc/mydb</Arg>
  <Arg>
    <New class="com.microsoft.sqlserver.jdbc.SQLServerDataSource">
      <Set name="ServerName">ip</Set>
      <Set name="DatabaseName">mydb</Set>
      <Set name="User">user</Set>
      <Set name="Password">password</Set>
    </New>
  </Arg>
</New>




or 2nd option tried


<New id="mydb" class="org.eclipse.jetty.plus.jndi.Resource">
  <Arg></Arg>
  <Arg>java:comp/env/jdbc/mydb</Arg>
  <Arg>
    <New class="com.microsoft.sqlserver.jdbc.SQLServerDataSource">
      <Set name="URL">jdbc:sqlserver://ip;databaseName=mydb;</Set>
      <Set name="User">user</Set>
      <Set name="Password">password</Set>
    </New>
  </Arg>
</New>





collection1/conf/db-data-config.xml

<dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/mydb" />
...

This leads to SqlServerException login failed for user.
at
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216)

at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onEOF(tdsparser.java:254)
at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:84)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:2908)

at
com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:2234)

at
com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:41)

at
com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:2220)

at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715)

at
com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1326)

at
com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:991)

at
com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:827)

at
com.microsoft.sqlserver.jdbc.SQLServerDataSource.getConnectionInternal(SQLServerDataSource.java:621)

at
com.microsoft.sqlserver.jdbc.SQLServerDataSource.getConnection(SQLServerDataSource.java:57)

at
org.apache.solr.handler.dataimport.JdbcDataSource$1.getFromJndi(JdbcDataSource.java:256)

at
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:182)

at
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:172)

at
org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:463)

at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:309)

... 12 more

But when I remove the JNDI datasource and rewrite the dataimport data
source to


<dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://ip;databaseName=mydb"
user="user" password="password" />
...

Then it works.
But this way I need to configure the db in every core. I would like to
avoid that.

Thanks
Per


Re: Solr 5.3.1: Collection reload results in IndexWriter is closed exception

2017-02-07 Thread Fuad Efendi
Were you indexing new documents while reloading? “Previously we’ve done
reloads of a collection after changing solrconfig.xml without any issues.”

--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems


From: Kelly, Frank <frank.ke...@here.com> <frank.ke...@here.com>
Reply: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>
Date: February 7, 2017 at 12:19:21 PM
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
<solr-user@lucene.apache.org>
Subject:  Solr 5.3.1: Collection reload results in IndexWriter is closed
exception

Just wondering if anyone has seen this before and might understand why this
is happening

Environment:
Solr 5.3.1 in Solr Cloud (3 shards each with 3 replicas across 3 EC2 Vms)
100m documents (20+ GB index)

Previously we’ve done reloads of a collection after changing solrconfig.xml
without any issues.
This time we saw it across 3 of 3 environments, where several Solr instances
showed “IndexWriter is closed” errors and we had to stop and restart those
instances.
In our final environment we skipped the RELOAD and just did solr stop, solr
start.

The solrconfig.xml change we made was turning on the replication handler
(not sure if this has any bearing on the issue)



  96
  6
  6
   
 

Is there anything “unsafe” about reload on a collection that is handling
live traffic in that version?

Cheers!

-Frank

*Frank Kelly*

*Principal Software Engineer*

HERE

5 Wayside Rd, Burlington, MA 01803, USA

*42° 29' 7" N 71° 11' 32" W*


Re: Help with design choice: join or multiValued field

2017-02-06 Thread Fuad Efendi
Correct: a multivalued field with 10,000 shop IDs. Use case: a shopping network
in the U.S., for example for a big brand such as Walmart, where the user implicitly
provides an IP address or explicitly a postal code, so that we can find items in
his/her neighbourhood.


You basically provide the “join” information via this 10,000-sized collection
of IDs per document. It has almost no impact on index size. The user
query needs to provide the list of preferred IDs (if, for example, we know the user’s
geo location). And for this “Walmart” use case you may also need an “Available
Online Only” option, etc.
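As a rough sketch of the query side (the field name shop_ids and the id 12345
are hypothetical, not from this thread):

  fq=shop_ids:12345                        (only items available in the user's shop)
  defType=edismax&bq=shop_ids:12345^100    (keep all matches, but boost the user's shop)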


From: Karl Kildén  
Reply: solr-user@lucene.apache.org 

Date: February 6, 2017 at 5:57:41 AM
To: solr-user@lucene.apache.org 

Subject:  Help with design choice: join or multiValued field

Hello!

I have Items and I have Shops. This is an e-commerce system with items from
thousands of shops, although the inventory is often similar between shops.
Some users can shop from any shop and some only from their default one.


One item can exist in about 10,000 shops.


- When a user logs in they may have a shop pre-selected, so when they
search for items we need to get all matching documents, but if an item is found
in their pre-selected shop we should mark it in the UI.
- They need to be able to filter to only items in their current shop
- Items found in their shop should always be boosted heavily



TLDR:

Either we just have a multiValued field on the item document with all
shops. This would be a multivalued field with about 10,000 values.

Or

Could we have a new document ShopItem that has the shopId and the itemId
(think join table). Then we join this document instead... But we still need
to get the Item document back, and we need bq boosting on item?


Re: Time of insert

2017-02-06 Thread Fuad Efendi
No; a historical log of document updates is not provided. Users need to
implement such functionality themselves if needed.
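A common partial workaround is a date field populated at index time, for example
(a sketch, assuming your schema defines a date field type; note it only records
the *latest* indexing time, so the first-insert time and the update history still
have to come from your own source data):

  <field name="last_indexed" type="date" default="NOW" indexed="true" stored="true"/>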


From: Mahmoud Almokadem  
Reply: solr-user@lucene.apache.org 

Date: February 6, 2017 at 3:32:34 PM
To: solr-user@lucene.apache.org 

Subject:  Time of insert

Hello,

I'm using DIH on Solr 6 for indexing data from SQL Server. The document can
be indexed many times according to the updates on it. Is it possible to
get the first time the document was inserted into Solr?

And how can I get the dates when the document was updated?

Thanks for help,
Mahmoud


Re: How-To: Secure Solr by IP Address

2016-11-04 Thread Fuad Efendi

*Worth* mentioning: I run Solr on port 8080, and the firewall blocks *port*
8080. That is not really securing by IP address!

“block by IP” vs. “block by port number”

“block *all* services running on a machine by IP address” vs. “block only Jetty”

etc.



Still need option for Jetty, it will simplify life ;)




On November 4, 2016 at 12:05:13 PM, Fuad Efendi (f...@efendi.ca) wrote:

Yes we need that documented,

http://stackoverflow.com/questions/8924102/restricting-ip-addresses-for-jetty-and-solr


Of course Firewall is a must for extremely strong environments / large 
corporations, DMZ, and etc; IPTables is the simplest solution if you run Linux; 
my vendor 1and1.com provides firewall functionality too - but I wouldn’t trust 
it: what if local at 1and1.com servers (in the same rack for example) can 
bypass this firewall?


Having option to configure Jetty minimizes dependencies. In real production I’d 
use all possible options: firewall(s) + iptable + Jetty config + DMZ(s)


--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Search Relevancy, Recommender Systems


On November 4, 2016 at 9:28:21 AM, David Smiley (david.w.smi...@gmail.com) 
wrote:

I was just researching how to secure Solr by IP address and I finally
figured it out. Perhaps this might go in the ref guide but I'd like to
share it here anyhow. The scenario is where only "localhost" should have
full unfettered access to Solr, whereas everyone else (notably web clients)
can only access some whitelisted paths. This setup is intended for a
single instance of Solr (not a member of a cluster); the particular config
below would probably need adaptations for a cluster of Solr instances. The
technique here uses a utility with Jetty called IPAccessHandler --
http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/handler/IPAccessHandler.html
For reasons I don't know (and I did search), it was recently deprecated and
there's another InetAccessHandler (not in Solr's current version of Jetty)
but it doesn't support constraints incorporating paths, so it's a
non-option for my needs.

First, Java must be told to insist on its IPv4 stack. This is because
Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws
NPEs in my experience. In recent versions of Solr, this can be easily done
just by adding -Djava.net.preferIPv4Stack=true at the Solr start
invocation. Alternatively put it into SOLR_OPTS perhaps in solr.in.sh.
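For example, in solr.in.sh that could look like this (sketch):

  SOLR_OPTS="$SOLR_OPTS -Djava.net.preferIPv4Stack=true"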

Edit server/etc/jetty.xml, and replace the line
mentioning ContextHandlerCollection with this:




<Set name="handler">
  <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
    <Set name="white">
      <Array type="String">
        <Item>127.0.0.1</Item>
        <Item>-.-.-.-|/solr/techproducts/select</Item>
      </Array>
    </Set>
    <Set name="whiteListByPath">false</Set>
    <Set name="handler">
      <New id="Contexts" class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
    </Set>
  </New>
</Set>





This mechanism wraps ContextHandlerCollection (which ultimately serves
Solr) with this handler that adds the constraints. These constraints above
allow localhost to do anything; other IP addresses can only access
/solr/techproducts/select. That line could be duplicated for other
white-listed paths -- I recommend creating request handlers for your use,
possibly with invariants to further constrain what someone can do.

note: I originally tried inserting the IPAccessHandler in
server/contexts/solr-jetty-context.xml but found that there's a bug in
IPAccessHandler that fails to consider when HttpServletRequest.getPathInfo
is null. And it wound up letting everything through (if I recall). But I
like it up in server.xml anyway as it intercepts everything

~ David

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: How-To: Secure Solr by IP Address

2016-11-04 Thread Fuad Efendi
Yes we need that documented,

http://stackoverflow.com/questions/8924102/restricting-ip-addresses-for-jetty-and-solr


Of course Firewall is a must for extremely strong environments / large 
corporations, DMZ, and etc; IPTables is the simplest solution if you run Linux; 
my vendor 1and1.com provides firewall functionality too - but I wouldn’t trust 
it: what if local at 1and1.com servers (in the same rack for example) can 
bypass this firewall?


Having option to configure Jetty minimizes dependencies. In real production I’d 
use all possible options: firewall(s) + iptable + Jetty config + DMZ(s)


--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Search Relevancy, Recommender Systems


On November 4, 2016 at 9:28:21 AM, David Smiley (david.w.smi...@gmail.com) 
wrote:

I was just researching how to secure Solr by IP address and I finally  
figured it out. Perhaps this might go in the ref guide but I'd like to  
share it here anyhow. The scenario is where only "localhost" should have  
full unfettered access to Solr, whereas everyone else (notably web clients)  
can only access some whitelisted paths. This setup is intended for a  
single instance of Solr (not a member of a cluster); the particular config  
below would probably need adaptations for a cluster of Solr instances. The  
technique here uses a utility with Jetty called IPAccessHandler --  
http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/handler/IPAccessHandler.html
  
For reasons I don't know (and I did search), it was recently deprecated and  
there's another InetAccessHandler (not in Solr's current version of Jetty)  
but it doesn't support constraints incorporating paths, so it's a  
non-option for my needs.  

First, Java must be told to insist on its IPv4 stack. This is because
Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws  
NPEs in my experience. In recent versions of Solr, this can be easily done  
just by adding -Djava.net.preferIPv4Stack=true at the Solr start  
invocation. Alternatively put it into SOLR_OPTS perhaps in solr.in.sh.  

Edit server/etc/jetty.xml, and replace the line  
mentioning ContextHandlerCollection with this:  

  
  
  
<Set name="handler">
  <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
    <Set name="white">
      <Array type="String">
        <Item>127.0.0.1</Item>
        <Item>-.-.-.-|/solr/techproducts/select</Item>
      </Array>
    </Set>
    <Set name="whiteListByPath">false</Set>
    <Set name="handler">
      <New id="Contexts" class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
    </Set>
  </New>
</Set>
  
  
  
  

This mechanism wraps ContextHandlerCollection (which ultimately serves  
Solr) with this handler that adds the constraints. These constraints above  
allow localhost to do anything; other IP addresses can only access  
/solr/techproducts/select. That line could be duplicated for other  
white-listed paths -- I recommend creating request handlers for your use,  
possibly with invariants to further constrain what someone can do.

note: I originally tried inserting the IPAccessHandler in  
server/contexts/solr-jetty-context.xml but found that there's a bug in  
IPAccessHandler that fails to consider when HttpServletRequest.getPathInfo
is null. And it wound up letting everything through (if I recall). But I  
like it up in server.xml anyway as it intercepts everything  

~ David  

--  
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker  
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:  
http://www.solrenterprisesearchserver.com  


Re: Different Sorts based on Different Groups

2016-11-04 Thread Fuad Efendi
Hi Gustatec,


Relevancy tuning is really *huge* area, check this book when you have a
chance: https://www.manning.com/books/relevant-search

Default Solr sorting is based on TF/IDF algorithm; and sorting is not
necessarily ‘relevancy’

A trivial solution for the clothes-store domain would be this one; it is easier to
explain using examples:

Product 1

Name: "Russell Athletic Men's Basic Tank Top"
Categories: “Shirt”, “Sleeveless Shirt”, “Tank Top”

Product 2

Name: "Russell Athletic Men's Cotton Muscle Shirt"
Categories: “Shirt”, “Sleeveless Shirt”, “Tank Top”


You may notice that the first product has “Top” repeated twice, in the product name
and in a category; and the second one has “Shirt” repeated twice.

Now having this real-life example you can play with boost query, boosting
results containing words from category name in their product name.

category:”Tank Top” & bq:”name:tank^10 OR name:top^5"


Solr provides “boost query” to tune sorting of output results, check “bq”
parameter in the docs at
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
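Putting the example above into request parameters, a query within the Tank Top
category might look roughly like this (a sketch; the field names follow the
example above):

  q=tank top&defType=edismax&qf=name&fq=category:"Tank Top"&bq=name:tank^10 name:top^5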


I went from real-life scenario; your scenario and possible solutions could
be very different.

I recently had an assignment at a well-known retail shop where we even designed
pre-query custom boosts so that we could customize typical (most important
for the business) queries per the business needs.



Thanks,

--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems


On November 4, 2016 at 10:57:02 AM, Gustatec (gusta...@gmail.com) wrote:

Hello everyone!

I'm currently using Solr in a project (pretty much an e-commerce POC) and
came across with the following sort situation:

I have two products one called Product1 and other one called Product2, both
of them belongs to the same categories, Shirt(ID 1) and Tank-Top(ID 2)

When i query for any of these categories, it returns both of the products,
in the same order.

Is it possible to do some kind of grouping sort in query? So when i query
for category Shirt, it returns first Product1 then Product2 and when i do
the same query for category Tank-Top it would return first Product2 then
Product1?

By asking that i wonder if its possible to make a product more relevant,
based on the query.

So product1 relevancy would be
Category ID | Priority
1 | 1
2 | 2

And product2 would be
Category ID | Priority
1 | 2
2 | 1


Is it possible to achieve this "elevate" funcionality in query?

i thought in doing a _sort field for all categories, but we
are actually talking about a few hundred categories, so i dont know if
would
be viable to create one sort field for each one of them in every single
doc...

Ps: I asks if its achievable that in query because i dont know if there is
any other way of changing the elevate.xml file without having to restart my
solr instance

Sorry for my bad english, and thanks in advance!



-- 
View this message in context:
http://lucene.472066.n3.nabble.com/Different-Sorts-based-on-Different-Groups-tp4304516.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem with Password Decryption in Data Import Handler

2016-11-02 Thread Fuad Efendi
Then I can only guess that in the current configuration the decrypted password is an empty
string.

Try manually replacing some characters in the encpwd.txt file to see if you get
different errors; try deleting the file completely to see if you get
different errors. Try adding a new line to the file; try changing the password in the
config file.
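For example, the decryption can be checked outside of Solr with openssl itself
(a sketch, assuming the same encpwd.txt and the base64 string shown in the config
below):

  echo "U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o=" | \
    openssl enc -aes-128-cbc -d -a -salt -k "$(cat encpwd.txt)"

If the key file is correct, this prints the plain database password; if it prints
garbage or an error, the key file content does not match what was used for
encryption.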



On November 2, 2016 at 5:23:33 PM, Jamie Jackson (jamieja...@gmail.com) wrote:

I should have mentioned that I verified connectivity with plain passwords:  

From the same machine that Solr's running on:  

solr@000650cbdd5e:/opt/solr$ mysql -uroot -pOakton153 -h local.mysite.com  
mysite -e "select 'foo' as bar;"  
+-+  
| bar |  
+-+  
| foo |  
+-+  

Also, if I add the plain-text password to the config, it connects fine:  

  


So that is why I claim to have a problem with encryptKeyFile, specifically,  
because I've eliminated general connectivity/authentication problems.  

Thanks,  
Jamie  

On Wed, Nov 2, 2016 at 4:58 PM, Fuad Efendi <f...@efendi.ca> wrote:  

> In MySQL, this command will explicitly allow to connect from  
> remote ICZ2002912 host, check MySQL documentation:  
>  
> GRANT ALL ON mysite.* TO 'root'@'ICZ2002912' IDENTIFIED BY 'Oakton123';
>  
>  
>  
> On November 2, 2016 at 4:41:48 PM, Fuad Efendi (f...@efendi.ca) wrote:  
>  
> This is the root of the problem:  
> "Access denied for user 'root'@'ICZ2002912' (using password: NO) “  
>  
>  
> First of all, ensure that plain (non-encrypted) password settings work for  
> you.  
>  
> Check that you can connect using MySQL client from ICZ2002912 to your  
> MySQL & Co. instance  
>  
> I suspect you need to allow MySQL & Co. to accept connections  
> from ICZ2002912. Plus, check DNS resolution, etc.  
>  
>  
> Thanks,  
>  
>  
> --  
> Fuad Efendi  
> (416) 993-2060  
> http://www.tokenizer.ca  
> Recommender Systems  
>  
>  
> On November 2, 2016 at 2:37:08 PM, Jamie Jackson (jamieja...@gmail.com)  
> wrote:  
>  
> I'm at a brick wall. Here's the latest status:  
>  
> Here are some sample commands that I'm using:  
>  
> *Create the encryptKeyFile and encrypted password:*  
>  
>  
> encrypter_password='this_is_my_encrypter_password'  
> plain_db_pw='Oakton153'  
>  
> cd /var/docker/solr_stage2/credentials/  
> echo -n "${encrypter_password}" > encpwd.txt  
> echo -n "${plain_db_pwd}" > plaindbpwd.txt  
> openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k  
> "${encrypter_password}"  
>  
> rm plaindbpwd.txt  
>  
> That generated this as the password, by the way:  
>  
> U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o=  
>  
> *Configure DIH configuration:*  
>  
>   
>  
>  driver="org.mariadb.jdbc.Driver"  
> url="jdbc:mysql://local.mysite.com:3306/mysite"  
> user="root"  
> password="U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o="  
> encryptKeyFile="/opt/solr/credentials/encpwd.txt"  
> />  
> ...  
>  
>  
> By the way, /var/docker/solr_stage2/credentials/ is mapped to  
> /opt/solr/credentials/ in the docker container, so that's why the paths  
> *seem* different (but aren't, really).  
>  
>  
> *Authentication error when data import is run:*  
>  
> Exception while processing: question document :  
> SolrInputDocument(fields:  
> []):org.apache.solr.handler.dataimport.DataImportHandlerException:  
> Unable to execute query: select 'foo' as bar; Processing  
> Document # 1  
> at org.apache.solr.handler.dataimport.DataImportHandlerException.  
> wrapAndThrow(DataImportHandlerException.java:69)  
> at org.apache.solr.handler.dataimport.JdbcDataSource$  
> ResultSetIterator.<init>(JdbcDataSource.java:323)
> at org.apache.solr.handler.dataimport.JdbcDataSource.  
> getData(JdbcDataSource.java:283)  
> at org.apache.solr.handler.dataimport.JdbcDataSource.  
> getData(JdbcDataSource.java:52)  
> at org.apache.solr.handler.dataimport.SqlEntityProcessor.  
> initQuery(SqlEntityProcessor.java:59)  
> at org.apache.solr.handler.dataimport.SqlEntityProcessor.  
> nextRow(SqlEntityProcessor.java:73)  
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(  
> EntityProcessorWrapper.java:244)  
> at org.apache.solr.handler.dataimport.DocBuilder.  
> buildDocument(DocBuilder.java:475)  
> at org.apache.solr.handler.dataimport.DocBuilder.  
> buildDocument(DocBuilder.java:414)  
> at org.apache.solr.handler.dataimport.DocBuilder.  
> doFullDump(DocBuilder.java:329)  
> at org.apache.solr.handler.dataimport.DocBuilder.execute(  
> DocBuilder.java:232)  
> at org.apache.solr.handler.dataimport.DataImporter.  
> doFullImport(Dat

Re: Problem with Password Decryption in Data Import Handler

2016-11-02 Thread Fuad Efendi
In MySQL, this command will explicitly allow to connect from remote ICZ2002912 
host, check MySQL documentation:

GRANT ALL ON mysite.* TO 'root'@'ICZ2002912' IDENTIFIED BY 'Oakton123';



On November 2, 2016 at 4:41:48 PM, Fuad Efendi (f...@efendi.ca) wrote:

This is the root of the problem:
"Access denied for user 'root'@'ICZ2002912' (using password: NO) “


First of all, ensure that plain (non-encrypted) password settings work for you.

Check that you can connect using MySQL client from ICZ2002912 to your MySQL & 
Co. instance

I suspect you need to allow MySQL & Co. to accept connections from ICZ2002912. 
Plus, check DNS resolution, etc. 


Thanks,


--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Recommender Systems


On November 2, 2016 at 2:37:08 PM, Jamie Jackson (jamieja...@gmail.com) wrote:

I'm at a brick wall. Here's the latest status:

Here are some sample commands that I'm using:

*Create the encryptKeyFile and encrypted password:*


encrypter_password='this_is_my_encrypter_password'
plain_db_pw='Oakton153'

cd /var/docker/solr_stage2/credentials/
echo -n "${encrypter_password}" > encpwd.txt
echo -n "${plain_db_pwd}" > plaindbpwd.txt
openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k
"${encrypter_password}"

rm plaindbpwd.txt

That generated this as the password, by the way:

U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o=

*Configure DIH configuration:*




...


By the way, /var/docker/solr_stage2/credentials/ is mapped to
/opt/solr/credentials/ in the docker container, so that's why the paths
*seem* different (but aren't, really).


*Authentication error when data import is run:*

Exception while processing: question document :
SolrInputDocument(fields:
[]):org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: select 'foo' as bar; Processing
Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:323)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:283)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:52)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
Caused by: java.sql.SQLInvalidAuthorizationSpecException: Could not
connect: Access denied for user 'root'@'ICZ2002912' (using password:
NO)
at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:123)
at 
org.mariadb.jdbc.internal.util.ExceptionMapper.throwException(ExceptionMapper.java:71)
at org.mariadb.jdbc.Driver.connect(Driver.java:109)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:192)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:172)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:503)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:313)
... 12 more
Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Could
not connect: Access denied for user 'root'@'ICZ2002912' (using
password: NO)
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.authentication(AbstractConnectProtocol.java:524)
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:472)
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:374)
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:763)
at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:469)
at org.mariadb.jdbc.Driver.connect(Driver.java:104)
... 16 more



On Thu, Oct 6, 2016 at 2:42 PM, Jamie Jackson <jamieja...@gmail.com> wrote:

> It happens to be ten characters.
>
> On Thu, Oct 6, 2016 at 12:44 PM, Alexandre Rafalovitch <arafa...@gmail.com
> > wrote:
>
>> How long is the encryption key (file content)? Because the code I am
>> looking at seems to expect it to be at most 100 ch

Re: Problem with Password Decryption in Data Import Handler

2016-11-02 Thread Fuad Efendi
This is the root of the problem:
"Access denied for user 'root'@'ICZ2002912' (using password: NO) “


First of all, ensure that plain (non-encrypted) password settings work for you.

Check that you can connect using MySQL client from ICZ2002912 to your MySQL & 
Co. instance

I suspect you need to allow MySQL & Co. to accept connections from ICZ2002912. 
Plus, check DNS resolution, etc. 


Thanks,


--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Recommender Systems


On November 2, 2016 at 2:37:08 PM, Jamie Jackson (jamieja...@gmail.com) wrote:

I'm at a brick wall. Here's the latest status:  

Here are some sample commands that I'm using:  

*Create the encryptKeyFile and encrypted password:*  


encrypter_password='this_is_my_encrypter_password'  
plain_db_pw='Oakton153'  

cd /var/docker/solr_stage2/credentials/  
echo -n "${encrypter_password}" > encpwd.txt  
echo -n "${plain_db_pwd}" > plaindbpwd.txt  
openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k  
"${encrypter_password}"  

rm plaindbpwd.txt  

That generated this as the password, by the way:  

U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o=  

*Configure DIH configuration:*  

  

  
...  


By the way, /var/docker/solr_stage2/credentials/ is mapped to  
/opt/solr/credentials/ in the docker container, so that's why the paths  
*seem* different (but aren't, really).  


*Authentication error when data import is run:*  

Exception while processing: question document :  
SolrInputDocument(fields:  
[]):org.apache.solr.handler.dataimport.DataImportHandlerException:  
Unable to execute query: select 'foo' as bar; Processing  
Document # 1  
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
  
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:323)
  
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:283)
  
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:52)
  
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
  
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
  
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
  
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
  
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
  
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)  
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)  
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
  
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)  
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) 
 
Caused by: java.sql.SQLInvalidAuthorizationSpecException: Could not  
connect: Access denied for user 'root'@'ICZ2002912' (using password:  
NO)  
at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:123) 
 
at 
org.mariadb.jdbc.internal.util.ExceptionMapper.throwException(ExceptionMapper.java:71)
  
at org.mariadb.jdbc.Driver.connect(Driver.java:109)  
at 
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:192)
  
at 
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:172)
  
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:503)
  
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:313)
  
... 12 more  
Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Could  
not connect: Access denied for user 'root'@'ICZ2002912' (using  
password: NO)  
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.authentication(AbstractConnectProtocol.java:524)
  
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:472)
  
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:374)
  
at 
org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:763)
  
at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:469)  
at org.mariadb.jdbc.Driver.connect(Driver.java:104)  
... 16 more  



On Thu, Oct 6, 2016 at 2:42 PM, Jamie Jackson <jamieja...@gmail.com> wrote:  

> It happens to be ten characters.  
>  
> On Thu, Oct 6, 2016 at 12:44 PM, Alexandre Rafalovitch <arafa...@gmail.com  
> > wrote:  
>  
>> How long is the encryption key (file content)? Because the code I am  
>> looking at seems to expect it to be at most 100 characters.  
>>  
>> Regards,  
>> Alex.  
>>   
>> Newsletter and res

Re: Timeout occured while waiting response from server at: http://***/solr/commodityReview

2016-11-02 Thread Fuad Efendi
My 2 cents (rounded):

Quote: "the size of our index data is more than 30GB every year now”

- is it the size of *data* or the size of *index*? This is super important!

You can have petabytes of data, growing terabytes a year, and your index files 
will grow only few gigabytes a year at most.

Note also that Lucene index files are immutable: it means that, for example, if 
your index files total size is 25Gb in a filesystem, then having at least 
25Gb+2Gb of free RAM available (for index files + for OS) will be beneficial 
(as already mentioned in this thread).

However, caching of index files in a RAM won’t reduce search performance from 
minutes of response time to milliseconds. If you really have timeouts (and I 
believe you use at least 60 seconds timeout settings for SolrJ) then possible 
reasons could be:

1. “Shared VM” such as Amazon shared nodes, sometimes they just stop for few 
minutes
2. Garbage collection in Java
3. Sophisticated Solr query such as faceting and aggregations, with 
inadequately configured field cache and other caches


Having 100Gb index files in a filesystem cannot cause more than a few 
milliseconds response times for trivial queries such as “text:Solr”! 
(Exception: faceting)

You need to isolate (troubleshoot) your timeouts, and you mentioned it only
happens for new queries against the new searcher after replication from master to
slave. Which means case #3: improperly configured cache parameters. You need a
warm-up query. The new Solr searcher will become available after the internal caches
are warmed up (prepopulated with data).

Memory estimate example: suppose you configured Solr in such a way that it will
use a field cache for the SKU field. Suppose the SKU field is 64 bytes on average (it
may take 2 bytes per character), and you have 100 million documents. Then
you will need 64 x 100,000,000 = 6,400,000,000 bytes for just this one field cache, more
than 4Gb! This is the basic formula. If you have a few such fields, then you will
need a ton of memory, and you need a few minutes to warm up the field cache. Calculate
it properly: 8Gb or 24Gb? Consider sharding / SolrCloud if you need huge memory
just for the field cache. And you will be forced to consider it if you have more
than 2 billion documents (am I right? Lucene internal limitation,
Integer.MAX_VALUE).



Thanks,


--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Search Relevancy and Recommender Systems


On November 2, 2016 at 1:11:10 PM, Erick Erickson (erickerick...@gmail.com) 
wrote:

You need to move to SolrCloud when it's  
time to shard ;).  

More seriously, at some point simply adding more  
memory will not be adequate. Either your JVM  
heap will to grow to a point where you start encountering  
GC pauses or the time to serve requests will  
increase unacceptably. "when?" you ask? well  
unfortunately there are no guidelines that can be  
guaranteed, here's a long blog on the subject:  

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
  

The short form is you need to stress-test your  
index and query patterns.  

Now, I've seen 20M docs strain a 32G Java heap. I've  
seen 300M docs give very nice response times with  
12G of memory. It Depends (tm).  

Whether to put Solr on bare metal or not: There's  
inevitably some penalty for a VM. That said there are lots  
of places that use VMs successfully. Again, stress  
testing is the key.  

And finally, using docValues for any field that sorts,  
facets or groups will reduce the JVM requirements  
significantly, albeit by using OS memory space, see  
Uwe's excellent blog:  

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html  

Best,  
Erick  
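For example, a field used for sorting or faceting might be declared with
docValues in the schema like this (a sketch; the field name is illustrative):

  <field name="sku" type="string" indexed="true" stored="true" docValues="true"/>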

On Tue, Nov 1, 2016 at 10:23 PM, Kent Mu <solr.st...@gmail.com> wrote:  
> Thanks, I got it, Erick!  
>  
> the size of our index data is more than 30GB every year now, and it is  
> still growing up, and actually our solr now is running on a virtual  
> machine. so I wonder if we need to deploy solr in a physical machine, or I  
> can just upgrade the physical memory of our Virtual machines?  
>  
> Best,  
> Kent  
>  
> 2016-11-02 11:33 GMT+08:00 Erick Erickson <erickerick...@gmail.com>:  
>  
>> Kent: OK, I see now. Then a minor pedantic point...  
>>  
>> It'll avoid confusion if you use master and slaves  
>> rather than master and replicas when talking about  
>> non-cloud setups.  
>>  
>> The equivalent in SolrCloud is leader and replicas.  
>>  
>> No big deal either way, just FYI.  
>>  
>> Best,  
>> Erick  
>>  
>> On Tue, Nov 1, 2016 at 8:09 PM, Kent Mu <solr.st...@gmail.com> wrote:  
>> > Thanks a lot for your reply, Shawn!  
>> >  
>> > no other applications on the server, I agree with you that we need to  
>> > upgrade physical memory, and allocat

Re: Timeout occured while waiting response from server at: http://***/solr/commodityReview

2016-11-01 Thread Fuad Efendi
Quote:
It takes place not often. after analysis, we find that only when the 
replicas Synchronous Data from master solr server. it seem that when the 
replicas block search requests when synchronizing data from master, is that 
true? 


Solr makes the new searcher available after replication completes, and new *trivial*
searches should take milliseconds of response time even with zero cache tuning,
including the OS-managed filesystem caches.

However, if the first search that comes in uses faceting (which uses field caches), it
may take from seconds to many minutes just to warm up the internal
caches.

Solr has a way to warm up internal caches before making the new searcher
available:
https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig

Make these queries typical for your use cases (for instance, *:* with faceting):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- warm-up query; adjust q / facet.field to whatever your real queries use -->
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>



Thanks,

--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Search Relevancy and Recommender Systems


On November 1, 2016 at 12:07:50 PM, Kent Mu (solr.st...@gmail.com) wrote:

Hi friends!  
We come across an issue when we use the solrj(4.9.1) to connect to solr  
server, our deployment is one master with 10 replicas. we index data to the  
master, and search data from the replicas via load balancing.  

the error stack is as below:  

*Timeout occured while waiting response from server at:  
http://review.solrsearch3.cnsuning.com/solr/commodityReview  
<http://review.solrsearch3.cnsuning.com/solr/commodityReview>*  
org.apache.solr.client.solrj.SolrServerException: Timeout occured while  
waiting response from server at:  
http://review.solrsearch3.cnsuning.com/solr/commodityReview  
at  
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:562)
  
~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05]  
at  
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
  
~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05]  
at  
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
  
~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05]  
at  
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91) 
 
~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05]  
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:310)  
~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05]  

It takes place not often. after analysis, we find that only when the  
replicas Synchronous Data from master solr server. it seem that when the  
replicas block search requests when synchronizing data from master, is that  
true?  
I wonder if it is because that our solr server hardware configuration  
is too low? the physical memory is 8G with 4 cores. and the JVM we set is  
Xms512m, Xmx7168m.  

looking forward to your reply.  

Thanks!  


Foot, Inch: Stripping Out Special Characters: DisMax: WhitespaceTokenizer vs. Keyword Tokenizer

2016-03-10 Thread Fuad Efendi
Hello,


I finally got it to work: searching for 5’ 3” (5 feet 3 inches).

It is strange to me that if I use WhitespaceTokenizer in the field type's query
analyzer, it receives only 5 and 3, with the special characters removed.

It is also strange that eDisMax does not strip out an odd number of quotes.

But it works fine with KeywordTokenizer.

Any idea why? Thanks,


-- 
Fuad Efendi
http://www.tokenizer.ca
Data Mining, Vertical Search

Re: Stopping Solr JVM on OOM

2016-02-25 Thread Fuad Efendi
The best practice: do not ever try to catch Throwable or its descendants Error,
VirtualMachineError, OutOfMemoryError, etc.

Never ever.

Also, do not swallow InterruptedException in a loop.

A few simple rules to avoid a hanging application. If we follow them, there will
be no question of "what is the best way to stop Solr when it gets into an OOM state" (or just
becomes unresponsive because of swallowed exceptions).
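A minimal illustration of the first rule (the oversized allocation is just a way
to provoke the error; the snippet is a sketch, not anything from Solr itself):

  public class SwallowedErrors {
      public static void main(String[] args) {
          try {
              byte[] huge = new byte[Integer.MAX_VALUE]; // provoke an OutOfMemoryError
              System.out.println(huge.length);
          } catch (Throwable t) { // DON'T: this also traps OutOfMemoryError / VirtualMachineError
              System.err.println("swallowed: " + t); // the JVM keeps running in an unknown state
          }
      }
  }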


-- 
Fuad Efendi
416-993-2060(cell)

On February 25, 2016 at 2:37:45 PM, CP Mishra (mishr...@gmail.com) wrote:

Looking at the previous threads (and in our tests), oom script specified at  
command line does not work as OOM exception is trapped and converted to  
RuntimeException. So, what is the best way to stop Solr when it gets in OOM  
state? The only way I see is to override multiple handlers and do  
System.exit() from there. Is there a better way?  

We are using Solr with default Jetty container.  

Thanks,  
CP Mishra  


RE: Solr HTTP client authentication

2014-11-17 Thread Fuad Efendi
> I can manually create an httpclient and set up authentication but then I
> can't use solrj.

Yes, correct; except that you _can_ use SolrJ with this custom HttpClient
instance (which will intercept authentication, and which will support cookies, SSL
or plain HTTP, Keep-Alive, etc.)

You can provide to SolrJ custom HttpClient at construction:

final HttpSolrServer myHttpSolrServer =
    new HttpSolrServer(
        SOLR_URL_BASE + "/" + SOLR_CORE_NAME,
        myHttpClient);
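For completeness, a sketch of building such an HttpClient with basic-auth
credentials and handing it to SolrJ 4.x (the endpoint and credentials are
illustrative; by default the credentials are only sent in response to the
server's 401 challenge unless preemptive authentication is configured):

  import org.apache.http.auth.AuthScope;
  import org.apache.http.auth.UsernamePasswordCredentials;
  import org.apache.http.client.CredentialsProvider;
  import org.apache.http.impl.client.BasicCredentialsProvider;
  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClientBuilder;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class BasicAuthSolrClientExample {
      public static void main(String[] args) throws Exception {
          CredentialsProvider creds = new BasicCredentialsProvider();
          creds.setCredentials(AuthScope.ANY,
                  new UsernamePasswordCredentials("solruser", "secret")); // hypothetical
          CloseableHttpClient myHttpClient = HttpClientBuilder.create()
                  .setDefaultCredentialsProvider(creds)
                  .build();
          HttpSolrServer solr =
                  new HttpSolrServer("http://localhost:8983/solr/collection1", myHttpClient);
          System.out.println(solr.query(new SolrQuery("*:*")).getResults().getNumFound());
          solr.shutdown();
          myHttpClient.close();
      }
  }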


Best Regards,

http://www.tokenizer.ca


-Original Message-
From: Anurag Sharma [mailto:anura...@gmail.com] 
Sent: November-17-14 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr HTTP client authentication

I think Solr encourage SSL than authentication

On Mon, Nov 17, 2014 at 6:08 PM, Bai Shen baishen.li...@gmail.com wrote:

 I am using solrj to connect to my solr server.  However I need to 
 authenticate against the server and can not find out how to do so 
 using solrj.  Is this possible or do I need to drop solrj?  I can 
 manually create an httpclient and set up authentication but then I can't use 
 solrj.

 Thanks.




Please add me: FuadEfendi

2013-04-05 Thread Fuad Efendi
Hi,

Few months ago I was able to modify Wiki; I can't do it now, probably
because http://wiki.apache.org/solr/ContributorsGroup
 
Please add me: FuadEfendi


Thanks!


-- 
Fuad Efendi, PhD, CEO
C: (416)993-2060
F: (416)800-6479
Tokenizer Inc., Canada
http://www.tokenizer.ca






contributor group

2013-04-05 Thread Fuad Efendi
Hi,

Please add me: FuadEfendi

Thanks!




-- 
http://www.tokenizer.ca






RE: Can SOLR Index UTF-16 Text

2012-10-03 Thread Fuad Efendi
Something is missing from the body of your email... As I pointed out in my
previous message, in general Solr can index _everything_ (provided that
you have a Tokenizer for it); but, in addition to _indexing_, you need an
HTTP-based _search_ which must understand UTF-16 (for instance).

The easiest solution is to convert the files to UTF-8 before indexing and to use
UTF-8 as the default Java character encoding ( java -Dfile.encoding=UTF-8
...; including the Tomcat HTTP settings). This is really the simplest...
and the fastest performance-wise... and you should be able to use the Highlighter
feature, etc.
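A minimal Java sketch of that conversion step, assuming the input file really is
UTF-16 with a byte-order mark (the file names are illustrative):

  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  public class Utf16ToUtf8 {
      public static void main(String[] args) throws Exception {
          byte[] raw = Files.readAllBytes(Paths.get("document-utf16.txt"));
          String text = new String(raw, StandardCharsets.UTF_16);   // decode UTF-16
          Files.write(Paths.get("document-utf8.txt"),
                  text.getBytes(StandardCharsets.UTF_8));           // re-encode as UTF-8
      }
  }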


-Fuad Efendi
http://www.tokenizer.ca





-Original Message-
From: vybe3142 [mailto:vybe3...@gmail.com] 
Sent: October-03-12 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Can SOLR Index UTF-16 Text

Thanks for all the responses. Problem partially solved (see below)

1. In a sense, my question is theoretical since the input to our SOLR server
is (currently) UTF-8 files produced by a third party text extraction utility
(not Tika). On the server side, we read and index the text via a custom data
handler. Last week, I tried a UTF-16 file to see what would happen, and it
wasn't handled correctly, as explained in my original question.

2. The file is UTF 16


3. We can either (a)stream the data to SOLR in the call or (b)use the
stream.file parameter to provide the file path to the SOLR handler.

Assuming case (a)

Here's how the SOLRJ request is constructed (code edited for conciseness)



If I replace the last line with

things work 

What would I need to do in case (b), where the raw file is loaded
remotely  i.e. my handler reads the file directly



In this case, how can I control what the content type is ?

Thanks




--
View this message in context:
http://lucene.472066.n3.nabble.com/Can-SOLR-Index-UTF-16-Text-tp4010834p4011
634.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: Can SOLR Index UTF-16 Text

2012-10-03 Thread Fuad Efendi
Hi, my previous message was partially wrong:


Please note that ANY IMAGINABLE SOLUTION will use encoding/decoding; and the
real question is where it should happen:
A. The (Solr) Java container is responsible for UTF-16 -> Java String
B. The client converts UTF-16 -> UTF-8 before submitting data to the (Solr)
Java container

And the correct answer is A, because Java internally stores everything in
UTF-16. So the overhead of (Document) UTF-16 -> (Java) UTF-16 is absolutely
minimal (and performance is the best possible, although file sizes could be
higher...)

You need to start SOLR (Tomcat Java) with the parameter 

java -Dfile.encoding=UTF-16

http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html


And, possibly, configure HTTP Connector of Tomcat to UTF-16
<Connector port="8080" URIEncoding="UTF-16"/>

(and use proper encoding HTTP Request Headers when you POST your file to
Solr)



-Fuad Efendi
http://www.tokenizer.ca




-Original Message-
From: Fuad Efendi [mailto:f...@efendi.ca] 
Sent: October-03-12 1:30 PM
To: solr-user@lucene.apache.org
Subject: RE: Can SOLR Index UTF-16 Text

Something is missing from the body of your email... As I pointed out in my
previous message, in general Solr can index _everything_ (provided that
you have a Tokenizer for it); but, in addition to _indexing_, you need an
HTTP-based _search_ which must understand UTF-16 (for instance).

The easiest solution is to convert the files to UTF-8 before indexing and to use
UTF-8 as the default Java character encoding ( java -Dfile.encoding=UTF-8
...; including the Tomcat HTTP settings). This is really the simplest...
and the fastest performance-wise... and you should be able to use the Highlighter
feature, etc.


-Fuad Efendi
http://www.tokenizer.ca





-Original Message-
From: vybe3142 [mailto:vybe3...@gmail.com]
Sent: October-03-12 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Can SOLR Index UTF-16 Text

Thanks for all the responses. Problem partially solved (see below)

1. In a sense, my question is theoretical since the input to our SOLR server
is (currently) UTF-8 files produced by a third party text extraction utility
(not Tika). On the server side, we read and index the text via a custom data
handler. Last week, I tried a UTF-16 file to see what would happen, and it
wasn't handled correctly, as explained in my original question.

2. The file is UTF 16


3. We can either (a)stream the data to SOLR in the call or (b)use the
stream.file parameter to provide the file path to the SOLR handler.

Assuming case (a)

Here's how the SOLRJ request is constructed (code edited for conciseness)



If I replace the last line with

things work 

What would I need to do in case (b), where the raw file is loaded
remotely, i.e. my handler reads the file directly?



In this case, how can I control what the content type is ?

Thanks




--
View this message in context:
http://lucene.472066.n3.nabble.com/Can-SOLR-Index-UTF-16-Text-tp4010834p4011
634.html
Sent from the Solr - User mailing list archive at Nabble.com.






RE: Can SOLR Index UTF-16 Text

2012-10-02 Thread Fuad Efendi
Solr can index bytearrays too: unigram, bigram, trigram... even bitsets, 
tritsets, qatrisets ;- ) 
LOL I got strong cold... 
BTW, don't forget to configure UTF-8 as your default (Java) container 
encoding...
-Fuad






Re: UnInvertedField limitations

2012-09-06 Thread Fuad Efendi
Hi Jack,


24bit = 16M possibilities, it's clear; just to confirm... the rest is
unclear, why can 4 bytes have 4 million cardinality? I thought it is 4
billion...


And, just to confirm: UnInvertedField allows 16M cardinality, correct?




On 12-08-20 6:51 PM, Jack Krupansky j...@basetechnology.com wrote:

It appears that there is a hard limit of 24-bits or 16M for the number of
bytes to reference the terms in a single field of a single document. It
takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes,
that 
would allow 16/4 or 4 million unique terms - per document. Do you have
such 
large documents? This appears to be a hard limit based on 24 bits in a
Java 
int.

You can try facet.method=enum, but that may be too slow.
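
For illustration, a hedged SolrJ sketch of trying facet.method=enum on the field
from this thread (the server URL is an assumption; the same effect comes from
adding f.<field>.facet.method=enum to the request URL):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class EnumFacetExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.setFacet(true);
            q.addFacetField("enrich_keywords_string_mv");
            // Per-field override: walk the terms instead of un-inverting the field.
            q.set("f.enrich_keywords_string_mv.facet.method", "enum");
            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getFacetField("enrich_keywords_string_mv").getValueCount());
        }
    }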

What release of Solr are you running?

-- Jack Krupansky

-Original Message-
From: Fuad Efendi
Sent: Monday, August 20, 2012 4:34 PM
To: Solr-User@lucene.apache.org
Subject: UnInvertedField limitations

Hi All,


I have a problem... (Yonik, please!) help me, what are the term count limits? I
possibly have 256,000,000 different terms in a field... or 16,000,000?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
org.apache.solr.common.SolrException: Too many values for UnInvertedField
faceting on field enrich_keywords_string_mv
at
org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField
.j
ava:668)
at
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
at
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java
:4
23)
at
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.ja
va
:85)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHa
nd
ler.java:204)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBas
e.
java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




-- 
Fuad Efendi
http://www.tokenizer.ca







Re: UnInvertedField limitations

2012-09-06 Thread Fuad Efendi
Hi Lance,


Use case is keyword extraction, and it could be 2- and 3-grams (2- and
3- words); so that theoretically we can have 10,000^3 = 1,000,000,000,000
3-grams for English only... of course my suggestion is to use statistics and
to build a dictionary of such 3-word combinations (remove top, remove
tail, using frequencies)... And to hard-limit this dictionary to 1,000,000...
That was business requirement which technically impossible to implement
(as a realtime query results); we don't even use word stemming etc...




-Fuad




On 12-08-20 7:22 PM, Lance Norskog goks...@gmail.com wrote:

Is this required by your application? Is there any way to reduce the
number of terms?

A work around is to use shards. If your terms follow Zipf's Law each
shard will have fewer than the complete number of terms. For N shards,
each shard will have ~1/N of the singleton terms. For 2-count terms,
1/N or 2/N will have that term.

Now I'm interested but not mathematically capable: what is the general
probabilistic formula for splitting Zipf's Law across shards?
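
A hedged back-of-the-envelope, assuming documents are assigned to shards uniformly
at random and independently of their terms: a term that occurs in k documents
overall is present on a given one of N shards with probability

    P(term on shard) = 1 - (1 - 1/N)^k

so singleton terms (k = 1) land on exactly one shard, and for k = 2 the expected
fraction of shards containing the term is 2/N - 1/N^2, i.e. just under 2/N.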

On Mon, Aug 20, 2012 at 3:51 PM, Jack Krupansky j...@basetechnology.com
wrote:
 It appears that there is a hard limit of 24-bits or 16M for the number
of
 bytes to reference the terms in a single field of a single document. It
 takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes,
that
 would allow 16/4 or 4 million unique terms - per document. Do you have
such
 large documents? This appears to be a hard limit based on 24 bits in a
Java
 int.

 You can try facet.method=enum, but that may be too slow.

 What release of Solr are you running?

 -- Jack Krupansky

 -Original Message- From: Fuad Efendi
 Sent: Monday, August 20, 2012 4:34 PM
 To: Solr-User@lucene.apache.org
 Subject: UnInvertedField limitations


 Hi All,


 I have a problem... (Yonik, please!) help me, what are the term count limits? I
 possibly have 256,000,000 different terms in a field... or 16,000,000?

 Thanks!


 2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1]
- :
 org.apache.solr.common.SolrException: Too many values for
UnInvertedField
 faceting on field enrich_keywords_string_mv
at
 org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at
 
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedFiel
d.j
 ava:668)
at
 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
at
 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.jav
a:4
 23)
at
 
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206
)
at
 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.j
ava
 :85)
at
 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchH
and
 ler.java:204)
at
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa
se.
 java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




 --
 Fuad Efendi
 http://www.tokenizer.ca






-- 
Lance Norskog
goks...@gmail.com




RE: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser

2012-08-25 Thread Fuad Efendi

This is a bug in the Solr 4.0.0-Beta Schema Browser: Load Term Info shows 9682
News, but direct query shows 3577.

/solr/core0/select?q=channel:News&facet=true&facet.field=channel&rows=0

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="facet">true</str>
<str name="q">channel:News</str>
<str name="facet.field">channel</str>
<str name="rows">0</str>
</lst>
</lst>
<result name="response" numFound="3577" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="channel">
<int name="News">3577</int>
<int name="Blogs">0</int>
<int name="Message Boards">0</int>
<int name="Video">0</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
</response>


-Original Message-
Sent: August-24-12 11:29 PM
To: solr-user@lucene.apache.org
Cc: sole-...@lucene.apache.org
Subject: RE: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser
Importance: High

Any news? 
CC: Dev


-Original Message-
Subject: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser

Hi there,

Load term Info shows 3650 for a specific term MyTerm, and when I execute
query channel:MyTerm it shows 650 documents found... possibly a bug... it
happens after I commit data too, nothing changes; and this field is
single-valued non-tokenized string.

-Fuad

--
Fuad Efendi
416-993-2060
http://www.tokenizer.ca






Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser

2012-08-24 Thread Fuad Efendi
Hi there,

Load term Info shows 3650 for a specific term MyTerm, and when I execute
query channel:MyTerm it shows 650 documents found... possibly a bug... it
happens after I commit data too, nothing changes; and this field is
single-valued non-tokenized string.

-Fuad

-- 
Fuad Efendi
416-993-2060
http://www.tokenizer.ca





RE: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser

2012-08-24 Thread Fuad Efendi
Any news? 
CC: Dev


-Original Message-
Subject: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser

Hi there,

Load term Info shows 3650 for a specific term MyTerm, and when I execute
query channel:MyTerm it shows 650 documents found... possibly a bug... it
happens after I commit data too, nothing changes; and this field is
single-valued non-tokenized string.

-Fuad

--
Fuad Efendi
416-993-2060
http://www.tokenizer.ca






Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set

2012-08-20 Thread Fuad Efendi
NRT does not work because the index updates hundreds of times per second vs.
a cache warm-up time of a few minutes... and we are in a loop...

 allowing you to query
 your huge index in ms.

Solr also allows querying in ms. What is the difference? No one can sort
1,000,000 terms in descending count order faster than the current Solr
implementation, and FieldCache & UnInvertedCache can't be used together
with NRT... the cache is discarded a few times per second!

- Fuad
http://www.tokenizer.ca




On 12-08-14 8:17 AM, Nagendra Nagarajayya
nnagaraja...@transaxtions.com wrote:

You should try realtime NRT available with Apache Solr 4.0 with
RankingAlgorithm 1.4.4, allows faceting in realtime.

RankingAlgorithm 1.4.4 also provides an age feature that allows you to
retrieve the most recent changed docs in realtime, allowing you to query
your huge index in ms.

You can get more information and also download from here:

http://solr-ra.tgels.org

Regards

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

ps. Note: Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external
implementation


On 8/13/2012 11:38 AM, Fuad Efendi wrote:
 SOLR-4.0

 I am trying to implement this; funny idea to share:

 1. http://wiki.apache.org/solr/HierarchicalFaceting
 unfortunately it does not support date ranges. However, workaround: use
 String type instead of *_tdt and define fields such as
 published_hour
 published_day
 published_week
 ...

 Of course you will need to stick with timezone; but you can add an
index(es)
 for each timezone. And most important, string facets are much faster
than
 Date Trie ranges.



 2. Our index is over 100 million (from social networks) and rapidly
grows
 (millions a day); cache warm up takes few minutes; Near-Real-Time does
not
 work with faceting.

 However... another workaround: we can have Daily Core (optimized at
midnight),
 plus Current Core (only today's data, optimized), plus Last Hour Core
(near
 real time)

 Last Hour Data is small enough and we can use Facets with Near Real
Time
 feature

 Service layer will accumulate search results from three layers, it will
be
 near real time.



 Any thoughts? Thanks,









UnInvertedField limitations

2012-08-20 Thread Fuad Efendi

Hi All,


I have a problem... (Yonik, please!) help me, what are the term count limits? I
possibly have 256,000,000 different terms in a field... or 16,000,000? Can I
temporarily disable the feature?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
org.apache.solr.common.SolrException: Too many values for UnInvertedField
faceting on field enrich_keywords_string_mv
at 
org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at 
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.j
ava:668)
at 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
at 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:4
23)
at 
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
at 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java
:85)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHand
ler.java:204)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




-- 
Fuad Efendi
http://www.tokenizer.ca





UnInvertedField limitations

2012-08-20 Thread Fuad Efendi
Hi All,


I have a problem... (Yonik, please!) help me, what are the term count limits? I
possibly have 256,000,000 different terms in a field... or 16,000,000?

Thanks!


2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - :
org.apache.solr.common.SolrException: Too many values for UnInvertedField
faceting on field enrich_keywords_string_mv
at 
org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179)
at 
org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.j
ava:668)
at 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326)
at 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:4
23)
at 
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206)
at 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java
:85)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHand
ler.java:204)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)




-- 
Fuad Efendi
http://www.tokenizer.ca





Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set

2012-08-13 Thread Fuad Efendi
SOLR-4.0

I am trying to implement this; funny idea to share:

1. http://wiki.apache.org/solr/HierarchicalFaceting
unfortunately it does not support date ranges. However, workaround: use
String type instead of *_tdt and define fields such as
published_hour
published_day
published_week
...

Of course you will need to stick with one timezone; but you can add an index(es)
for each timezone. And most important, string facets are much faster than
Date Trie ranges.
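
A minimal SolrJ sketch of filling such bucket fields at index time (field names as
above; the date formats, document id and server URL are assumptions):

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DateBucketIndexer {
        public static void main(String[] args) throws Exception {
            Date published = new Date();
            SimpleDateFormat hour = new SimpleDateFormat("yyyy-MM-dd'T'HH");
            SimpleDateFormat day  = new SimpleDateFormat("yyyy-MM-dd");
            SimpleDateFormat week = new SimpleDateFormat("yyyy-'W'ww");
            TimeZone tz = TimeZone.getTimeZone("UTC");  // pick one timezone and stick with it
            hour.setTimeZone(tz); day.setTimeZone(tz); week.setTimeZone(tz);

            // Each bucket is indexed as a plain string, so faceting on it is cheap.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "post-123");
            doc.addField("published_hour", hour.format(published));
            doc.addField("published_day", day.format(published));
            doc.addField("published_week", week.format(published));

            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            server.add(doc);
            server.commit();
        }
    }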



2. Our index is over 100 million (from social networks) and rapidly grows
(millions a day); cache warm-up takes a few minutes; Near-Real-Time does not
work with faceting.

However... another workaround: we can have a Daily Core (optimized at midnight),
plus Current Core (only today's data, optimized), plus Last Hour Core (near
real time)

Last Hour Data is small enough and we can use Facets with Near Real Time
feature

Service layer will accumulate search results from three layers, it will be
near real time.



Any thoughts? Thanks,




-- 
Fuad Efendi
416-993-2060
Tokenizer Inc., Canada
http://www.tokenizer.ca
http://www.linkedin.com/in/lucene





RE: Using Solr 3.4 running on tomcat7 - very slow search

2012-07-17 Thread Fuad Efendi

 FWIW, when asked at what point one would want to split JVMs and shard, 
 on the same machine, Grant Ingersoll mentioned 16GB, and precisely for 
 GC cost reasons. You're way above that.

- his index is 75G, and Grant mentioned RAM heap size; we can use terabytes
of index with 16Gb memory.







Solr Consultant Available in Canada: Solr, HBase, Hadoop, Mahout, Lily

2012-04-16 Thread Fuad Efendi
Hi,


If anyone is interested, I am available for full-time assignments; I am
involved in Hadoop/Lucene/Solr world since 2005 (Nutch). Recently
implemented Lily-Framework-based distributed task executor which is
currently used for Vertical Search by lead insurance companies and media:
RSS, CVS, Web Services, Moreover, Web Ping, SQL-import, sitemaps-based,
intranets, and more.


Additionally to that, I can design super-rich UI extremely fast using tools
such as Liferay Portal, Apache Wicket, Vaadin.

Thanks,


-- 
Fuad Efendi
416-993-2060
Tokenizer Inc., Canada
http://www.tokenizer.ca http://www.tokenizer.ca/
http://www.linkedin.com/in/lucene





Solr Consultant Available in Canada: Solr, HBase, Hadoop, Lily

2012-04-16 Thread Fuad Efendi
Hi,


If anyone is interested, I am available for full-time assignments; I am
involved in Hadoop/Lucene/Solr world since 2005 (Nutch). Recently
implemented Lily-Framework-based distributed task executor which is
currently used for Vertical Search by leading insurance companies and media:
RSS, CSV, Web Services, Moreover, Web Ping, SQL-import, sitemaps-based,
intranets, and more.


Additionally to that, I can design super-rich UI extremely fast using tools
such as Liferay Portal, Apache Wicket, Vaadin.

Thanks,


-- 
Fuad Efendi
416-993-2060
Tokenizer Inc., Canada
http://www.tokenizer.ca http://www.tokenizer.ca/
http://www.linkedin.com/in/lucene






Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-19 Thread Fuad Efendi
I agree that SSD boosts performance... In some rare not-real-life scenario:
- super frequent commits
That's it, nothing more except the fact that Lucene compile time including 
tests takes up to two minutes on MacBook with SSD, or forty-fifty minutes on 
Windows with HDD.
Of course, with non-empty maven repository in both scenario, to be fair.


another scenario: imagine google file system is powered by SSD instead of 
cheapest HDD... HAHAHA!!!

Can we expect response time 0.1 milliseconds instead of 30-50?


And final question... Will SSD improve performance of fuzzy search? Range 
queries? Etc



I just want to say that SSD is faster than HDD but it doesn't mean anything...



-Fuad





Sent from my iPad

On 2012-01-19, at 9:40 AM, Peter Velikin pe...@velobit.com wrote:

 All,
 
 Point taken: my message should have been written more succinctly and just 
 stuck to the facts. Sorry for the sales pitch!
 
 However, I believe that adding SSD as a means to accelerate the performance 
 of your Solr cluster is an important topic to discuss on this forum. There 
 are many options for you to consider. I believe VeloBit would be the best 
 option for many, but you have choices, some of them completely free. If 
 interested, send me a note and I'll be happy to tell you about the different 
 options (free or paid) you can consider.
 
 Solr clusters are I/O bound. I am arguing that before you buy additional 
 servers, replace your existing servers with new ones, or swap your hard 
 disks, you should try adding SSD as a cache. If the promise is that adding 1 
 SSD could save you the cost of 3 additional servers, you should try it.
 
 Has anyone else tried adding SSDs as a cache to boost the performance of Solr 
 clusters? Can you share your results?
 
 
 Best regards,
 
 Peter Velikin
 VP Online Marketing, VeloBit, Inc.
 pe...@velobit.com
 tel. 978-263-4800
 mob. 617-306-7165
 
 VeloBit provides plug  play SSD caching software that dramatically 
 accelerates applications at a remarkably low cost. The software installs 
 seamlessly in less than 10 minutes and automatically tunes for fastest 
 application speed. Visit www.velobit.com for details.
 
 
 


Re: jetty error, broken pipe

2011-11-19 Thread Fuad Efendi
It's not Jetty. It is a broken TCP pipe caused by the client side. It happens when
the client closes the TCP connection.

And I even had this problem with recent Tomcat 6.


The problem disappeared after I explicitly tuned keep-alive in Tomcat and started
using a monitoring thread with HttpClient and SOLRJ...
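
For illustration, a minimal sketch of that kind of monitoring thread with SolrJ 3.x
and commons-httpclient 3.1 (class names, the 30-second idle timeout and the 10-second
check interval are assumptions, not the exact setup described above):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SolrClientFactory {
        public static CommonsHttpSolrServer create(String url) throws Exception {
            final MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
            HttpClient client = new HttpClient(mgr);
            // Background thread: drop connections idle for more than 30s so the pool
            // never reuses a socket the server side has already closed.
            Thread monitor = new Thread() {
                public void run() {
                    while (true) {
                        mgr.closeIdleConnections(30 * 1000L);
                        try { Thread.sleep(10 * 1000L); } catch (InterruptedException e) { return; }
                    }
                }
            };
            monitor.setDaemon(true);
            monitor.start();
            return new CommonsHttpSolrServer(url, client);
        }
    }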

Fuad Efendi
http://www.tokenizer.ca




Sent from my iPad

On 2011-11-19, at 9:14 PM, alx...@aim.com wrote:

 Hello,
 
 I use solr 3.4 with jetty that is included in it. Periodically, I see this 
 error in the jetty output
 
 SEVERE: org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)
at 
 org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
at 
 org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:296)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:140)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
 ...
 ...
 ...
 Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.mortbay.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:368)
at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:129)
at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:161)
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:714)
... 25 more
 
 2011-11-19 20:50:00.060:WARN::Committed before 500 
 null||org.mortbay.jetty.EofException|?at 
 org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)|?at 
 org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)|?at
  org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)|?at 
 sun.nio.cs.StreamEncoder.implFlush(S
 
 I searched web and the only advice I get is to upgrade to jetty 6.1, but I 
 think the version included in solr is 6.1.26.
 
 Any advise is appreciated.
 
 
 Thanks.
 Alex.


Re: HBase Datasource

2011-11-10 Thread Fuad Efendi
I am using Lily for atomic index updates (implemented very nicely;
transactionally; plus MapReduce; plus auto-denormalizing)

http://www.lilyproject.org

It slows down the mean time 7-10 times, but TPS stays the same



- Fuad
http://www.tokenizer.ca



Sent from my iPad

On 2011-11-10, at 9:59 PM, Mark static.void@gmail.com wrote:

 Has anyone had any success/experience with building a HBase datasource for 
 DIH? Are there any solutions available on the web?
 
 Thanks.


Re: solr keeps dying every few hours.

2011-08-17 Thread Fuad Efendi
EC2 7.5Gb (large CPU instance, $0.68/hour) sucks. Unpredictably, there are
errors such as

User time: 0 seconds
Kernel time: 0 seconds
Real time: 600 seconds

How can clock time be higher to such an extent? Only if _another_ user used
600 seconds of CPU: _virtualization_


My client has had constant problems. We are moving to dedicated hardware
(25 times cheaper on average; Amazon sells 1 TB of EBS for $100/month,
plus additional costs for I/O)


 I have a large ec2 instance(7.5 gb ram), it dies every few hours with out
 of heap memory issues.  I started upping the min memory required,
 currently I use -Xms3072M .



Large CPU instance is virtualization and behaviour is unpredictable.
Choose cluster instance with explicit Intel XEON CPU (instead of
CPU-Units) and compare behaviour; $1.60/hour. Please share results.

Thanks,





-- 
Fuad Efendi
416-993-2060
Tokenizer Inc., Canada
Data Mining, Search Engines
http://www.tokenizer.ca








On 11-08-17 5:56 PM, Jason Toy jason...@gmail.com wrote:

I've only set set minimum memory and have not set maximum memory.  I'm
doing
more investigation and I see that I have 100+ dynamic fields for my
documents, not the 10 fields I quoted earlier.  I also sort against those
dynamic fields often,  I'm reading that this potentially uses a lot of
memory.  Could this be the cause of my problems and if so what options do
I
have to deal with this?

On Wed, Aug 17, 2011 at 2:46 PM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Keep in mind that a commit warms up another searcher and potentially
 doubling
 RAM consumption in the back ground due to cache warming queries being
 executed
 (newSearcher event). Also, where is your Xmx switch? I don't know how
your
 JVM
 will behave if you set Xms  Xmx.

 65m docs is quite a lot but it should run fine with 3GB heap allocation.

 It's a good practice to use a master for indexing without any caches and
 warm-
 up queries when you exceed a certain amount of documents, it will bite.

  I have a large ec2 instance(7.5 gb ram), it dies every few hours with
out
  of heap memory issues.  I started upping the min memory required,
  currently I use -Xms3072M .
  I insert about 50k docs an hour and I currently have about 65 million
 docs
  with about 10 fields each. Is this already too much data for one box?
How
  do I know when I've reached the limit of this server? I have no idea
how
  to keep control of this issue.  Am I just supposed to keep upping the
min
  ram used for solr? How do I know what the accurate amount of ram I
should
  be using is? Must I keep adding more memory as the index size grows,
I'd
  rather the query be a little slower if I can use constant memory and
have
  the search read from disk.




-- 
- sent from my mobile
6176064373




Re: solr keeps dying every few hours.

2011-08-17 Thread Fuad Efendi
I agree with Yonik of course;
But...

You should see OOM errors in this case. In case of virtualization,
however, it is unpredictable... and if the JVM doesn't have a few bytes to output
the OOM into the log file (because we are catching Throwable and trying to
generate HTTP 500 instead !!! Freaky...)

Ok...

Sorry for not contributing a patch...


-Fuad (ZooKeeper)
http://www.OutsideIQ.com







On 11-08-17 6:01 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Wed, Aug 17, 2011 at 5:56 PM, Jason Toy jason...@gmail.com wrote:
 I've only set set minimum memory and have not set maximum memory.  I'm
doing
 more investigation and I see that I have 100+ dynamic fields for my
 documents, not the 10 fields I quoted earlier.  I also sort against
those
 dynamic fields often,  I'm reading that this potentially uses a lot of
 memory.  Could this be the cause of my problems and if so what options
do I
 have to deal with this?

Yes, that's most likely the problem.
Sorting on an integer field causes a FieldCache entry with an
int[maxDoc] (i.e. 4 bytes per document in the index, regardless of if
it has a value for that field or not).
Sorting on a string field is 4 bytes per doc in the index (the ords)
plus the memory to store the actual unique string values.
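
A rough worked example with the numbers from this thread, assuming most of the
100+ dynamic sort fields are numeric and each gets sorted on at least once:

    100 fields x 65,000,000 docs x 4 bytes/doc  =  roughly 26 GB of FieldCache entries

which is far beyond a 3 GB heap; string sort fields add their unique values on top.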

-Yonik
http://www.lucidimagination.com



 On Wed, Aug 17, 2011 at 2:46 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:

 Keep in mind that a commit warms up another searcher and potentially
 doubling
 RAM consumption in the back ground due to cache warming queries being
 executed
 (newSearcher event). Also, where is your Xmx switch? I don't know how
your
 JVM
 will behave if you set Xms  Xmx.

 65m docs is quite a lot but it should run fine with 3GB heap
allocation.

 It's a good practice to use a master for indexing without any caches
and
 warm-
 up queries when you exceed a certain amount of documents, it will bite.

  I have a large ec2 instance(7.5 gb ram), it dies every few hours
with out
  of heap memory issues.  I started upping the min memory required,
  currently I use -Xms3072M .
  I insert about 50k docs an hour and I currently have about 65 million
 docs
  with about 10 fields each. Is this already too much data for one
box? How
  do I know when I've reached the limit of this server? I have no idea
how
  to keep control of this issue.  Am I just supposed to keep upping
the min
  ram used for solr? How do I know what the accurate amount of ram I
should
  be using is? Must I keep adding more memory as the index size grows,
I'd
  rather the query be a little slower if I can use constant memory and
have
  the search read from disk.




 --
 - sent from my mobile
 6176064373





Re: solr keeps dying every few hours.

2011-08-17 Thread Fuad Efendi
I forgot to add: a company from the UK, something log-related (please have a
look at the recent LucidImagination-managed Solr Revolution conference blogs;
the company provides a log analyzer service; http://loggly.com/); they have
16,000 cores per Solr instance (multi-tenancy); of course they have at
least 100k fields per instance... they don't have any problem outside Amazon
;)))


-- 
Fuad Efendi
416-993-2060
Tokenizer Inc., Canada
Data Mining, Search Engines
http://www.tokenizer.ca








On 11-08-17 11:08 PM, Fuad Efendi f...@efendi.ca wrote:

more investigation and I see that I have 100+ dynamic fields for my
documents, not the 10 fields I quoted earlier.  I also sort against
those





Solr Performance Tuning: -XX:+AggressiveOpts

2011-07-27 Thread Fuad Efendi
Anyone tried this? I can not start Solr-Tomcat with following options on
Ubuntu:

JAVA_OPTS=$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn256m -XX:MaxPermSize=256m
JAVA_OPTS=$JAVA_OPTS -Dsolr.solr.home=/data/solr -Dfile.encoding=UTF8
-Duser.timezone=GMT
-Djava.util.logging.config.file=/data/solr/logging.properties
-Djava.net.preferIPv4Stack=true
JAVA_OPTS=$JAVA_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode  -XX:+AggressiveOpts -XX:NewSize=64m
-XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=77
-XX:+CMSParallelRemarkEnabled
JAVA_OPTS=$JAVA_OPTS -verbose:gc  -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:/data/solr/solr-gc.log


Tomcat log (something about PorterStemFilter; Solr 3.3.0):

INFO: Server startup in 2683 ms
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f5c6f36716e, pid=7713, tid=140034519381760
#
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# J  org.apache.lucene.analysis.PorterStemFilter.incrementToken()Z
#
[thread 140034523637504 also had an error]
[thread 140034520434432 also had an error]
# An error report file with more information is saved as:
# [thread 140034520434432 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#



However, I can start it and run without any problems by removing
-XX:+AggressiveOpts (which is supposed to become a default setting in upcoming
Java 6 releases)



Do we need to disable escape analysis (-XX:-DoEscapeAnalysis) as IBM suggests?
http://www-01.ibm.com/support/docview.wss?uid=swg21422605



Thanks,
Fuad Efendi

http://www.tokenizer.ca




Re: Solr Performance Tuning: -XX:+AggressiveOpts

2011-07-27 Thread Fuad Efendi
Thanks Robert!!!

Submitted On 26-JUL-2011 - yesterday.

This option was popular in HBase...


On 11-07-27 3:58 PM, Robert Muir rcm...@gmail.com wrote:

Don't use this option, these optimizations are buggy:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134


On Wed, Jul 27, 2011 at 3:56 PM, Fuad Efendi f...@efendi.ca wrote:
 Anyone tried this? I can not start Solr-Tomcat with following options on
 Ubuntu:

 JAVA_OPTS=$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn256m -XX:MaxPermSize=256m
 JAVA_OPTS=$JAVA_OPTS -Dsolr.solr.home=/data/solr -Dfile.encoding=UTF8
 -Duser.timezone=GMT
 -Djava.util.logging.config.file=/data/solr/logging.properties
 -Djava.net.preferIPv4Stack=true
 JAVA_OPTS=$JAVA_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
 -XX:+CMSIncrementalMode  -XX:+AggressiveOpts -XX:NewSize=64m
 -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=77
 -XX:+CMSParallelRemarkEnabled
 JAVA_OPTS=$JAVA_OPTS -verbose:gc  -XX:+PrintGCDetails
 -XX:+PrintGCDateStamps -Xloggc:/data/solr/solr-gc.log


 Tomcat log (something about PorterStemFilter; Solr 3.3.0):

 INFO: Server startup in 2683 ms
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x7f5c6f36716e, pid=7713, tid=140034519381760
 #
 # JRE version: 6.0_26-b03
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode
 linux-amd64 compressed oops)
 # Problematic frame:
 # J  org.apache.lucene.analysis.PorterStemFilter.incrementToken()Z
 #
 [thread 140034523637504 also had an error]
 [thread 140034520434432 also had an error]
 # An error report file with more information is saved as:
 # [thread 140034520434432 also had an error]
 #
 # If you would like to submit a bug report, please visit:
 #   http://java.sun.com/webapps/bugreport/crash.jsp
 #



 However, I can start it and run without any problems by removing
 -XX:+AggressiveOpts (which has to be default setting in upcoming
releases
 Java 6)



 Do we need to disable -XX:-DoEscapeAnalysis as IBM suggests?
 http://www-01.ibm.com/support/docview.wss?uid=swg21422605



 Thanks,
 Fuad Efendi

 http://www.tokenizer.ca






-- 
lucidimagination.com




Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
I think the question is strange... Maybe you are wondering about possible
OOM exceptions? I think we can pass to Lucene a single document containing
a comma-separated list of term, term, ... (a few billion times)... Except
stored and TermVectorComponent...

I believe thousands of companies have already indexed millions of documents with
an average size of a few hundred MBytes... There should not be any limits (except
InputSource vs. ByteArray)

100,000 _unique_ terms vs. single document containing 100,000,000,000,000
of non-unique terms (and trying to store offsets)

What about the Spell Checker feature? Has anyone tried to index a single
terabyte-sized document?

Personally, I have indexed only small (up to 1000 bytes) document fields, but
I believe 500MB is a very common use case with PDFs (which vendors use
Lucene already? Eclipse? To index the Eclipse Help file? Even Microsoft uses
Lucene...)


Fuad




On 11-06-07 7:02 PM, Erick Erickson erickerick...@gmail.com wrote:

From older (2.4) Lucene days, I once indexed the 23 volume Encyclopedia
of Michigan Civil War Volunteers in a single document/field, so it's
probably
within the realm of possibility at least G...

Erick

On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello,

 What are the biggest document fields that you've ever indexed in Solr
or that
 you've heard of?  Ah, it must be Tom's Hathi trust. :)

 I'm asking because I just heard of a case of an index where some
documents
 having a field that can be around 400 MB in size!  I'm curious if
anyone has any
 experience with such monster fields?
 Crazy?  Yes, sure.
 Doable?

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/






Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
Hi Otis,


I am recalling the pagination feature; it is still unresolved (with the default
scoring implementation): even with small documents, searching/retrieving
documents 1 to 10 can take 0 milliseconds, but from 100,000 to 100,010 can
take a few minutes (I saw it with the trunk version 6 months ago, and with very
small documents, 100 million docs in total); it is advisable to restrict search
results to the top-1000 in any case (as with Google)...



I believe things can go wrong; yes, most plain text retrieved from books
should be 2KB per page, 500 pages := 1,000,000 bytes (or double it for
UTF-8)

Theoretically, it doesn't make any sense to index a BIG document containing
all terms from a dictionary without any term frequency calcs, but even
with that... I can't imagine we should index 1000s of docs where each is just a
(different) version of the whole Wikipedia; that would be wrong design...

Ok, use case: index a single HUGE document. What will we do? Create an index
with _the_only_ document? And every search will return the same result (or
nothing)? Paginate it; split it into pages. I am pragmatic...


Fuad



On 11-06-07 8:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi,


 I think the question is strange... May be you are wondering about
possible
 OOM exceptions? 

No, that's an easier one. I was more wondering whether with 400 MB Fields
(indexed, not stored) it becomes incredibly slow to:
* analyze
* commit / write to disk
* search

 I think we can pass to Lucene single document  containing
 comma separated list of term, term, ... (few billion times)...  Except
 stored and TermVectorComponent...




Re: URGENT HELP: Improving Solr indexing time

2011-06-04 Thread Fuad Efendi
Hi Rohit,

I am currently working on https://issues.apache.org/jira/browse/SOLR-2233
which fixes multithreading issues

How complex is your dataimport schema? SOLR-2233 (multithreading, better
connection handling) improves performance... Especially if SQL is
extremely complex and uses few long-running CachedSqlEntityProcessors and
etc.

Also, check your SQL and indexes, in most cases you can _significantly_
improve performance by simply adding appropriate (for your specific SQL)
indexes. I noticed that even very experienced DBAs sometimes create index
KEY1, KEY2, and developer executes query WHERE KEY2=? ORDER BY KEY1 -
check everything...

Thanks,


-- 
Fuad Efendi
416-993-2060
Tokenizer Inc., Canada
Data Mining, Search Engines
http://www.tokenizer.ca http://www.tokenizer.ca/







On 11-06-05 12:09 AM, Rohit Gupta ro...@in-rev.com wrote:

No, I didn't double post; maybe it was in my outbox and went out again.

The queries outside solr dont take so long, to return around 50 rows
it 
takes 250 seconds, so I am doing a delta import of around 500,000 rows at
a 
time. I have tried turning auto commit  on and things are moving a bit
faster 
now. Are there any more tweeking i can do?

Also, planning to move to a master-slave model, but am failing to
understand where
to start exactly. 

Regards,
Rohit




From: lee carroll lee.a.carr...@googlemail.com
To: solr-user@lucene.apache.org
Sent: Sun, 5 June, 2011 4:59:44 AM
Subject: Re: URGENT HELP: Improving Solr indexing time

Rohit - you have double posted maybe - did Otis's answer not help with
your issue or at least need a response to clarify ?

On 4 June 2011 22:53, Chris Cowan chrisco...@plus3network.com wrote:
 How long does the query against the DB take (outside of Solr)? If
that's slow 
then it's going to take a while to update the index. You might need to
figure a 
way to break things up a bit, maybe use a delta import instead of a full
import.

 Chris

 On Jun 4, 2011, at 6:23 AM, Rohit Gupta wrote:

 My Solr server takes very long to update index. The table it hits to
index is
 huge with 10Million + records , but even in that case I feel this is
very 
long
 time to index. Below is the snapshot of the /dataimport page

 <str name="status">busy</str>
 <str name="importResponse">A command is still running...</str>
 <lst name="statusMessages">
 <str name="Time Elapsed">1:53:39.664</str>
 <str name="Total Requests made to DataSource">16276</str>
 <str name="Total Rows Fetched">24237</str>
 <str name="Total Documents Processed">16273</str>
 <str name="Total Documents Skipped">0</str>
 <str name="Full Dump Started">2011-06-04 11:25:26</str>
 </lst>

 How can i determine why this is happening and how can I improve this.
During 
all
 our test on the local server before the migration we could index 5
million
 records in 4-5 hrs, but now its taking too long on the live server.

 Regards,
 Rohit






RE: DIH: Exception with Too many connections

2011-05-31 Thread Fuad Efendi
Hi,


There is an existing bug in DataImportHandler described (and patched) at
https://issues.apache.org/jira/browse/SOLR-2233
It is not used in a thread-safe manner, and it is not appropriately closed &
reopened (why?); a new connection is opened unpredictably. It may cause
"Too many connections" even for a huge SQL-side max_connections.

If you are interested, I can continue work on SOLR-2233. CC: dev@lucene (is
anyone working on DIH improvements?)

Thanks,
Fuad Efendi
http://www.tokenizer.ca/


-Original Message-
From: François Schiettecatte [mailto:fschietteca...@gmail.com] 
Sent: May-31-11 7:44 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH: Exception with Too many connections

Hi

You might also check the 'max_user_connections' settings too if you have
that set:

# Maximum number of connections, and per user
max_connections   = 2048
max_user_connections  = 2048

http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html

Cheers

François



 So, if the number of threads in the process list is larger than 
 max_connections, I would get the too many connections error.  Am I 
 thinking the right way?
 



WIKI alerts

2011-05-31 Thread Fuad Efendi
Has anyone noticed that it doesn't work? It has been 2 weeks already.

https://issues.apache.org/jira/browse/INFRA-3667

 

I don't receive WIKI change notifications. I CC to 'Apache Wiki'
wikidi...@apache.org

 

Something is bad.

 

 

-Fuad

 

 



RE: Solr memory consumption

2011-05-31 Thread Fuad Efendi
It could be environment specific (specifics of your top command
implementation, OS, etc.)

On CentOS I have 2986m of virtual memory showing although -Xmx2g

You have 10g virtual although -Xmx6g 

Don't trust it too much... the top command may count OS buffers for opened
files, network sockets, the JVM DLLs themselves, etc. (which is outside Java GC
responsibility) in addition to JVM memory... it counts all memory, not
sure... if you don't have big values like 99.9%wa (which means I/O wait -
disk/swap usage) everything is fine...



-Original Message-
From: Denis Kuzmenok 
Sent: May-31-11 4:18 PM
To: solr-user@lucene.apache.org
Subject: Solr memory consumption

I run multiple-core Solr with flags: -Xms3g -Xmx6g -D64, but I see this
in top after 6-8 hours and it is still rising:

17485  test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java
-Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar
start.jar
  
Are there any ways to limit memory for sure?

Thanks



RE: Solr vs ElasticSearch

2011-05-31 Thread Fuad Efendi
Interesting wordings:
we want real-time search, we want simple multi-tenancy, and we want a
solution that is built for the cloud

And later,
 built on top of Lucene.

Is that possible? :)
(what does "real time search" mean anyway... and what is "cloud"?)

community is growing!

P.S.
I never used ElasticSearch, but I used Compass before moving to SOLR. And
Compass uses wording like real-time *transactional* search. Yes, it's
good and it has its own use case (small databases, reduced development time,
junior-level staff, single-JVM environment)

I'd consider the requirements first, then see which tool simplifies my
task (fulfils most requirements). It could be Elastic, or SOLR, or Compass,
or direct Lucene, or even SQL, SequenceFile, an in-memory TreeSet,
etc. It also depends on requirements, budget, team skills.


-Original Message-
From: Mark 
Sent: May-31-11 10:33 PM
To: solr-user@lucene.apache.org
Subject: Solr vs ElasticSearch

I've been hearing more and more about ElasticSearch. Can anyone give me a
rough overview on how these two technologies differ. What are the
strengths/weaknesses of each. Why would one choose one of the other?

Thanks



Re: Solr vs ElasticSearch

2011-05-31 Thread Fuad Efendi
Nice article... 2 ms is better than 20 ms, but in another chart 50 seconds is not
as good as 3 seconds... Sorry for my vision...

SOLR pushed a huge amount of performance improvements into Lucene Core...

Sent on the TELUS Mobility network with BlackBerry

-Original Message-
From: Shashi Kant sk...@sloan.mit.edu
Sender: shashi@gmail.com
Date: Wed, 1 Jun 2011 01:01:51 
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org
Subject: Re: Solr vs ElasticSearch

Here is a very interesting comparison

http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/


 -Original Message-
 From: Mark
 Sent: May-31-11 10:33 PM
 To: solr-user@lucene.apache.org
 Subject: Solr vs ElasticSearch

 I've been hearing more and more about ElasticSearch. Can anyone give me a
 rough overview on how these two technologies differ. What are the
 strengths/weaknesses of each. Why would one choose one of the other?

 Thanks




Re: Out of memory error

2010-12-07 Thread Fuad Efendi
Related: SOLR-846

Sent on the TELUS Mobility network with BlackBerry

-Original Message-
From: Erick Erickson erickerick...@gmail.com
Date: Tue, 7 Dec 2010 08:11:41 
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org
Subject: Re: Out of memory error

Have you seen this page? http://wiki.apache.org/solr/DataImportHandlerFaq
See especially batchSize, but it looks like you're already on to that.

Do you have any idea how big the records are in the database? You might
try adjusting the rambuffersize down, what is it at now?

In general, what are our Solr commit options?

Does anything get to Solr or is the OOM when the SQL is executed?
The first question to answer is whether you index anything at all...

There's a little-know DIH debug page you can access at:
.../solr/admin/dataimport.jsp that might help, and progress can be monitored
at:
.../solr/dataimport

DIH can be interesting, you get finer control with SolrJ and a direct
JDBC connection. If you don't get anywhere with DIH.

Scattergun response, but things to try...

Best
Erick

On Tue, Dec 7, 2010 at 12:03 AM, sivaprasad sivaprasa...@echidnainc.comwrote:


 Hi,

 When i am trying to import the data using DIH, iam getting Out of memory
 error.The below are the configurations which i have.

 Database:Mysql
 Os:windows
 No Of documents:15525532
 In Db-config.xml i made batch size as -1

 The solr server is running on Linux machine with tomcat.
 i set tomcat arguments as ./startup.sh -Xms1024M -Xmx2048M

 Can anybody has idea, where the things are going wrong?

 Regards,
 JS


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Out-of-memory-error-tp2031761p2031761.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Out of memory error

2010-12-06 Thread Fuad Efendi
Batch size -1??? Strange but could be a problem. 

Note also you can't provide parameters to default startup.sh command; you 
should modify setenv.sh instead

--Original Message--
From: sivaprasad
To: solr-user@lucene.apache.org
ReplyTo: solr-user@lucene.apache.org
Subject: Out of memory error
Sent: Dec 7, 2010 12:03 AM


Hi,

When i am trying to import the data using DIH, iam getting Out of memory
error.The below are the configurations which i have.

Database:Mysql
Os:windows
No Of documents:15525532
In Db-config.xml i made batch size as -1

The solr server is running on Linux machine with tomcat.
i set tomcat arguments as ./startup.sh -Xms1024M -Xmx2048M

Can anybody has idea, where the things are going wrong?

Regards,
JS


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Out-of-memory-error-tp2031761p2031761.html
Sent from the Solr - User mailing list archive at Nabble.com.


Sent on the TELUS Mobility network with BlackBerry

Re: Dataimporthandler crashed raidcontroller

2010-11-04 Thread Fuad Efendi
I experienced similar problems. It was because we didn't perform load stress 
tests properly, before going to production. Nothing is forever, replace 
controller, change hardware vendor, maintain low temperature inside a rack. 
Thanks
--Original Message--
From: Robert Gründler
To: solr-user@lucene.apache.org
ReplyTo: solr-user@lucene.apache.org
Subject: Dataimporthandler crashed raidcontroller
Sent: Nov 4, 2010 7:21 PM

Hi all,

we had a severe problem with our raidcontroller on one of our servers today 
during importing a table with ~8 million rows into a solr index. After 
importing about 4 million
documents, our server shutdown, and failed to restart due to a corrupt raid
disk. 

The Solr data import was the only heavy process running on that machine during
the crash.

Has anyone experienced hdd/raid-related problems during indexing large sql 
databases into solr?


thanks!


-robert

 




Sent on the TELUS Mobility network with BlackBerry

RE: Need feedback on solr security

2010-02-17 Thread Fuad Efendi
 You could set a firewall that forbid any connection to your Solr's
 server port to everyone, except the computer that host your application
 that connect to Solr.
 So, only your application will be able to connect to Solr.


I believe firewalling is the only possible solution since SOLR doesn't use
cookies/sessionIDs

However, 'firewall' can be implemented as an Apache HTTPD Server (or any
other front-end configured to authenticate users). (you can even configure
CISCO PIX (etc.) Firewall to authenticate users.)

HTTPD is easiest, but I haven't tried.

But again, if your use case is many users, many IPs you need good
front-end (web application); if it is not the case - just restrict access to
specific IP.


-Fuad
http://www.tokenizer.ca





RE: Need feedback on solr security

2010-02-17 Thread Fuad Efendi
 For Making by solr admin password protected,
  I had used the Path Based Authentication form
 http://wiki.apache.org/solr/SolrSecurity.
 In this way my admin area,search,delete,add to index is protected.But
 Now
 when I make solr authenticated then for every update/delete from the
 fornt
 end is blocked without authentication.


Correct, SOLR doesn't use HTTP Session (Session Cookies, Session IDs); and
it shouldn't do that.

If you have such use case (Authenticated Session) you will need front-end
web application.




Range Queries, Geospatial

2010-02-16 Thread Fuad Efendi
Hi,


I've read very interesting interview with Ryan,
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and
-Videos/Interview-Ryan-McKinley

Another finding is 
https://issues.apache.org/jira/browse/SOLR-773
(lucene/contrib/spatial)

Is there any more stuff going on for SOLR 1.5 (and existing SOLR 1.4)?

I need filtering on 2-dimension like x:[1 TO 10100] y:[7900 TO 8000]
(that's why I need SOLR:)))
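
For illustration, a hedged SolrJ sketch of that 2-dimensional filtering as two
numeric range filter queries (field names and ranges taken from the example above;
the server URL is an assumption):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class TwoDimRangeFilter {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            // Each dimension becomes its own cached filter query.
            q.addFilterQuery("x:[1 TO 10100]");
            q.addFilterQuery("y:[7900 TO 8000]");
            System.out.println(server.query(q).getResults().getNumFound());
        }
    }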

Any thoughts? I'd love to implement something quick-simple-efficient if it
doesn't exist yet, like R-Tree (http://en.wikipedia.org/wiki/R-tree), or
Geohash (http://en.wikipedia.org/wiki/Geohash)

I haven't tried Local Lucene and SOLR-773 yet.

Thanks!





RE: For caches, any reason to not set initialSize and size to the same value?

2010-02-12 Thread Fuad Efendi
Funny, Arrays.copy() for HashMap... but something similar...

Anyway, I use the same values for initial size and max size, to be safe... and
to have OOM at startup :)



 -Original Message-
 From: Fuad Efendi [mailto:f...@efendi.ca]
 Sent: February-12-10 6:55 PM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject: RE: For caches, any reason to not set initialSize and size to
 the same value?
 
 I always use initial size = max size,
 just to avoid Arrays.copyOf()...
 
 Initial (default) capacity for HashMap is 16, when it is not enough -
 array
 copy to new 32-element array, then to 64, ...
 - too much wasted space! (same for ConcurrentHashMap)
 
 Excuse me if I didn't understand the question...
 
 -Fuad
 http://www.tokenizer.ca
 
 
 
  -Original Message-
  From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
  Seeley
  Sent: February-12-10 6:30 PM
  To: solr-user@lucene.apache.org
  Subject: Re: For caches, any reason to not set initialSize and size to
  the same value?
 
  On Fri, Feb 12, 2010 at 5:23 PM, Jay Hill jayallenh...@gmail.com
  wrote:
   If I've done a lot of research and have a very good idea of where my
  cache
   sizes are having monitored the stats right before commits, is there
  any
   reason why I wouldn't just set the initialSize and size counts to
 the
  same
   values? Is there any reason to set a smaller initialSize if I know
  reliably
   that where my limit will almost always be?
 
  Probably not much...
  The only savings will be the 8 bytes (on a 64 bit proc) per unused
  array slot (in the HashMap).
  Maybe we should consider removing the initialSize param from the
  example config to reduce the amount of stuff a user needs to think
  about.
 
  -Yonik
  http://www.lucidimagination.com
 





RE: expire/delete documents

2010-02-12 Thread Fuad Efendi
 or since you specifically asked about deleting anything older
 than X days (in this example i'm assuming x=7)...
 
   <delete><query>createTime:[NOW-7DAYS TO *]</query></delete>

createTime:[* TO NOW-7DAYS]
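
For reference, the equivalent from SolrJ (a minimal sketch; the server URL is an
assumption):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class PurgeOldDocs {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Delete everything older than 7 days, then commit.
            server.deleteByQuery("createTime:[* TO NOW-7DAYS]");
            server.commit();
        }
    }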






RE: analysing wild carded terms

2010-02-10 Thread Fuad Efendi
 hello *, quick question, what would i have to change in the query
 parser to allow wildcarded terms to go through text analysis?

I believe it is illogical. Wildcarded terms will go through the terms
enumerator.




RE: Solr integration with document management systems

2010-02-06 Thread Fuad Efendi

SOLR doesn't come with such things...
Look at www.liferay.com; they have plugin for SOLR (in SVN trunk) so that
all documents / assets can be automatically indexed by SOLR (and you have
full freedom with defining specific SOLR schema settings); their portlets
support WebDAV, and Open Office looks almost like Sharepoint
-Fuad


 -Original Message-
 From: ST ST [mailto:stst2...@gmail.com]
 Sent: February-06-10 6:46 PM
 To: solr-user@lucene.apache.org
 Subject: Solr integration with document management systems
 
 Folks,
 
 Does Solr 1.4 come with integration with existing document management
 systems ?
 
 Are there any other open source projects based on Solr which provide
 this
 capability ?
 
 Thanks




RE: Fundamental questions of how to build up solr for huge portals

2010-02-05 Thread Fuad Efendi
 - whats the best way to use solr to get the best performance for an huge
 portal with 5000 users that might expense fastly?

5000 users:
200 TPS, for instance, equals 12,000 concurrent users (each user makes 1
request per minute); so a single SOLR instance is more than enough.

Why 200 TPS? It is the bottom line, for fuzzy search (I recently improved it).

In real life, on real hardware, 1000 TPS (using caching, not frequently using
fuzzy search, etc.), which is equal to 60,000 concurrent users, and subsequently
to more than 600,000 total users.

The rest depends on your design...

If you have separate portals A, B, C - create a field with values A, B, C.

Liferay Portal nicely integrates with SOLR... each kind of Portlet object
(Forum Post, Document, Journal Article, etc.) can implement searchable and
be automatically indexed. But Liferay is Java-based, JSR-168, JSR-286 (and
it supports PHP-portlets, but I never tried).

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay


 -Original Message-
 From: Peter [mailto:zarato...@gmx.net]
 Sent: January-16-10 10:17 AM
 To: solr-user@lucene.apache.org
 Subject: Fundamental questions of how to build up solr for huge portals
 
 Hello!
 
 Our team wants to use solr for an community portal built up out of 3 and
 more sub portals. We are unsure in which way we sould build up the whole
 architecture, because we have more than one portal and we want to make
 them all connected and searchable by solr. Could some experts help us on
 these questions?
 
 - whats the best way to use solr to get the best performance for an huge
 portal with 5000 users that might expense fastly?
 - which client to use (Java,PHP...)? Now the portal is almost PHP/MySQL
 based. But we want to make solr as best as it could be in all ways
 (performace, accesibility, way of good programming, using the whole
 features of lucene - like tagging, facetting and so on...)
 
 
 We are thankful of every suggestions :)
 
 Thanks,
 Peter




RE: Solr response extremely slow

2010-02-04 Thread Fuad Efendi
'!'
:)))

Plus, FastLRUCache (previous one was synchronized)
(and of course warming-up time) := start complains after ensuring there are
no complains :)
(and of course OS needs time to cache filesystem blocks, and Java HotSpot,
... - few minutes at least...)

 On Feb 3, 2010, at 1:38 PM, Rajat Garg wrote:
  Solr Specification Version: 1.3.0
  Solr Implementation Version: 1.3.0 694707 - grantingersoll -
  2008-09-12
  11:06:47
 
 There's the problem right there... that grantingersoll guy :)
 
 (kidding)
 
 
 Sounds like you're just hitting cache warming which can take a while.
 
 Have you tried Solr 1.4?  Faceting performance, for example, is
 dramatically improved, among many other improvements.
 
   Erik





RE: fuzzy matching / configurable distance function?

2010-02-04 Thread Fuad Efendi

The Levenshtein algo is currently hardcoded (FuzzyTermEnum class) in Lucene 2.9.1
and 3.0...
There are samples of other distances in the contrib folder.
If you want to play with distance, check
http://issues.apache.org/jira/browse/LUCENE-2230
It works if the distance is an integer and follows the metric space axioms:
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)


Probably SOLR can provide more freedom with plugged-in distances...

-Fuad


 -Original Message-
 From: Joe Calderon [mailto:calderon@gmail.com]
 Sent: February-04-10 2:34 PM
 To: solr-user@lucene.apache.org
 Subject: fuzzy matching / configurable distance function?
 
 is it possible to configure the distance formula used by fuzzy
 matching? i see there are other under the function query page under
 strdist but im wondering if they are applicable to fuzzy matching
 
 thx much
 
 
 --joe




SOLR Performance Tuning: Fuzzy Search

2010-02-03 Thread Fuad Efendi
I was lucky to contribute an excellent solution: 
http://issues.apache.org/jira/browse/LUCENE-2230

Even the 2nd edition of Lucene in Action advocates using fuzzy search only in
exceptional cases.


Another solution would be 2-step indexing (it may work for many use cases),
but it is not a spellchecker

1. Create a regular index
2. Create a dictionary of terms
3. For each term, find nearest terms (for instance, stick with distance=2)
4. Use copyField in SOLR, or something similar to a synonym dictionary; or, for
instance, generate a specific Query Parser...
5. Of course, a custom request handler,
etc.

It may work well (but only if the query contains a term from the dictionary; it
can't work as a spellchecker).

Combining the 2 algos can boost performance dramatically...
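
A minimal sketch of step 3 above (plain dynamic-programming Levenshtein over an
in-memory dictionary; illustration only, and a real implementation would reuse
Lucene's fuzzy term enumeration instead of this hypothetical helper):

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    public class NeighborTerms {

        // All dictionary terms within edit distance 2 of the given term.
        static List<String> neighbors(String term, Collection<String> dictionary) {
            List<String> out = new ArrayList<String>();
            for (String other : dictionary) {
                if (!other.equals(term) && levenshtein(term, other) <= 2) {
                    out.add(other);
                }
            }
            return out;
        }

        // Classic two-row dynamic-programming edit distance.
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] cur = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                cur[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = cur; cur = tmp;
            }
            return prev[b.length()];
        }
    }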


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search






RE: Comparison of Solr with Sharepoint Search

2010-01-26 Thread Fuad Efendi

I can only tell that the Liferay Portal (WebDAV) Document Library Portlet has the
same functionality as Sharepoint (it even has a /servlet/ URL with the suffix
'/sharepoint'); Liferay also has a plugin (web-hook) for SOLR (it has a generic
search wrapper; any kind of search service provider can be hooked into
Liferay)
All assets (web content, message board posts, documents, and etc.) can
implement indexing interface and get indexed (Lucene, SOLR, etc)

So far, it is the best approach. You can enjoy configuring SOLR
analyzers/fields/language/stemmers/dictionaries/... You can't do it with
MS-Sharepoint (or, for instance, their close competitors Alfresco)!!!

-Fuad
http://www.tokenizer.ca


 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: January-26-10 7:49 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Comparison of Solr with Sharepoint Search
 
 
 : Has anyone done a functionality comparison of Solr with
 Sharepoint/Fast
 : Search?
 
 there's been some discussion on this over the years comparing Solr with
 FAST if you go looking for it...
 
 http://old.nabble.com/SOLR-X-FAST-to14284618.html
 http://old.nabble.com/Replacing-FAST-functionality-at-sesam.no-
 td19186109.html
 http://old.nabble.com/Experiences-from-migrating-from-FAST-to-Solr-
 td26371613.html
 http://sesat.no/moving-from-fast-to-solr-review.html
 
 ...I have no idea about Sharepoint Search (isn't that actually a separate
 system? ... Microsoft Search Server or something?)
 
 
 -Hoss





RE: Solr vs. Compass

2010-01-25 Thread Fuad Efendi
  Why to embed indexing as a transaction dependency? Extremely weird
 idea.
 There is nothing weird about different use cases requiring different
 approaches
 
 If you're just thinking documents and text search ... then its less of
 an issue.
 If you have an online application where the indexing is being used to
 drive certain features (not just search), then the transactionality is
 quite useful.


I mean:
- a Primary Key Constraint in an RDBMS is not the same as an index
- an index in an RDBMS: the data is still searchable even if we don't have the index

Are you sure that an index in an RDBMS is part of the transaction in current
implementations from Oracle, IBM, Sun? I have never heard of such a thing; there are no
such requirements for transactions. I am talking about transactions and
referential integrity, and not about an indexed non-tokenized single-valued
field Social Insurance Number. It could be done asynchronously outside of the
transaction; I can't imagine a use case where it must be done inside the
transaction, failing the transaction when it can't be done.

A Primary Key Constraint is a different use case; it is not necessarily
indexing of the data. Especially for Hibernate, where we mostly use surrogate
auto-generated keys.

 
-Fuad




RE: Solr vs. Compass

2010-01-25 Thread Fuad Efendi

  Even if commit takes 20 minutes?
 I've never seen a commit take 20 minutes... (anything taking that long
 is broken, perhaps in concept)


An index merge can take from a few minutes to a few hours. That's why nothing can
beat SOLR Master/Slave and sharding for huge datasets. And reopening the
IndexReader after each commit may take at least a few seconds (although it
depends on usage patterns).

IndexReader or IndexSearcher will only see the index as of the point in
time that it was opened. Any changes committed to the index after the
reader was opened are not visible until the reader is re-opened.
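
As a minimal sketch of what "re-opened" means with the Lucene API of that era
(reader is assumed to be an already-open IndexReader; error handling omitted;
reopen() returns a new reader only when the index has changed):

    IndexReader newReader = reader.reopen();   // cheap if nothing was committed
    if (newReader != reader) {
        reader.close();        // release the old point-in-time view
        reader = newReader;    // searches now see the newly committed changes
    }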


I am wondering how Compass opens a new instance of IndexReader (after each
commit!) - is it really implemented that way? I can't believe it! It will probably work
fine for small datasets (less than 100k) and 1 TPD (transaction per day)...
 

Very expensive and unnatural ACID...


-Fuad




Is there limit on size of query string?

2010-01-22 Thread Fuad Efendi
Is there a limit on the size of the query string?

It looks like I get exceptions when the query string is longer than 400 characters
(on average)

Thanks!




RE: Solr vs. Compass

2010-01-22 Thread Fuad Efendi
Yes, transactional; I tried it: do we really need transactional? Even if a
commit takes 20 minutes?
It's their selling point, nothing more.
HBase is not transactional, and it has its specific use case; each tool has a
specific use case... in some cases Compass is the best!

Also, note that Compass (like Hibernate and an RDBMS) uses specific business domain
model terms with relationships; there is huge overhead in converting relational into
object-oriented (what for? Any advantages?)... Lucene does it
behind the scenes: you don't have to worry that the field value USA (3 characters) is
repeated in a few million documents, and the value Canada (6 characters) in
another few million; nothing relational, it's done automatically without any
Compass/Hibernate/Table(s)


Don't think relational.

I wrote this 2 years ago:
http://www.theserverside.com/news/thread.tss?thread_id=50711#272351


Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca/


 -Original Message-
 From: Uri Boness [mailto:ubon...@gmail.com]
 Sent: January-21-10 11:35 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr vs. Compass
 
 In addition, the biggest appealing feature in Compass is that it's
 transactional and therefore integrates well with your infrastructure
 (Spring/EJB, Hibernate, JPA, etc...). This obviously is nice for some
 systems (not very large scale ones) and the programming model is clean.
 On the other hand, Solr scales much better and provides a load of
 functionality that otherwise you'll have to custom build on top of
 Compass/Lucene.
 
 Lukáš Vlček wrote:
  Hi,
 
  I think that these products do not compete directly that much, each
 fit
  different business case. Can you tell us more about our specific
 situation?
  What do you need to search and where your data is? (DB, Filesystem,
 Web
  ...?)
 
  Solr provides some specific extensions which are not supported
 directly by
  Lucene (faceted search, DisMax... etc) so if you need these then your
 bet on
  Compass might not be perfect. On the other hand if you need to index
  persistent Java objects then Compass fits perfectly into this scenario
 (and
  if you are using Spring and JPA then setting up search can be matter
 of
  several modifications to configuration and annotations).
 
  Compass is more Hibernate search competitor (but Compass is not
 limited to
  Hibernate only and is not even limited to DB content as well).
 
  Regards,
  Lukas
 
 
  On Thu, Jan 21, 2010 at 4:40 PM, Ken Lane (kenlane)
 kenl...@cisco.comwrote:
 
 
  We are knee-deep in a Solr project to provide a web services layer
  between our Oracle DB's and a web front end to be named later  to
  supplement our numerous Business Intelligence dashboards. Someone
 from a
  peer group questioned why we selected Solr rather than Compass to
 start
  development. The real reason is that we had not heard of Compass
 until
  that comment. Now I need to come up with a better answer.
 
 
 
  Does anyone out there have experience in both approaches who might be
  able to give a quick compare and contrast?
 
 
 
  Thanks in advance,
 
  Ken
 
 
 
 
 




RE: Solr vs. Compass

2010-01-22 Thread Fuad Efendi
Of course, I understand what a transaction means; have you guys thought about
what may happen if we transfer $123.45 from one bank account to
another bank account, and MySQL forgets to index the decimal during the
transaction, or the DBA forgot to create an index? Absolutely nothing.

Why embed indexing as a transaction dependency? An extremely weird idea. But
I understand some of the selling points...


SOLR: it is faster than Lucene. Filtered queries run faster than traditional
AND queries! And this is a real selling point.



Thanks,

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search


 -Original Message-
 From: Fuad Efendi [mailto:f...@efendi.ca]
 Sent: January-22-10 11:23 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Solr vs. Compass
 
 Yes, transactional, I tried it: do we really need transactional?
 Even if commit takes 20 minutes?
 It's their selling point nothing more.
 HBase is not transactional, and it has specific use case; each tool has
 specific use case... in some cases Compass is the best!
 
 Also, note that Compass (Hibernate) ((RDBMS)) use specific business
 domain model terms with relationships; huge overhead to convert
 relational into object-oriented (why for? Any advantages?)... Lucene
 does it behind-the-scenes: you don't have to worry that field USA (3
 characters) is repeated in few millions documents, and field Canada (6
 characters) in another few; no any relational, it's done automatically
 without any Compass/Hibernate/Table(s)
 
 
 Don't think relational.
 
 I wrote this 2 years ago:
 http://www.theserverside.com/news/thread.tss?thread_id=50711#272351
 
 
 Fuad Efendi
 +1 416-993-2060
 http://www.tokenizer.ca/
 
 
  -Original Message-
  From: Uri Boness [mailto:ubon...@gmail.com]
  Sent: January-21-10 11:35 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr vs. Compass
 
  In addition, the biggest appealing feature in Compass is that it's
  transactional and therefore integrates well with your infrastructure
  (Spring/EJB, Hibernate, JPA, etc...). This obviously is nice for some
  systems (not very large scale ones) and the programming model is
 clean.
  On the other hand, Solr scales much better and provides a load of
  functionality that otherwise you'll have to custom build on top of
  Compass/Lucene.
 
  Lukáš Vlček wrote:
   Hi,
  
   I think that these products do not compete directly that much, each
  fit
   different business case. Can you tell us more about our specific
  situation?
   What do you need to search and where your data is? (DB, Filesystem,
  Web
   ...?)
  
   Solr provides some specific extensions which are not supported
  directly by
   Lucene (faceted search, DisMax... etc) so if you need these then
 your
  bet on
   Compass might not be perfect. On the other hand if you need to index
   persistent Java objects then Compass fits perfectly into this
 scenario
  (and
   if you are using Spring and JPA then setting up search can be matter
  of
   several modifications to configuration and annotations).
  
   Compass is more Hibernate search competitor (but Compass is not
  limited to
   Hibernate only and is not even limited to DB content as well).
  
   Regards,
   Lukas
  
  
   On Thu, Jan 21, 2010 at 4:40 PM, Ken Lane (kenlane)
  kenl...@cisco.comwrote:
  
  
   We are knee-deep in a Solr project to provide a web services layer
   between our Oracle DB's and a web front end to be named later  to
   supplement our numerous Business Intelligence dashboards. Someone
  from a
   peer group questioned why we selected Solr rather than Compass to
  start
   development. The real reason is that we had not heard of Compass
  until
   that comment. Now I need to come up with a better answer.
  
  
  
   Does anyone out there have experience in both approaches who might
 be
   able to give a quick compare and contrast?
  
  
  
   Thanks in advance,
  
   Ken
  
  
  
  
  
 





RE: SOLR Performance Tuning: Fuzzy Searches, Distance, BK-Tree

2010-01-22 Thread Fuad Efendi
http://issues.apache.org/jira/browse/LUCENE-2230
Enjoy!


 -Original Message-
 From: Fuad Efendi [mailto:f...@efendi.ca]
 Sent: January-19-10 11:32 PM
 To: solr-user@lucene.apache.org
 Subject: SOLR Performance Tuning: Fuzzy Searches, Distance, BK-Tree
 
 Hi,
 
 
 I am wondering: will SOLR or Lucene use caches for fuzzy searches? I
 mean
 per-term caching or something, internal to Lucene, or may be SOLR (SOLR
 may
 use own query parser)...
 
 Anyway, I implemented BK-Tree and playing with it right now, I altered
 FuzzyTermEnum class of Lucene...
 http://en.wikipedia.org/wiki/BK-tree
 
 - it seems performance of fuzzy searches boosted at least hundred times,
 but
 I need to do more tests... repeated similar (slightly different) queries
 run
 with better performance, probably because of OS-level file caching...
 but it
 could be that of BK-Tree distance! (although I need to use classic int
 instead of float distance by Lucene/Levenstein etc.)
 
 Thanks,
 Fuad Efendi
 +1 416-993-2060
 http://www.tokenizer.ca/
 Data Mining, Vertical Search
 
 
 





SOLR Performance Tuning: Fuzzy Searches, Distance, BK-Tree

2010-01-19 Thread Fuad Efendi
Hi,


I am wondering: will SOLR or Lucene use caches for fuzzy searches? I mean
per-term caching or something, internal to Lucene, or maybe in SOLR (SOLR may
use its own query parser)...

Anyway, I implemented a BK-tree and am playing with it right now; I altered
the FuzzyTermEnum class of Lucene...
http://en.wikipedia.org/wiki/BK-tree

- it seems the performance of fuzzy searches is boosted at least a hundred times, but
I need to do more tests... repeated similar (slightly different) queries run
with better performance, probably because of OS-level file caching... but it
could be the BK-tree distance! (although I need to use a classic int distance
instead of the float distance of Lucene/Levenshtein etc.)
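
For illustration, a rough sketch of the BK-tree idea described above (not the
actual altered FuzzyTermEnum; the class name BKTree and the simple distance
method below are just for this example):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class BKTree {
        private final String term;
        private final Map<Integer, BKTree> children = new HashMap<Integer, BKTree>();

        BKTree(String term) { this.term = term; }

        void add(String newTerm) {
            int d = distance(newTerm, term);
            if (d == 0) return;                     // term already present
            BKTree child = children.get(d);
            if (child == null) children.put(d, new BKTree(newTerm));
            else child.add(newTerm);
        }

        // collect every term within maxDist of the query; the triangle inequality
        // lets us skip any child whose edge label is outside [d-maxDist, d+maxDist]
        void search(String query, int maxDist, List<String> results) {
            int d = distance(query, term);
            if (d <= maxDist) results.add(term);
            for (Map.Entry<Integer, BKTree> e : children.entrySet()) {
                if (Math.abs(e.getKey() - d) <= maxDist) {
                    e.getValue().search(query, maxDist, results);
                }
            }
        }

        // plain integer Levenshtein distance, used as the metric
        static int distance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] cur = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                cur[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = cur; cur = tmp;
            }
            return prev[b.length()];
        }
    }

Building the tree once from the term dictionary and querying it with maxDist=1
or 2 avoids comparing the query against every term, which is where the speedup
comes from.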

Thanks,
Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca/
Data Mining, Vertical Search






RE: SOLR: Replication

2010-01-03 Thread Fuad Efendi
Thank you Yonik, excellent wiki! I'll try without APR; I believe it's an
environmental issue; a switched 100Mbps network should be 10 times faster (the current
replication speed is 1 Mbyte/sec)


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: January-03-10 10:03 AM
 To: solr-user@lucene.apache.org
 Subject: Re: SOLR: Replication
 
 On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi f...@efendi.ca wrote:
  I tried... I set APR to improve performance... server is slow while
 replica;
  but top shows only 1% of I/O wait... it is probably environment
 specific;
 
 So you're saying that stock tomcat (non-native APR) was also 10 times
 slower?
 
  but the same happened in my home-based network, rsync was 10 times
 faster...
  I don't know details of HTTP-replica, it could be base64 or something
 like
  that; RAM-buffer, flush to disk, etc.
 
 The HTTP replication is using binary.
 If you look here, it was benchmarked to be nearly as fast as rsync:
 http://wiki.apache.org/solr/SolrReplication
 
 It does do a fsync to make sure that the files are on disk after
 downloading, but that shouldn't make too much difference.
 
 -Yonik
 http://www.lucidimagination.com




SOLR: Replication

2010-01-02 Thread Fuad Efendi
I used rsync before, and a 20Gb replication took less than an hour (20-40
minutes); now, with HTTP, it takes 5-6 hours...
The admin screen shows 952Kb/sec average speed; 100Mbps network, full-duplex; I
am using Tomcat Native for APR. 10 times slower...
-Fuad
http://www.tokenizer.ca





RE: SOLR: Replication

2010-01-02 Thread Fuad Efendi

Hi Yonik,

I tried... I set APR to improve performance... the server is slow while replicating;
but top shows only 1% I/O wait... it is probably environment specific;
the same happened in my home network, where rsync was 10 times faster...
I don't know the details of HTTP replication; it could be base64 or something like
that; RAM buffer, flush to disk, etc.
-Fuad


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: January-02-10 5:52 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SOLR: Replication
 
 On Sat, Jan 2, 2010 at 5:48 PM, Fuad Efendi f...@efendi.ca wrote:
  I used RSYNC before, and 20Gb replica took less than an hour (20-40
  minutes); now, HTTP, and it takes 5-6 hours...
  Admin screen shows 952Kb/sec average speed; 100Mbps network, full-
 duplex; I
  am using Tomcat Native for APR. 10x times slow...
 
 Hmmm, did you try w/o native APR?
 
 -Yonik
 http://www.lucidimagination.com




SOLR: Portlet (Plugin) for Liferay Portal

2009-12-25 Thread Fuad Efendi
SOLR Users
==


I am in the middle of developing a generic (configurable) _portlet_
(JSR-286) for Liferay Portal (MIT-like license) which I am going to share.

Have a look at (my profile) powered by Liferay Portal:
http://www.tokenizer.ca/web/guru :smile:
Home page: http://www.tokenizer.ca/

http://www.liferay.com - native multi-hosting support (I can power multiple
DNS names with a single Tomcat-Liferay instance; I can even assign DNS names to personal
profiles)

BTW, Liferay Portal has a generic wrapper around Lucene, and recently SOLR!
All content (including Articles, BLOGs, Documents, Pages, WIKIs, Forum
Posts) is automatically indexed. Having a separate SOLR instance definitely helps:
instead of hardcoding (with Lucene) we can now intelligently manage stop
words, stemming, language settings, and more.


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search






SOLR Performance Tuning: Pagination

2009-12-24 Thread Fuad Efendi
I used pagination for a while till I found this...


I have a filtered query ID:[* TO *] returning 20 million results (no
faceting), and pagination always seemed to be fast. However, it is fast only with
low values of start, e.g. start=12345. Queries like start=28838540 take 40-60 seconds,
and even cause an OutOfMemoryException.

I use highlighting and faceting on a non-tokenized Country field, with the standard handler.


It even seems to be a bug...


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search






RE: SOLR Performance Tuning: Pagination

2009-12-24 Thread Fuad Efendi
Grant, Eric, Walter, and SOLR,

Thank you so much for very prompt responses (with links!)

From time to time I try to share...


Happy Holidays!!!




 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org]
 Sent: December-24-09 1:51 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SOLR Performance Tuning: Pagination
 
 Some bots will do that, too. Maybe badly written ones, but we saw that at
 Netflix. It was causing search timeouts just before a peak traffic period,
 so we set a page limit in the front end, something like 200 pages.
 
 It makes sense for that to be very slow, because a request for hit
 28838540 means that Solr has to calculate the relevance for 28838540 + 10
 documents.
 
 Fuad: Why are you benchmarking this? What user is looking at 20M
 documents?
 
 wunder
 
 On Dec 24, 2009, at 10:44 AM, Erik Hatcher wrote:
 
 
  On Dec 24, 2009, at 11:36 AM, Walter Underwood wrote:
  When do users do a query like that? --wunder
 
  Well, SolrEntityProcessor users do :)
 
   http://issues.apache.org/jira/browse/SOLR-1499
   (which by the way I plan on polishing and committing over the holidays)
 
  Erik
 
 
 
 
  On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote:
 
  I used pagination for a while till found this...
 
 
  I have filtered query ID:[* TO *] returning 20 millions results (no
  faceting), and pagination always seemed to be fast. However, fast only
 with
  low values for start=12345. Queries like start=28838540 take 40-60
 seconds,
  and even cause OutOfMemoryException.
 
  I use highlight, faceting on nontokenized Country field, standard
 handler.
 
 
  It even seems to be a bug...
 
 
  Fuad Efendi
  +1 416-993-2060
  http://www.linkedin.com/in/liferay
 
  Tokenizer Inc.
  http://www.tokenizer.ca/
  Data Mining, Vertical Search
 
 
 





RE: SOLR Performance Tuning: Pagination

2009-12-24 Thread Fuad Efendi

Not users... robots! Slurp/Yahoo, Googlebot, etc.

I had friendly URLs for queries with filters, like http://.../USA/ showing all
documents from SOLR with country=USA, with pagination; I have disabled that now.
But URLs like http://.../?q=USA are still dangerous; I need to limit
pagination programmatically.
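
A minimal sketch of that kind of programmatic limit (the class and constant
names and the ~200-page cap are made up for illustration, following the
front-end page limit mentioned elsewhere in this thread):

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    class SearchUrls {
        static final int ROWS = 10;
        static final int MAX_START = 200 * ROWS;   // never let anyone page past ~200 pages

        static String selectUrl(String q, int requestedStart)
                throws UnsupportedEncodingException {
            int start = Math.max(0, Math.min(requestedStart, MAX_START));  // clamp robot input
            return "/select?q=" + URLEncoder.encode(q, "UTF-8")
                 + "&start=" + start + "&rows=" + ROWS;
        }
    }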



 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org]
 Sent: December-24-09 11:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: SOLR Performance Tuning: Pagination
 
 When do users do a query like that? --wunder
 
 On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote:
 
  I used pagination for a while till found this...
 
 
  I have filtered query ID:[* TO *] returning 20 millions results (no
  faceting), and pagination always seemed to be fast. However, fast only
 with
  low values for start=12345. Queries like start=28838540 take 40-60
 seconds,
  and even cause OutOfMemoryException.
 
  I use highlight, faceting on nontokenized Country field, standard
 handler.
 
 
  It even seems to be a bug...
 
 
  Fuad Efendi
  +1 416-993-2060
  http://www.linkedin.com/in/liferay
 
  Tokenizer Inc.
  http://www.tokenizer.ca/
  Data Mining, Vertical Search
 





RE: SOLR Performance Tuning: Pagination

2009-12-24 Thread Fuad Efendi
Hi Walter, you are right, it was mostly robots (Googlebot, Yahoo/Slurp,
etc.);

I have friendly URLs like 
http://www.tokenizer.org/USA/?page=7 (30mlns docs, 3mlns pages)
http://www.tokenizer.org/www.newegg.com/
http://www.tokenizer.org/www.newegg.com/?sort=link&dir=asc&q=Opteron

And even this:
http://www.tokenizer.org/AMD/Opteron/8350/

I disabled processing for URLs with no query parameter (empty results); but
I should really limit pagination programmatically... fortunately
http://www.tokenizer.org/?q=USA returns 50k documents (the search doesn't use
the Country field). But some queries may return a huge number of documents
(it is better to tune the stop-word list)

-Fuad


 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org]
 Sent: December-24-09 1:51 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SOLR Performance Tuning: Pagination
 
 Some bots will do that, too. Maybe badly written ones, but we saw that at
 Netflix. It was causing search timeouts just before a peak traffic period,
 so we set a page limit in the front end, something like 200 pages.
 
 It makes sense for that to be very slow, because a request for hit
 28838540 means that Solr has to calculate the relevance for 28838540 + 10
 documents.
 
 Fuad: Why are you benchmarking this? What user is looking at 20M
 documents?
 
 wunder
 
 On Dec 24, 2009, at 10:44 AM, Erik Hatcher wrote:
 
 
  On Dec 24, 2009, at 11:36 AM, Walter Underwood wrote:
  When do users do a query like that? --wunder
 
  Well, SolrEntityProcessor users do :)
 
   http://issues.apache.org/jira/browse/SOLR-1499
   (which by the way I plan on polishing and committing over the holidays)
 
  Erik
 
 
 
 
  On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote:
 
  I used pagination for a while till found this...
 
 
  I have filtered query ID:[* TO *] returning 20 millions results (no
  faceting), and pagination always seemed to be fast. However, fast only
 with
  low values for start=12345. Queries like start=28838540 take 40-60
 seconds,
  and even cause OutOfMemoryException.
 
  I use highlight, faceting on nontokenized Country field, standard
 handler.
 
 
  It even seems to be a bug...
 
 
  Fuad Efendi
  +1 416-993-2060
  http://www.linkedin.com/in/liferay
 
  Tokenizer Inc.
  http://www.tokenizer.ca/
  Data Mining, Vertical Search
 
 
 





RE: SOLR Performance Tuning: Disable INFO Logging.

2009-12-21 Thread Fuad Efendi
 Can you quickly explain what you did to disable INFO-Level?
 
 I am from a PHP background and am not so well versed in Tomcat or
 Java.  Is this a section in solrconfig.xml or did you have to edit
 Solr Java source and recompile?


1. Create a file called logging.properties with the following content (I created it
in the /home/tomcat/solr folder):

.level=INFO
handlers= java.util.logging.ConsoleHandler, java.util.logging.FileHandler

java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.FileHandler.level = INFO

java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.ConsoleHandler.level = ALL

org.apache.solr.level=SEVERE


2. Modify tomcat_installation/bin/catalina.sh to include the following (as the
first line of the script):

JAVA_OPTS="... ... ...
-Djava.util.logging.config.file=/home/tomcat/solr/logging.properties"

(this line may include more parameters, such as -Xmx8196m for memory, or
-Dfile.encoding=UTF8 -Dsolr.solr.home=/home/tomcat/solr
-Dsolr.data.dir=/home/tomcat/solr for SOLR, etc.)


With these settings, SOLR (and Tomcat) will use the standard Java 5/6 logging
capabilities. Log output will default to the standard /logs folder of Tomcat.

You may find additional logging configuration settings by googling for Java 5
logging, etc.
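
For completeness, a hypothetical programmatic equivalent of the
org.apache.solr.level=SEVERE line above, using plain java.util.logging (the
class name SolrLogSetup is made up for illustration):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class SolrLogSetup {
        // keep a strong reference: JUL may garbage-collect loggers that are only
        // configured and never referenced, silently losing the level setting
        private static final Logger SOLR_LOGGER = Logger.getLogger("org.apache.solr");

        public static void quietSolr() {
            SOLR_LOGGER.setLevel(Level.SEVERE);   // same effect as the properties entry
        }
    }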


 
 
 2009/12/20 Fuad Efendi f...@efendi.ca:
  After researching how to configure default SOLR  Tomcat logging, I
 finally
  disabled INFO-level for SOLR.
 
  And performance improved at least 7 times!!! ('at least 7' because I
  restarted server 5 minutes ago; caches are not prepopulated yet)
 
  Before that, I had 300-600 ms in HTTPD log files in average, and 4%-8%
 I/O
  wait whenever top commands shows SOLR on top.
 
  Now, I have 50ms-100ms in average (total response time logged by HTTPD).
 
 
  P.S.
  Of course, I am limited in RAM, and I use slow SATA... server is
 moderately
  loaded, 5-10 requests per second.
 
 
  P.P.S.
  And suddenly synchronous I/O by Java/Tomcat Logger slows down
 performance
  much higher than read-only I/O of Lucene.
 
 
 
  Fuad Efendi
  +1 416-993-2060
  http://www.linkedin.com/in/liferay
 
  Tokenizer Inc.
  http://www.tokenizer.ca/
  Data Mining, Vertical Search
 
 
 
 
 


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search





SOLR Performance Tuning: Disable INFO Logging.

2009-12-20 Thread Fuad Efendi
After researching how to configure the default SOLR & Tomcat logging, I finally
disabled the INFO level for SOLR.

And performance improved at least 7 times!!! ('at least 7' because I
restarted the server 5 minutes ago; the caches are not prepopulated yet)

Before that, I had 300-600 ms in the HTTPD log files on average, and 4%-8% I/O
wait whenever the top command showed SOLR on top.

Now, I have 50ms-100ms on average (total response time logged by HTTPD).


P.S.
Of course, I am limited in RAM, and I use slow SATA... the server is moderately
loaded, 5-10 requests per second.


P.P.S.
Surprisingly, synchronous I/O by the Java/Tomcat logger slows down performance
much more than the read-only I/O of Lucene.



Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search






RE: SOLR Performance Tuning: Disable INFO Logging.

2009-12-20 Thread Fuad Efendi
We were talking about GC options a lot; don't forget to enclose the following
in if (log.isInfoEnabled()):

...
final NamedList<Object> responseHeader = new SimpleOrderedMap<Object>();
rsp.add("responseHeader", responseHeader);
NamedList toLog = rsp.getToLog();
//toLog.add("core", getName());
toLog.add("webapp", req.getContext().get("webapp"));
toLog.add("path", req.getContext().get("path"));
toLog.add("params", "{" + req.getParamString() + "}");
handler.handleRequest(req,rsp);
setResponseHeaderValues(handler,req,rsp);
StringBuilder sb = new StringBuilder();
for (int i=0; i<toLog.size(); i++) {
  String name = toLog.getName(i);
  Object val = toLog.getVal(i);
  sb.append(name).append("=").append(val).append(" ");
}
log.info(logid + sb.toString());
...
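
A sketch of what that guard looks like around the string-building part
(reusing log, toLog and logid from the snippet above, and assuming the logger
exposes isInfoEnabled(), as the log.info(...) call suggests):

    if (log.isInfoEnabled()) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < toLog.size(); i++) {
            sb.append(toLog.getName(i)).append("=").append(toLog.getVal(i)).append(" ");
        }
        // the loop and string concatenation are skipped entirely at SEVERE level
        log.info(logid + sb.toString());
    }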


-Fuad


 -Original Message-
 From: Fuad Efendi [mailto:f...@efendi.ca]
 Sent: December-20-09 2:54 PM
 To: solr-user@lucene.apache.org
 Subject: SOLR Performance Tuning: Disable INFO Logging.
 
 After researching how to configure default SOLR  Tomcat logging, I
 finally
 disabled INFO-level for SOLR.
 
 And performance improved at least 7 times!!! ('at least 7' because I
 restarted server 5 minutes ago; caches are not prepopulated yet)
 
 Before that, I had 300-600 ms in HTTPD log files in average, and 4%-8% I/O
 wait whenever top commands shows SOLR on top.
 
 Now, I have 50ms-100ms in average (total response time logged by HTTPD).
 
 
 P.S.
 Of course, I am limited in RAM, and I use slow SATA... server is
 moderately
 loaded, 5-10 requests per second.
 
 
 P.P.S.
 And suddenly synchronous I/O by Java/Tomcat Logger slows down performance
 much higher than read-only I/O of Lucene.
 
 
 
 Fuad Efendi
 +1 416-993-2060
 http://www.linkedin.com/in/liferay
 
 Tokenizer Inc.
 http://www.tokenizer.ca/
 Data Mining, Vertical Search
 
 
 





RE: solr stops running periodically

2009-11-16 Thread Fuad Efendi
 By that I mean that the java/tomcat  
 process just disappears. 


I had a similar problem when I started Tomcat via SSH and then improperly
closed the SSH session without an exit command.

In some cases (OutOfMemory) there is not enough memory to generate a log (or the CPU can
be so overloaded by the Garbage Collector that you would have to wait
a few days until the log is generated) - but the process can't disappear...

A process can't simply disappear... if it is a JVM crash you should see a dump
file (you may need to set a specific JVM option to generate a dump file in
case of a crash)
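
For example, on HotSpot JVMs the following standard options (added to JAVA_OPTS
in the startup script; the paths are just examples) make OOMs and crashes leave
evidence behind:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/tomcat/dumps
-XX:ErrorFile=/home/tomcat/dumps/hs_err_pid%p.log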





 -Original Message-
 From: athir nuaimi [mailto:at...@nuaim.com]
 Sent: November-15-09 1:46 PM
 To: solr-user@lucene.apache.org
 Subject: solr stops running periodically
 
  We have 4 machines running solr.  On one of the machines, every 2-3
  days solr stops running.  By that I mean that the java/tomcat
  process just disappears.  If I look at the catalina logs, I see
  normal log entries and then nothing.  There is no shutdown messages
  like you would normally see if you sent a SIGTERM to the process.
 
  Obviously this is a problem. I''m new to solr/java so if there are
  more diagnostic things I can do I'd appreciate any tips/advice.
 
  thanks in advance
  Athir





RE: Lucene FieldCache memory requirements

2009-11-03 Thread Fuad Efendi
Sorry Mike, Mark, I am confused again...

Yes, I need some more memory for processing (while the FieldCache is being
loaded), obviously, but that was not the main subject...

With StringIndexCache, I have 10 arrays (the cardinality of this field is 10)
storing (int) Lucene Document IDs.

 Except: as Mark said, you'll also need transient memory = pointer (4
 or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded.

Ok, I see it:
  final int[] retArray = new int[reader.maxDoc()];
  String[] mterms = new String[reader.maxDoc()+1];

I can't trace it right now (limited in time); I think mterms is a local variable
and will size down to 0...



So the correct formula is... a weird one... if you don't want unexpected OOM
or an overloaded GC (WeakHashMaps...):

  [some heap] + [Non-Tokenized_Field_Count] x [maxdoc] x [4 bytes + 8
bytes]

(for 64-bit)


-Fuad


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: November-03-09 5:00 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
 On Mon, Nov 2, 2009 at 9:27 PM, Fuad Efendi f...@efendi.ca wrote:
  I believe this is correct estimate:
 
  C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
 
    same as
  [String1_Document_Count + ... + String10_Document_Count + ...]
  x [4 bytes per DocumentID]
 
 That's right.
 
 Except: as Mark said, you'll also need transient memory = pointer (4
 or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded.  After
 it's done being loaded, this sizes down to the number of unique terms.
 
 But, if Lucene did the basic int packing, which really we should do,
 since you only have 10 unique values, with a naive 4 bits per doc
 encoding, you'd only need 1/8th the memory usage.  We could do a bit
 better by encoding more than one document at a time...
 
 Mike




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Any thoughts regarding the subject? I hope FieldCache doesn't use more than
6 bytes per document-field instance... I am too lazy to research Lucene
source code, I hope someone can provide exact answer... Thanks


 Subject: Lucene FieldCache memory requirements
 
 Hi,
 
 
 Can anyone confirm Lucene FieldCache memory requirements? I have 100
 millions docs with non-tokenized field country (10 different countries);
I
 expect it requires array of (int, long), size of array 100,000,000,
 without any impact of country field length;
 
 it requires 600,000,000 bytes: int is pointer to document (Lucene
document
 ID),  and long is pointer to String value...
 
 Am I right, is it 600Mb just for this country (indexed, non-tokenized,
 non-boolean) field and 100 millions docs? I need to calculate exact
minimum RAM
 requirements...
 
 I believe it shouldn't depend on cardinality (distribution) of field...
 
 Thanks,
 Fuad
 
 
 
 





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I am not using the Lucene API directly; I am using SOLR, which uses the Lucene
FieldCache for faceting on non-tokenized fields...
I think this cache will be lazily loaded, until a user executes a SOLR query for all
documents *:* sorted by this field - in that case it will be fully
populated...


 Subject: Re: Lucene FieldCache memory requirements
 
 Which FieldCache API are you using?  getStrings?  or getStringIndex
 (which is used, under the hood, if you sort by this field).
 
 Mike
 
 On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
  Any thoughts regarding the subject? I hope FieldCache doesn't use more
than
  6 bytes per document-field instance... I am too lazy to research Lucene
  source code, I hope someone can provide exact answer... Thanks
 
 
  Subject: Lucene FieldCache memory requirements
 
  Hi,
 
 
  Can anyone confirm Lucene FieldCache memory requirements? I have 100
  millions docs with non-tokenized field country (10 different
countries);
  I
  expect it requires array of (int, long), size of array 100,000,000,
  without any impact of country field length;
 
  it requires 600,000,000 bytes: int is pointer to document (Lucene
  document
  ID),  and long is pointer to String value...
 
  Am I right, is it 600Mb just for this country (indexed,
non-tokenized,
  non-boolean) field and 100 millions docs? I need to calculate exact
  minimum RAM
  requirements...
 
  I believe it shouldn't depend on cardinality (distribution) of field...
 
  Thanks,
  Fuad
 
 
 
 
 
 
 
 




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Thank you very much Mike,

I found it:
org.apache.solr.request.SimpleFacets
...
// TODO: future logic could use filters instead of the fieldcache if
// the number of terms in the field is small enough.
counts = getFieldCacheCounts(searcher, base, field, offset,limit,
mincount, missing, sort, prefix);
...
FieldCache.StringIndex si =
FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
final String[] terms = si.lookup;
final int[] termNum = si.order;
...
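
A rough sketch of the counting idea behind that excerpt (not the actual
getFieldCacheCounts code; it reuses terms/termNum from above and assumes base
is the Solr DocSet of matching documents):

    int[] counts = new int[terms.length];
    for (org.apache.solr.search.DocIterator it = base.iterator(); it.hasNext(); ) {
        counts[termNum[it.nextDoc()]]++;     // termNum[docId] is an index into terms[]
    }
    // counts[i] is then the facet count for the value terms[i]
    // (index 0 corresponds to documents with no value in the field)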


So that 64-bit requires more memory :)


Mike, am I right here?
[(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
(64-bit JVM)
1.2Gb RAM for this...

Or maybe I am wrong:
 For Lucene directly, simple strings would consume an pointer (4 or 8
 bytes depending on whether your JRE is 64bit) per doc, and the string
 index would consume an int (4 bytes) per doc.

[8 bytes (64bit)] x [number of documents (100mlns)]?
0.8Gb

A kind of Map between String and DocSet, saving 4 bytes... the key is a String,
and the value is an array of 64-bit pointers to Documents. Why 64-bit (on a 64-bit
JVM)? I always thought it was the (int) documentId...

Am I right?


Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!

 Note that for your use case, this is exceptionally wasteful.  
This is probably a very common case... I think it should be confirmed by
the Lucene developers too... the FieldCache is warmed anyway, even when we don't use
SOLR...

 
-Fuad







 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: November-02-09 6:00 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
 OK I think someone who knows how Solr uses the fieldCache for this
 type of field will have to pipe up.
 
 For Lucene directly, simple strings would consume an pointer (4 or 8
 bytes depending on whether your JRE is 64bit) per doc, and the string
 index would consume an int (4 bytes) per doc.  (Each also consume
 negligible (for your case) memory to hold the actual string values).
 
 Note that for your use case, this is exceptionally wasteful.  If
 Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
 then it'd take much fewer bits to reference the values, since you have
 only 10 unique string values.
 
 Mike
 
 On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote:
  I am not using Lucene API directly; I am using SOLR which uses Lucene
  FieldCache for faceting on non-tokenized fields...
  I think this cache will be lazily loaded, until user executes sorted (by
  this field) SOLR query for all documents *:* - in this case it will be
fully
  populated...
 
 
  Subject: Re: Lucene FieldCache memory requirements
 
  Which FieldCache API are you using?  getStrings?  or getStringIndex
  (which is used, under the hood, if you sort by this field).
 
  Mike
 
  On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
   Any thoughts regarding the subject? I hope FieldCache doesn't use
more
  than
   6 bytes per document-field instance... I am too lazy to research
Lucene
   source code, I hope someone can provide exact answer... Thanks
  
  
   Subject: Lucene FieldCache memory requirements
  
   Hi,
  
  
   Can anyone confirm Lucene FieldCache memory requirements? I have 100
   millions docs with non-tokenized field country (10 different
  countries);
   I
   expect it requires array of (int, long), size of array
100,000,000,
   without any impact of country field length;
  
   it requires 600,000,000 bytes: int is pointer to document (Lucene
   document
   ID),  and long is pointer to String value...
  
   Am I right, is it 600Mb just for this country (indexed,
  non-tokenized,
   non-boolean) field and 100 millions docs? I need to calculate exact
   minimum RAM
   requirements...
  
   I believe it shouldn't depend on cardinality (distribution) of
field...
  
   Thanks,
   Fuad
  
  
  
  
  
  
  
  
 
 
 




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
difference between maxdoc and maxdoc + 1 for such estimate... difference is
between 0.4Gb and 1.2Gb...


So, let's vote ;)

A. [maxdoc] x [8 bytes ~ pointer to String object]

B. [maxdoc] x [8 bytes ~ pointer to Document object]

C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] 
- same as [String1_Document_Count + ... + String10_Document_Count] x [4
bytes ~ DocumentID]

D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]


Please confirm whether it is a pointer to an Object or a Lucene Document ID... I
hope it is the (int) Document ID...





 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: November-02-09 6:52 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
 It also briefly requires more memory than just that - it allocates an
 array the size of maxdoc+1 to hold the unique terms - and then sizes down.
 
  Possibly we can use the getUniqueTermCount method in the flexible
 indexing branch to get rid of that - which is why I was thinking it
 might be a good idea to drop the unsupported exception in that method
 for things like multi reader and just do the work to get the right
 number (currently there is a comment that the user should do that work
 if necessary, making the call unreliable for this).
 
 Fuad Efendi wrote:
  Thank you very much Mike,
 
  I found it:
  org.apache.solr.request.SimpleFacets
  ...
  // TODO: future logic could use filters instead of the
fieldcache if
  // the number of terms in the field is small enough.
  counts = getFieldCacheCounts(searcher, base, field,
offset,limit,
  mincount, missing, sort, prefix);
  ...
  FieldCache.StringIndex si =
  FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
  final String[] terms = si.lookup;
  final int[] termNum = si.order;
  ...
 
 
  So that 64-bit requires more memory :)
 
 
  Mike, am I right here?
  [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
  (64-bit JVM)
  1.2Gb RAM for this...
 
  Or, may be I am wrong:
 
  For Lucene directly, simple strings would consume an pointer (4 or 8
  bytes depending on whether your JRE is 64bit) per doc, and the string
  index would consume an int (4 bytes) per doc.
 
 
  [8 bytes (64bit)] x [number of documents (100mlns)]?
  0.8Gb
 
  Kind of Map between String and DocSet, saving 4 bytes... Key is
String,
  and Value is array of 64-bit pointers to Document. Why 64-bit (for
64-bit
  JVM)? I always thought it is (int) documentId...
 
  Am I right?
 
 
  Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
 
 
  Note that for your use case, this is exceptionally wasteful.
 
  This is probably very common case... I think it should be confirmed by
  Lucene developers too... FieldCache is warmed anyway, even when we don't
use
  SOLR...
 
 
  -Fuad
 
 
 
 
 
 
 
 
  -Original Message-
  From: Michael McCandless [mailto:luc...@mikemccandless.com]
  Sent: November-02-09 6:00 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Lucene FieldCache memory requirements
 
  OK I think someone who knows how Solr uses the fieldCache for this
  type of field will have to pipe up.
 
  For Lucene directly, simple strings would consume an pointer (4 or 8
  bytes depending on whether your JRE is 64bit) per doc, and the string
  index would consume an int (4 bytes) per doc.  (Each also consume
  negligible (for your case) memory to hold the actual string values).
 
  Note that for your use case, this is exceptionally wasteful.  If
  Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
  then it'd take much fewer bits to reference the values, since you have
  only 10 unique string values.
 
  Mike
 
  On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote:
 
  I am not using Lucene API directly; I am using SOLR which uses Lucene
  FieldCache for faceting on non-tokenized fields...
  I think this cache will be lazily loaded, until user executes sorted
(by
  this field) SOLR query for all documents *:* - in this case it will be
 
  fully
 
  populated...
 
 
 
  Subject: Re: Lucene FieldCache memory requirements
 
  Which FieldCache API are you using?  getStrings?  or getStringIndex
  (which is used, under the hood, if you sort by this field).
 
  Mike
 
  On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
 
  Any thoughts regarding the subject? I hope FieldCache doesn't use
 
  more
 
  than
 
  6 bytes per document-field instance... I am too lazy to research
 
  Lucene
 
  source code, I hope someone can provide exact answer... Thanks
 
 
 
  Subject: Lucene FieldCache memory requirements
 
  Hi,
 
 
  Can anyone confirm Lucene FieldCache memory requirements? I have
100
  millions docs with non-tokenized field country (10 different
 
  countries);
 
  I
 
  expect it requires array of (int, long), size of array
 
  100,000,000

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I just did some tests in a completely new index (Slave), sort by
low-distributed non-tokenized Field (such as Country) takes milliseconds,
but sort (ascending) on tokenized field with heavy distribution took 30
seconds (initially). Second sort (descending) took milliseconds. Generic
query *.*; FieldCache is not used for tokenized fields... how it is sorted
:)
Fortunately, no any OOM.
-Fuad




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Mark,

I don't understand this: 
 so with a ton of docs and a few uniques, you get a temp boost in the RAM
 reqs until it sizes it down.

Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it is
not cache?


And this:
 A pointer for each doc.

Why can't we use (int) DocumentID? For me, it is natural; 64-bit pointer to
an object in RAM is not natural (in Lucene world)...


So, is it [maxdoc]x[4-bytes], or [maxdoc]x[8-bytes]?... 
-Fuad







RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I believe this is correct estimate:

 C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]

   same as 
 [String1_Document_Count + ... + String10_Document_Count + ...] 
 x [4 bytes per DocumentID]


So, for 100 million docs we need 400Mb for each(!) non-tokenized field.
Although FieldCacheImpl is based on a WeakHashMap (somewhere...), we can't
rely on it sizing down with the SOLR faceting features


I think I finally found the answer...

  /** Expert: Stores term text values and document ordering data. */
  public static class StringIndex {
...   
/** All the term values, in natural order. */
public final String[] lookup;

/** For each document, an index into the lookup array. */
public final int[] order;
...
  }
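
In other words (a small illustrative fragment, assuming an open IndexReader
named reader and a Lucene docId):

    FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(reader, "country");
    String country = si.lookup[si.order[docId]];   // order[docId] is an index into lookup[]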



Another API:
  /** Checks the internal cache for an appropriate entry, and if none
   * is found, reads the term values in <code>field</code> and returns an array
   * of size <code>reader.maxDoc()</code> containing the value each document
   * has in the given field.
   * @param reader  Used to get field values.
   * @param field   Which field contains the strings.
   * @return The values in the given field for each document.
   * @throws IOException  If any error occurs.
   */
  public String[] getStrings (IndexReader reader, String field)
  throws IOException;


Looks similar; the cache size is [maxdoc]; however, the values stored are 8-byte
pointers on a 64-bit JVM.


  private Map<Class<?>,Cache> caches;

  private synchronized void init() {
    caches = new HashMap<Class<?>,Cache>(7);
    ...
    caches.put(String.class, new StringCache(this));
    caches.put(StringIndex.class, new StringIndexCache(this));
    ...
  }


StringCache and StringIndexCache use a WeakHashMap internally... but the objects
will never be garbage collected in a faceted production system...

SOLR SimpleFacets doesn't use the getStrings API, so the hope is that memory
requirements are minimized.


However, Lucene may use it internally for some queries (or, for instance, to
get access to a nontokenized cached field without reading index)... to be
safe, use this in your basic memory estimates:


[512Mb ~ 1Gb] + [non_tokenized_fields_count] x [maxdoc] x [8 bytes]
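
A trivial sketch of that estimate as code (the method name is made up;
bytesPerEntry is 4 if you count only the int order[] array, 8 if you budget for
pointer-sized entries as above):

    static long fieldCacheBytes(long maxDoc, int nonTokenizedFields, int bytesPerEntry) {
        return maxDoc * nonTokenizedFields * bytesPerEntry;
    }
    // e.g. fieldCacheBytes(100000000L, 1, 8) = 800,000,000 bytes (~0.8Gb)
    // for the single country field, on top of the base heap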


-Fuad



 -Original Message-
 From: Fuad Efendi [mailto:f...@efendi.ca]
 Sent: November-02-09 7:37 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Lucene FieldCache memory requirements
 
 
 Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
 difference between maxdoc and maxdoc + 1 for such estimate... difference
is
 between 0.4Gb and 1.2Gb...
 
 
 So, let's vote ;)
 
 A. [maxdoc] x [8 bytes ~ pointer to String object]
 
 B. [maxdoc] x [8 bytes ~ pointer to Document object]
 
 C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
 - same as [String1_Document_Count + ... + String10_Document_Count] x [4
 bytes ~ DocumentID]
 
 D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]
 
 
 Please confirm that it is Pointer to Object and not Lucene Document ID...
I
 hope it is (int) Document ID...
 
 
 
 
 
  -Original Message-
  From: Mark Miller [mailto:markrmil...@gmail.com]
  Sent: November-02-09 6:52 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Lucene FieldCache memory requirements
 
  It also briefly requires more memory than just that - it allocates an
  array the size of maxdoc+1 to hold the unique terms - and then sizes
down.
 
  Possibly we can use the getUnuiqeTermCount method in the flexible
  indexing branch to get rid of that - which is why I was thinking it
  might be a good idea to drop the unsupported exception in that method
  for things like multi reader and just do the work to get the right
  number (currently there is a comment that the user should do that work
  if necessary, making the call unreliable for this).
 
  Fuad Efendi wrote:
   Thank you very much Mike,
  
   I found it:
   org.apache.solr.request.SimpleFacets
   ...
   // TODO: future logic could use filters instead of the
 fieldcache if
   // the number of terms in the field is small enough.
   counts = getFieldCacheCounts(searcher, base, field,
 offset,limit,
   mincount, missing, sort, prefix);
   ...
   FieldCache.StringIndex si =
   FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
   final String[] terms = si.lookup;
   final int[] termNum = si.order;
   ...
  
  
   So that 64-bit requires more memory :)
  
  
   Mike, am I right here?
   [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents
(100mlns)]
   (64-bit JVM)
   1.2Gb RAM for this...
  
   Or, may be I am wrong:
  
   For Lucene directly, simple strings would consume an pointer (4 or 8
   bytes depending on whether your JRE is 64bit) per doc, and the string
   index would consume an int (4 bytes) per doc.
  
  
   [8 bytes (64bit)] x [number of documents (100mlns)]?
   0.8Gb
  
   Kind of Map between String and DocSet, saving 4 bytes... Key is
 String,
   and Value is array of 64-bit pointers to Document. Why 64-bit (for
 64

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Hi Mark,

Yes, I understand it now; however, how will StringIndexCache size down in a
production system faceting by Country on a homepage? This is SOLR
specific...


Lucene specific: Lucene doesn't read from disk if it can retrieve a field
value for a specific document ID from the cache. How will it size down in a purely
Lucene-based, heavily loaded production system? Especially if this cache is
used for query optimizations.



 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: November-02-09 8:53 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
  static final class StringIndexCache extends Cache {
    StringIndexCache(FieldCache wrapper) {
      super(wrapper);
    }

    @Override
    protected Object createValue(IndexReader reader, Entry entryKey)
        throws IOException {
      String field = StringHelper.intern(entryKey.field);
      final int[] retArray = new int[reader.maxDoc()];
      String[] mterms = new String[reader.maxDoc()+1];
      TermDocs termDocs = reader.termDocs();
      TermEnum termEnum = reader.terms (new Term (field));
      int t = 0;  // current term number

      // an entry for documents that have no terms in this field
      // should a document with no terms be at top or bottom?
      // this puts them at the top - if it is changed, FieldDocSortedHitQueue
      // needs to change as well.
      mterms[t++] = null;

      try {
        do {
          Term term = termEnum.term();
          if (term==null || term.field() != field) break;

          // store term text
          // we expect that there is at most one term per document
          if (t >= mterms.length) throw new RuntimeException ("there are more terms than " +
                  "documents in field \"" + field + "\", but it's impossible to sort on " +
                  "tokenized fields");
          mterms[t] = term.text();

          termDocs.seek (termEnum);
          while (termDocs.next()) {
            retArray[termDocs.doc()] = t;
          }

          t++;
        } while (termEnum.next());
      } finally {
        termDocs.close();
        termEnum.close();
      }

      if (t == 0) {
        // if there are no terms, make the term array
        // have a single null entry
        mterms = new String[1];
      } else if (t < mterms.length) {
        // if there are less terms than documents,
        // trim off the dead array space
        String[] terms = new String[t];
        System.arraycopy (mterms, 0, terms, 0, t);
        mterms = terms;
      }

      StringIndex value = new StringIndex (retArray, mterms);
      return value;
    }
  };
 
 The formula for a String Index fieldcache is essentially the String
 array of unique terms (which does indeed size down at the bottom) and
 the int array indexing into the String array.
 
 
 Fuad Efendi wrote:
  To be correct, I analyzed FieldCache awhile ago and I believed it never
  sizes down...
 
  /**
   * Expert: The default cache implementation, storing all values in
memory.
   * A WeakHashMap is used for storage.
   *
    * <p>Created: May 19, 2004 4:40:36 PM
   *
   * @since   lucene 1.4
   */
 
 
  Will it size down? Only if we are not faceting (as in SOLR v.1.3)...
 
  And I am still unsure, Document ID vs. Object Pointer.
 
 
 
 
 
  I don't understand this:
 
  so with a ton of docs and a few uniques, you get a temp boost in the
RAM
  reqs until it sizes it down.
 
  Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it
is
  not cache?
 
 
 
 
 
 
 
 --
 - Mark
 
 http://www.lucidimagination.com
 
 





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Even in the simplistic scenario where it does get garbage collected, we still
_need_to_be_able_ to allocate enough RAM for the FieldCache on demand... a linear
dependency on the document count...


 
 Hi Mark,
 
 Yes, I understand it now; however, how will StringIndexCache size down in
a
 production system faceting by Country on a homepage? This is SOLR
 specific...
 
 
 Lucene specific: Lucene doesn't read from disk if it can retrieve field
 value for a specific document ID from cache. How will it size down in
purely
 Lucene-based heavy-loaded production system? Especially if this cache is
 used for query optimizations.
 




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
The FieldCache internally uses a WeakHashMap... nothing wrong with that, but... no
Garbage Collection tuning will help if the allocated RAM is not enough once the
Weak** references effectively become Strong**, especially with SOLR faceting... 10%-15%
of CPU taken by GC has been reported...
-Fuad





Lucene FieldCache memory requirements

2009-10-30 Thread Fuad Efendi
Hi,


Can anyone confirm the Lucene FieldCache memory requirements? I have 100
million docs with a non-tokenized field country (10 different countries); I
expect it requires an array of (int, long) pairs, with an array size of 100,000,000,
without any impact of the country field length;

it requires 600,000,000 bytes: the int is a pointer to the document (Lucene document
ID), and the long is a pointer to the String value...

Am I right, is it 600Mb just for this country (indexed, non-tokenized,
non-boolean) field and 100 million docs? I need to calculate exact minimum RAM
requirements...

I believe it shouldn't depend on cardinality (distribution) of the field...

Thanks,
Fuad







RE: Too many open files

2009-10-24 Thread Fuad Efendi

I had an extremely specific use case: about a 5000 documents-per-second (small
documents) update rate, where some documents can be repeatedly sent to SOLR with
a different timestamp field (and the same unique document ID). Nothing breaks,
just a great performance gain which was impossible with the 32Mb buffer (- it
caused constant index merges, 5 times more CPU than index updates). Nothing
breaks... with indexMerge=10 I don't have ANY merge during 24 hours;
segments are large (a few of 4Gb-8Gb, and one large union); I merge
explicitly only, at night, when I issue a commit.


Of course, it depends on the use case; for applications such as a Content
Management System we don't need a high ramBufferSizeMB (few updates a day
sent to SOLR)...



 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: October-23-09 5:28 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Too many open files
 
 8 GB is much larger than is well supported. Its diminishing returns over
 40-100 and mostly a waste of RAM. Too high and things can break. It
 should be well below 2 GB at most, but I'd still recommend 40-100.
 
 Fuad Efendi wrote:
  Reason of having big RAM buffer is lowering frequency of IndexWriter
flushes
  and (subsequently) lowering frequency of index merge events, and
  (subsequently) merging of a few larger files takes less time...
especially
  if RAM Buffer is intelligent enough (and big enough) to deal with 100
  concurrent updates of existing document without 100-times flushing to
disk
  of 100 document versions.
 
  I posted here thread related; I had 1:5 timing for Update:Merge (5
minutes
  merge, and 1 minute update) with default SOLR settings (32Mb buffer). I
  increased buffer to 8Gb on Master, and it triggered significant indexing
  performance boost...
 
  -Fuad
  http://www.linkedin.com/in/liferay
 
 
 
  -Original Message-
  From: Mark Miller [mailto:markrmil...@gmail.com]
  Sent: October-23-09 3:03 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Too many open files
 
  I wouldn't use a RAM buffer of a gig - 32-100 is generally a good
number.
 
  Fuad Efendi wrote:
 
  I was partially wrong; this is what Mike McCandless
(Lucene-in-Action,
 
  2nd
 
  edition) explained at Manning forum:
 
  mergeFactor of 1000 means you will have up to 1000 segments at each
 
  level.
 
  A level 0 segment means it was flushed directly by IndexWriter.
  After you have 1000 such segments, they are merged into a single level
1
  segment.
  Once you have 1000 level 1 segments, they are merged into a single
level
 
  2
 
  segment, etc.
  So, depending on how many docs you add to your index, you'll could
have
  1000s of segments w/ mergeFactor=1000.
 
  http://www.manning-sandbox.com/thread.jspa?threadID=33784tstart=0
 
 
  So, in case of mergeFactor=100 you may have (theoretically) 1000
 
  segments,
 
  10-20 files each (depending on schema)...
 
 
  mergeFactor=10 is default setting... ramBufferSizeMB=1024 means that
you
  need at least double Java heap, but you have -Xmx1024m...
 
 
  -Fuad
 
 
 
 
  I am getting too many open files error.
 
  Usually I test on a server that has 4GB RAM and assigned 1GB for
  tomcat(set JAVA_OPTS=-Xms256m -Xmx1024m), ulimit -n is 256 for this
  server and has following setting for SolrConfig.xml
 
 
 
   <useCompoundFile>true</useCompoundFile>

   <ramBufferSizeMB>1024</ramBufferSizeMB>

   <mergeFactor>100</mergeFactor>

   <maxMergeDocs>2147483647</maxMergeDocs>

   <maxFieldLength>1</maxFieldLength>
 
 
 
 
 
  --
  - Mark
 
  http://www.lucidimagination.com
 
 
 
 
 
 
 
 
 
 --
 - Mark
 
 http://www.lucidimagination.com
 
 





RE: Too many open files

2009-10-24 Thread Fuad Efendi
Thanks for pointing to it, but it is fairly obvious:

1. The buffer is used as RAM storage for index updates
2. An int can address about 2 billion (2^31) non-negative values
3. We can have _up_to_ 2G of _Documents_ (stored as key-value pairs,
inverted index)

In the case of the 5 fields which I have, I need 5 arrays (up to 2Gb in size
each) to store inverted pointers, so there is no theoretical limit:

 Also, from the javadoc in IndexWriter:
 
* <p><b>NOTE</b>: because IndexWriter uses
* <code>int</code>s when managing its internal storage,
* the absolute maximum value for this setting is somewhat
* less than 2048 MB.  The precise limit depends on
* various factors, such as how large your documents are,
* how many fields have norms, etc., so it's best to set
* this value comfortably under 2048.</p>



Note also, I use norms etc...
 



