Schema design for parent child field

2013-06-29 Thread Sperrink
Good day,
I'm seeking some guidance on how best to represent the following data within
a solr schema.
I have a list of subjects which are detailed to n levels.
Each document can contain many of these subject entities.
As I see it if this had been just 1 subject per document, dynamic fields
would have been a good resolution.
Any suggestions on how best to create this structure in a denormalised
fashion while maintaining the data integrity.
For example a document could have:
Subject level 1: contract
Subject level 2: claims
Subject level 1: patent
Subject level 2: counter claims

If I were to search for level 1 contract, I would only want the facet count
for level 2 to contain claims and not counter claims.

Any assistance in this would be much appreciated.






increase search score of certain category only for certain keyword

2013-06-29 Thread winsu
Hi,

Currently I have some sample data:
name : summer boot
category : boot shoe

name : snow boot
category : boot shoe

name : boot pant
category : pants

name : modern boot pant
category : pants

name : modern bootcut
category : pants


If the search keyword is "boot", how can I make items whose category is
"boot shoe" rank higher than those whose category is "pants"?

Can we configure Solr so that, for certain keywords, the "boot shoe"
category is ranked higher than other categories?
Thx :)







Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-06-29 Thread Ahmet Arslan
Hi Mike,


You could try http://wiki.apache.org/solr/UpdateCSV 

And make sure you commit at the very end.
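
For example, a hypothetical sketch (the core URL, field names, and file
path are placeholders; %09 is the tab separator for a TSV file):

  curl "http://localhost:8983/solr/update/csv?separator=%09&fieldnames=field1,field2,field3" \
       --data-binary @file.tsv -H 'Content-type: text/plain; charset=utf-8'
  # single commit at the very end
  curl "http://localhost:8983/solr/update?commit=true"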





 From: Mike L. javaone...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org 
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
 

 
I've been working on improving index time with a JdbcDataSource DIH based 
config and found it not to be as performant as I'd hoped for, for various 
reasons, not specifically due to solr. With that said, I decided to switch 
gears a bit and test out a FileDataSource setup... I assumed that by eliminating 
network latency, I should see drastic improvements in terms of import time... but 
I'm a bit surprised that this process seems to run much slower, at least the 
way I've initially coded it. (below)
 
The below is a barebones file import that I wrote which consumes a tab-delimited 
file. Nothing fancy here. The regex just separates out the fields... Is there a 
faster approach to doing this? If so, what is it?
 
Also, what is the recommended approach in terms of indexing/importing data? I 
know that may come across as a vague question, as there are various options 
available, but which one would be considered the standard approach within a 
production enterprise environment?
 
 
(below has been cleansed)
 
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
    </entity>
  </document>
</dataConfig>
 
Thanks in advance,
Mike

Http status 503 Error in solr cloud setup

2013-06-29 Thread Sagar Chaturvedi
Hi,

I set up 2 Solr instances on 2 different machines and also configured 2 zookeeper 
servers on those machines. When I start Solr on both machines and try to 
access the Solr web-admin, I get the following error in the browser -
Http status 503 - server is shutting down

When I setup a single standalone solr without zookeeper, I do not get this 
error.

Any insights ?

Thanks and Regards,
Sagar Chaturvedi
Member Of Technical Staff
NEC Technologies India, Noida






Re: Schema design for parent child field

2013-06-29 Thread Jack Krupansky
Both dynamic fields and multivalued fields are powerful Solr features that 
can be used to great effect, but only if used in moderation - a relatively 
small number of discrete values (e.g., a few dozen strings.) Anything 
more complex and you are asking for trouble and creating a pseudo-schema 
that will be difficult to maintain or for anybody else to comprehend.


So, the simple answer to your question: Flatten, in the most straightforward 
manner - each instance of a record type should be a discrete Solr 
document, and each record should get its own id as the Solr document key/ID. 
Solr can support multiple document types in the same collection, or you can 
store each record type in a separate collection.


The simplest, cleanest structure is to store each record type in a separate 
collection and then use multiple Solr queries to emulate SQL join operations 
as needed.


But if you would prefer to mash multiple record types into the same Solr 
collection/schema, you can do that too. Make the schema be the union of the 
schemas for each record type - Solr/Lucene has no significant overhead for 
fields which do not have values present for a given document.


Each document would have a unique ID field. In addition, each document would 
have a parent field for each record type, so you can quickly search for all 
children of a given parent. You can have one common parent ID if you assign 
unique IDs to all children across all record types, but it can sometimes be 
cleaner for the child ID to reset to zero/one for each new parent. It's 
merely a question of whether you want to have a single key value or a tuple 
of key values to identify a specific child.


You can duplicate a subset of the parent fields in each child to simulate 
the effect of a simple join in a single clean query. But you can do a 
separate query to get parent record details.
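
For instance, a minimal sketch of the flattened form for the original
example (the field names here are illustrative, not prescriptive):

  <add>
    <doc>
      <field name="id">case-42-subject-1</field>
      <field name="parent_id">case-42</field>
      <field name="subject_level_1">contract</field>
      <field name="subject_level_2">claims</field>
    </doc>
    <doc>
      <field name="id">case-42-subject-2</field>
      <field name="parent_id">case-42</field>
      <field name="subject_level_1">patent</field>
      <field name="subject_level_2">counter claims</field>
    </doc>
  </add>

A query such as q=*:*&fq=subject_level_1:contract&facet=true&facet.field=subject_level_2 
then facets only within the matching subject records, so the level 2 counts 
contain "claims" but not "counter claims".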


-- Jack Krupansky

-Original Message- 
From: Sperrink

Sent: Saturday, June 29, 2013 5:08 AM
To: solr-user@lucene.apache.org
Subject: Schema design for parent child field

Good day,
I'm seeking some guidance on how best to represent the following data within
a solr schema.
I have a list of subjects which are detailed to n levels.
Each document can contain many of these subject entities.
As I see it if this had been just 1 subject per document, dynamic fields
would have been a good resolution.
Any suggestions on how best to create this structure in a denormalised
fashion while maintaining the data integrity.
For example a document could have:
Subject level 1: contract
Subject level 2: claims
Subject level 1: patent
Subject level 2: counter claims

If I were to search for level 1 contract, I would only want the facet count
for level 2 to contain claims and not counter claims.

Any assistance in this would be much appreciated.







Re: increase search score of certain category only for certain keyword

2013-06-29 Thread Jack Krupansky

Use the edismax query parser with a higher boost for category than name:

   qf=name category^10.0

Tune the boost as needed for your app.

Make sure name and category have both text and string variants - use 
copyField. The string variant is good for facets, the text variant is good 
for keyword search. Use the text variant in qf.
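
A sketch of what that can look like (the field and type names are
assumptions - adapt to your schema):

  <!-- schema.xml: text variant for keyword search, string variant for facets -->
  <field name="category"   type="text_general" indexed="true" stored="true"/>
  <field name="category_s" type="string"       indexed="true" stored="false"/>
  <copyField source="category" dest="category_s"/>

  ...and a query using the text variants in qf:
  http://localhost:8983/solr/select?defType=edismax&q=boot&qf=name%20category^10.0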


-- Jack Krupansky

-Original Message- 
From: winsu

Sent: Friday, June 28, 2013 9:26 PM
To: solr-user@lucene.apache.org
Subject: increase search score of certain category only for certain keyword

Hi,

Currently I have some sample data:
name : summer boot
category : boot shoe

name  : snow boot
category : boot shoe

name : boot pant
category : pants

name : modern boot pant
category : pants

name : modern bootcut
category : pants


If the search keyword is "boot", how can I make items whose category is
"boot shoe" rank higher than those whose category is "pants"?

Can we configure Solr so that, for certain keywords, the "boot shoe"
category is ranked higher than other categories?
Thx :)








Re: cores sharing an instance

2013-06-29 Thread Peyman Faratin
It's the singleton pattern: in my case I want an object (which is RAM 
expensive) to be a centralized coordinator of application logic. 

thank you

On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com 
wrote:

 There is very little shared between multiple cores (instanceDir paths,
 logging config maybe?). Why are you trying to do this?
 
 On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com 
 wrote:
 Hi
 
 I have a multicore setup (in 4.3.0). Is it possible for one core to share an 
 instance of its class with other cores at run time? i.e.
 
 At run time core 1 makes an instance of object O_i
 
 core 1 -- object O_i
 core 2
 ---
 core n
 
 then can core K access O_i? I know they can share properties but is it 
 possible to share objects?
 
 thank you
 
 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.



Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
I just double-checked my config.  We are using convertType=true.  Someone
else came up with the config so I am not sure why we are using it.  I will
try with it set to false to see if something else will break.  Thanks for
pointing that out.

This is my first time using DIH.  I really like what I have seen so far.

Bill


On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 The default in JdbcDataSource is to use ResultSet.getObject which
 returns the underlying database's type. The type specific methods in
 ResultSet are not invoked unless you are using convertType=true.

 Is MySQL actually returning java.sql.Timestamp objects?

 On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
  I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
 running
  into a very strange problem where data from a datetime column is being
  imported with the right date but the time as 00:00:00.  I tried using SQL
  DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.  In the
  raw debug response of DIH, it looks like the time portion of the datetime
  data is already 00:00:00 in the Solr jdbc query result.
 
  So I looked at the source code of DIH JdbcDataSource class.  It is using
  java.sql.ResultSet and its getDate() method to handle date column.  The
  getDate() method returns java.sql.Date.  The java api doc for
 java.sql.Date
 
  http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
 
  states that:
 
  To conform with the definition of SQL DATE, the millisecond values
 wrapped
  by a java.sql.Date instance must be 'normalized' by setting the hours,
  minutes, seconds, and milliseconds to zero in the particular time zone
 with
  which the instance is associated.
 
  This seems to be describing exactly my problem.  Has anyone else notice
  this problem?  Has anyone use DIH to index SQL datetime successfully?  If
  so can you send me the relevant portion of the DIH config?
 
  Bill



 --
 Regards,
 Shalin Shekhar Mangar.



Re: cores sharing an instance

2013-06-29 Thread Roman Chyla
Cores can be reloaded; they live inside the solr core loader /I forgot the exact
name/, and they will have different classloaders /that's a servlet thing/, so
if you want singletons you must load them outside of the core, using a
parent classloader - in the case of jetty, this means writing your own jetty
initialization or config to force shared class loaders, or finding a place
inside Solr before the core is created. Google for montysolr to see
an example of the first approach.

But, unless you really have no other choice, using singletons is IMHO a bad
idea in this case.

Roman

On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote:

 its the singleton pattern, where in my case i want an object (which is
RAM expensive) to be a centralized coordinator of application logic.

 thank you

 On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com
wrote:

  There is very little shared between multiple cores (instanceDir paths,
  logging config maybe?). Why are you trying to do this?
 
  On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com
wrote:
  Hi
 
  I have a multicore setup (in 4.3.0). Is it possible for one core to
share an instance of its class with other cores at run time? i.e.
 
  At run time core 1 makes an instance of object O_i
 
  core 1 -- object O_i
  core 2
  ---
  core n
 
  then can core K access O_i? I know they can share properties but is it
possible to share objects?
 
  thank you
 
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.



Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
Setting convertType=false does solve the datetime issue.  But there are now
other columns that were working before but not working now.  Since I have
already done some research into the datetime to date issue and not been
able to find a solution, I think I will have to keep convertType set to
false and deal with the other column types that are not working now.
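
For reference, convertType is an attribute on the DIH dataSource element;
a sketch (the driver, URL, and credentials are placeholders):

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="user" password="pass"
              convertType="false"/>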

Thanks for your help.

Bill


On Sat, Jun 29, 2013 at 10:24 AM, Bill Au bill.w...@gmail.com wrote:

 I just double check my config.  We are using convertType=true.  Someone
 else came up with the config so I am not sure why we are using it.  I will
 try with it set to false to see if something else will break.  Thanks for
 pointing that out.

 This is my first time using DIH.  I really like what I have seen so far.

 Bill


 On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 The default in JdbcDataSource is to use ResultSet.getObject which
 returns the underlying database's type. The type specific methods in
 ResultSet are not invoked unless you are using convertType=true.

 Is MySQL actually returning java.sql.Timestamp objects?

 On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
  I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
 running
  into a very strange problem where data from a datetime column being
  imported with the right date but the time is 00:00:00.  I tried using
 SQL
  DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.  The
  raw debug response of DIH, it looks like the time portion of the
 datetime
  data is already 00:00:00 in Solr jdbc query result.
 
  So I looked at the source code of DIH JdbcDataSource class.  It is using
  java.sql.ResultSet and its getDate() method to handle date column.  The
  getDate() method returns java.sql.Date.  The java api doc for
 java.sql.Date
 
  http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
 
  states that:
 
  To conform with the definition of SQL DATE, the millisecond values
 wrapped
  by a java.sql.Date instance must be 'normalized' by setting the hours,
  minutes, seconds, and milliseconds to zero in the particular time zone
 with
  which the instance is associated.
 
  This seems to be describing exactly my problem.  Has anyone else notice
  this problem?  Has anyone use DIH to index SQL datetime successfully?
  If
  so can you send me the relevant portion of the DIH config?
 
  Bill



 --
 Regards,
 Shalin Shekhar Mangar.





Re: broken links returned from solr search

2013-06-29 Thread Erick Erickson
What links? You haven't shown us what link you're clicking on
that generates the 404 error.

You might want to review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick


On Fri, Jun 28, 2013 at 2:04 PM, MA LIG mewa...@gmail.com wrote:

 Hello,

 I ran the solr example as described in
 http://lucene.apache.org/solr/4_3_1/tutorial.html and then loaded some doc
 files to solr as described in
 http://wiki.apache.org/solr/ExtractingRequestHandler. The commands I used
 to load the files were of the form

   curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F
 myfile=@test.doc

 I can successfully see search results in
 http://localhost:8983/solr/collection1/browse?q=test .

 However, when I click on a link, I get a 404 not found error. How can I
 make these links work properly?

 Thanks in advance

 -gw



Re: documentCache not used in 4.3.1?

2013-06-29 Thread Erick Erickson
It's especially weird that the hit ratio is so high and you're
not seeing anything in the cache. Are you perhaps soft
committing frequently? Soft commits throw away all the
top-level caches including documentCache I think

Erick


On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt t...@elementspace.comwrote:

 Thanks Otis,

 Yeah I realized after sending my e-mail that doc cache does not warm,
 however I'm still lost on why there are no other metrics.

 Thanks!

 Tim


 On 28 June 2013 16:22, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

  Hi Tim,
 
  Not sure about the zeros in 4.3.1, but in SPM we see all these numbers
  are non-0, though I haven't had the chance to confirm with Solr 4.3.1.
 
  Note that you can't really autowarm document cache...
 
  Otis
  --
   Solr & ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt t...@elementspace.com
  wrote:
   Hey guys,
  
   This has to be a stupid question/I must be doing something wrong, but
  after
   frequent load testing with documentCache enabled under Solr 4.3.1 with
   autoWarmCount=150, I'm noticing that my documentCache metrics are
 always
    zero for non-cumulative.
  
   At first I thought my commit rate is fast enough I just never see the
    non-cumulative result, but after 100s of samples I still always get zero
   values.
  
   Here is the current output of my documentCache from Solr's admin for 1
  core:
  
   
  
   - documentCache
     - class: org.apache.solr.search.LRUCache
     - version: 1.0
     - description: LRU Cache(maxSize=512, initialSize=512, autowarmCount=150, regenerator=null)
     - src: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java
     - stats:
       - lookups: 0
       - hits: 0
       - hitratio: 0.00
       - inserts: 0
       - evictions: 0
       - size: 0
       - warmupTime: 0
       - cumulative_lookups: 65198986
       - cumulative_hits: 63075669
       - cumulative_hitratio: 0.96
       - cumulative_inserts: 2123317
       - cumulative_evictions: 1010262
  
   The cumulative values seem to rise, suggesting doc cache is working,
 but
  at
    the same time it seems I never see non-cumulative metrics, most
  importantly
   warmupTime.
  
   Am I doing something wrong, is this normal/by-design, or is there an
  issue
   here?
  
   Thanks for helping with my silly question! Have a good weekend,
  
   Tim
 



Re: Improving performance to return 2000+ documents

2013-06-29 Thread Erick Erickson
Well, depending on how many docs get served
from the cache the time will vary. But this is
just ugly, if you can avoid this use-case it would
be a Good Thing.

Problem here is that each and every shard must
assemble the list of 2,000 documents (just ID and
sort criteria, usually score).

Then the node serving the original request merges
the sub-lists to pick the top 2,000. Then the node
sends another request to each shard to get
the full document. Then the node merges this
into the full list to return to the user.

Solr really isn't built for this use-case, is it actually
a compelling situation?

And having your document cache set at 1M is kinda
high if you have very big documents.

FWIW,
Erick


On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.comwrote:

 Also, I don't see a consistent response time from solr, I ran ab again and
 I get this:

 ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 \
 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
 


 Benchmarking x.amazonaws.com (be patient)
 Completed 100 requests
 Completed 200 requests
 Completed 300 requests
 Completed 400 requests
 Completed 500 requests
 Finished 500 requests


 Server Software:
 Server Hostname:   x.amazonaws.com
 Server Port:8983

 Document Path:

 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 Document Length:1538537 bytes

 Concurrency Level:  10
 Time taken for tests:   10.858 seconds
 Complete requests:  500
 Failed requests:8
(Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
 Write errors:   0
 Total transferred:  769297992 bytes
 HTML transferred:   769268492 bytes
 Requests per second:46.05 [#/sec] (mean)
 Time per request:   217.167 [ms] (mean)
 Time per request:   21.717 [ms] (mean, across all concurrent requests)
 Transfer rate:  69187.90 [Kbytes/sec] received

 Connection Times (ms)
   min  mean[+/-sd] median   max
 Connect:00   0.3  0   2
 Processing:   110  215  72.0190 497
 Waiting:   91  180  70.5152 473
 Total:112  216  72.0191 497

 Percentage of the requests served within a certain time (ms)
   50%191
   66%225
   75%252
   80%272
   90%319
   95%364
   98%420
   99%453
  100%497 (longest request)


 Sometimes it takes a lot of time, sometimes its pretty quick.

 Thanks,
 -Utkarsh


 On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Hello,
 
   I have a usecase where I need to retrieve the top 2000 documents matching a
   query.
   What are the parameters (in query, solrconfig, schema) I should look at to
   improve this?
  
   I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
   RAM, 8vCPU and 7GB JVM heap size.
  
   I have documentCache:
   <documentCache class="solr.LRUCache" size="100" initialSize="100" autowarmCount="0"/>
 
  allText is a copyField.
 
  This is the result I get:
   ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 \
   "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
  
 
  Benchmarking x.amazonaws.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Finished 500 requests
 
 
  Server Software:
  Server Hostname:x.amazonaws.com
  Server Port:8983
 
  Document Path:
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  Document Length:1538537 bytes
 
  Concurrency Level:  10
  Time taken for tests:   35.999 seconds
  Complete requests:  500
  Failed requests:21
 (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
  Write errors:   0
  Non-2xx responses:  2
  Total transferred:  766221660 bytes
  HTML transferred:   766191806 bytes
  Requests per second:13.89 [#/sec] (mean)
  Time per request:   719.981 [ms] (mean)
  Time per request:   71.998 [ms] (mean, across all concurrent
 requests)
  Transfer rate:  20785.65 [Kbytes/sec] received
 
  Connection Times (ms)
min  mean[+/-sd] median   max
  Connect:00   0.6  0   8
  Processing: 9  717 2339.6199   12611
  Waiting:9  635 2233.6164   12580
  Total:  9  718 2339.6199   12611
 
  Percentage of the requests served within a certain time (ms)
50%199
66%236
75%263
80%281
90%548
95%838
98%  12475
99%  12545
   100%  12611 (longest request)
 
  --
  Thanks,
  -Utkarsh
 



 --
 Thanks,
 -Utkarsh



Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-06-29 Thread Erick Erickson
Mike:

One issue is that you're forcing all the work onto the Solr
server, and single-threading to boot by using DIH. You can
consider moving to a SolrJ model where you can have
N clients sending data to Solr if you can partition the data
up amongst the N clients cleanly.
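
A rough sketch of that model (hypothetical - the file layout, field names,
and Solr URL are placeholders; assumes solrj 3.5 on the classpath):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    // batches docs and streams them to Solr on 4 background threads
    final StreamingUpdateSolrServer solr =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
    ExecutorService pool = Executors.newFixedThreadPool(4);
    // partition the input into one chunk per client thread
    for (final String chunk : new String[]{"part1.tsv", "part2.tsv",
                                           "part3.tsv", "part4.tsv"}) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            BufferedReader in = new BufferedReader(new FileReader(chunk));
            String line;
            while ((line = in.readLine()) != null) {
              String[] cols = line.split("\t", -1);
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", cols[0]);      // placeholder field mapping
              doc.addField("field1", cols[1]);
              solr.add(doc);
            }
            in.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    solr.commit();  // one commit at the very end
  }
}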

FWIW,
Erick


On Sat, Jun 29, 2013 at 8:20 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Mike,


 You could try http://wiki.apache.org/solr/UpdateCSV

 And make sure you commit at the very end.




 
  From: Mike L. javaone...@yahoo.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Saturday, June 29, 2013 3:15 AM
 Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5



 I've been working on improving index time with a JdbcDataSource DIH based
 config and found it not to be as performant as I'd hoped for, for various
 reasons, not specifically due to solr. With that said, I decided to switch
 gears a bit and test out FileDataSource setup... I assumed by eliminiating
 network latency, I should see drastic improvements in terms of import
 time..but I'm a bit surprised that this process seems to run much slower,
 at least the way I've initially coded it. (below)

 The below is a barebone file import that I wrote which consumes a tab
 delimited file. Nothing fancy here. The regex just seperates out the
 fields... Is there faster approach to doing this? If so, what is it?

 Also, what is the recommended approach in terms of index/importing data?
 I know thats may come across as a vague question as there are various
 options available, but which one would be considered the standard
 approach within a production enterprise environment.


 (below has been cleansed)

 <dataConfig>
   <dataSource name="file" type="FileDataSource" />
   <document>
     <entity name="entity1"
             processor="LineEntityProcessor"
             url="[location_of_file]/file.csv"
             dataSource="file"
             transformer="RegexTransformer,TemplateTransformer">
       <field column="rawLine"
              regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
              groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
     </entity>
   </document>
 </dataConfig>

 Thanks in advance,
 Mike



Re: cores sharing an instance

2013-06-29 Thread Erick Erickson
Well, the code is all in the same JVM, so there's no
reason a singleton approach wouldn't work that I
can think of. All the multithreaded caveats apply.
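
For illustration, a minimal sketch of such a shared object (the names are
hypothetical; note Roman's caveat elsewhere in this thread about per-core
classloaders):

public final class SharedCoordinator {
  private static volatile SharedCoordinator instance;

  private SharedCoordinator() {
    // expensive, RAM-heavy initialization happens once
  }

  public static SharedCoordinator getInstance() {
    if (instance == null) {                       // first check, no lock
      synchronized (SharedCoordinator.class) {
        if (instance == null) {                   // second check, under lock
          instance = new SharedCoordinator();
        }
      }
    }
    return instance;
  }
}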

Best
Erick


On Fri, Jun 28, 2013 at 3:44 PM, Peyman Faratin pey...@robustlinks.comwrote:

 Hi

 I have a multicore setup (in 4.3.0). Is it possible for one core to share
 an instance of its class with other cores at run time? i.e.

 At run time core 1 makes an instance of object O_i

 core 1 -- object O_i
 core 2
 ---
 core n

 then can core K access O_i? I know they can share properties but is it
 possible to share objects?

 thank you




Re: broken links returned from solr search

2013-06-29 Thread gilawem
Sorry, I thought it was obvious. The links that are broken are the links that 
are returned in the search results. Using the example in the documentation I 
mentioned below, after loading a word doc via
curl 
"http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F 
myfile=@myworddoc.doc

the broken link that shows up in the search results is 
http://localhost:8983/solr/collection1/doc1

so I just need to know where in the solr config I can handle requests 
when the URL points to collection/some_doc.

On Jun 29, 2013, at 1:08 PM, Erick Erickson wrote:

 What links? You haven't shown us what link you're clicking on
 that generates the 404 error.
 
 You might want to review:
 http://wiki.apache.org/solr/UsingMailingLists
 
 Best
 Erick
 
 
 On Fri, Jun 28, 2013 at 2:04 PM, MA LIG mewa...@gmail.com wrote:
 
 Hello,
 
 I ran the solr example as described in
 http://lucene.apache.org/solr/4_3_1/tutorial.html and then loaded some doc
 files to solr as described in
 http://wiki.apache.org/solr/ExtractingRequestHandler. The commands I used
 to load the files were of the form
 
  curl 
 "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F
 myfile=@test.doc
 
 I can successfully see search results in
 http://localhost:8983/solr/collection1/browse?q=test .
 
 However, when I click on a link, I get a 404 not found error. How can I
 make these links work properly?
 
 Thanks in advance
 
 -gw
 



Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
So disabling convertType does provide a workaround for my problem with
datetime column.  But the problem still exists when convertType is enabled
because DIH is not doing the conversion correctly for a solr date field.
 Solr date field does have a time portion but java.sql.Date does not.  So
DIH should not be calling ResultSet.getDate() for a solr date field.  It
should really be calling ResultSet.getTimestamp() instead.  Is the fix this
simple?  Am I missing anything?

If the fix is this simple I can submit and commit a patch to DIH.
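
For illustration, a sketch of the difference (the query and column name are
hypothetical):

import java.sql.*;

public class DateVsTimestamp {
  static void dump(Connection conn) throws SQLException {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT created_at FROM items");
    while (rs.next()) {
      Date d = rs.getDate("created_at");           // time normalized to 00:00:00
      Timestamp t = rs.getTimestamp("created_at"); // full date + time preserved
      System.out.println(d + " vs " + t);
    }
    rs.close();
    stmt.close();
  }
}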

Bill


On Sat, Jun 29, 2013 at 12:13 PM, Bill Au bill.w...@gmail.com wrote:

 Setting convertType=false does solve the datetime issue.  But there are
 now other columns that were working before but not working now.  Since I
 have already done some research into the datetime to date issue and not
 been able to find a solution, I think I will have to keep convertType set
 to false and deal with the other column type that are not working now.

 Thanks for your help.

 Bill


 On Sat, Jun 29, 2013 at 10:24 AM, Bill Au bill.w...@gmail.com wrote:

 I just double check my config.  We are using convertType=true.  Someone
 else came up with the config so I am not sure why we are using it.  I will
 try with it set to false to see if something else will break.  Thanks for
 pointing that out.

 This is my first time using DIH.  I really like what I have seen so far.

 Bill


 On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 The default in JdbcDataSource is to use ResultSet.getObject which
 returns the underlying database's type. The type specific methods in
 ResultSet are not invoked unless you are using convertType=true.

 Is MySQL actually returning java.sql.Timestamp objects?

 On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
  I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
 running
  into a very strange problem where data from a datetime column being
  imported with the right date but the time is 00:00:00.  I tried using
 SQL
  DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.
  The
  raw debug response of DIH, it looks like the time portion of the
 datetime
  data is already 00:00:00 in Solr jdbc query result.
 
  So I looked at the source code of DIH JdbcDataSource class.  It is
 using
  java.sql.ResultSet and its getDate() method to handle date column.  The
  getDate() method returns java.sql.Date.  The java api doc for
 java.sql.Date
 
  http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
 
  states that:
 
  To conform with the definition of SQL DATE, the millisecond values
 wrapped
  by a java.sql.Date instance must be 'normalized' by setting the hours,
  minutes, seconds, and milliseconds to zero in the particular time zone
 with
  which the instance is associated.
 
  This seems to be describing exactly my problem.  Has anyone else notice
  this problem?  Has anyone use DIH to index SQL datetime successfully?
  If
  so can you send me the relevant portion of the DIH config?
 
  Bill



 --
 Regards,
 Shalin Shekhar Mangar.






Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Shalin Shekhar Mangar
Yes we need to use getTimestamp instead of getDate. Please create an issue.

On Sat, Jun 29, 2013 at 11:48 PM, Bill Au bill.w...@gmail.com wrote:
 So disabling convertType does provide a workaround for my problem with
 datetime column.  But the problem still exists when convertType is enabled
 because DIH is not doing the conversion correctly for a solr date field.
  Solr date field does have a time portion but java.sql.Date does not.  So
 DIH should not be calling ResultSet.getDate() for a solr date field.  It
 should really be calling ResultSet.getTimestamp() instead.  Is the fix this
 simple?  Am I missing anything?

 If the fix is this simple I can submit and commit a patch to DIH.

 Bill


 On Sat, Jun 29, 2013 at 12:13 PM, Bill Au bill.w...@gmail.com wrote:

 Setting convertType=false does solve the datetime issue.  But there are
 now other columns that were working before but not working now.  Since I
 have already done some research into the datetime to date issue and not
 been able to find a solution, I think I will have to keep convertType set
 to false and deal with the other column type that are not working now.

 Thanks for your help.

 Bill


 On Sat, Jun 29, 2013 at 10:24 AM, Bill Au bill.w...@gmail.com wrote:

 I just double check my config.  We are using convertType=true.  Someone
 else came up with the config so I am not sure why we are using it.  I will
 try with it set to false to see if something else will break.  Thanks for
 pointing that out.

 This is my first time using DIH.  I really like what I have seen so far.

 Bill


 On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 The default in JdbcDataSource is to use ResultSet.getObject which
 returns the underlying database's type. The type specific methods in
 ResultSet are not invoked unless you are using convertType=true.

 Is MySQL actually returning java.sql.Timestamp objects?

 On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
  I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
 running
  into a very strange problem where data from a datetime column being
  imported with the right date but the time is 00:00:00.  I tried using
 SQL
  DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.
  The
  raw debug response of DIH, it looks like the time portion of the
 datetime
  data is already 00:00:00 in Solr jdbc query result.
 
  So I looked at the source code of DIH JdbcDataSource class.  It is
 using
  java.sql.ResultSet and its getDate() method to handle date column.  The
  getDate() method returns java.sql.Date.  The java api doc for
 java.sql.Date
 
  http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
 
  states that:
 
  To conform with the definition of SQL DATE, the millisecond values
 wrapped
  by a java.sql.Date instance must be 'normalized' by setting the hours,
  minutes, seconds, and milliseconds to zero in the particular time zone
 with
  which the instance is associated.
 
  This seems to be describing exactly my problem.  Has anyone else notice
  this problem?  Has anyone use DIH to index SQL datetime successfully?
  If
  so can you send me the relevant portion of the DIH config?
 
  Bill



 --
 Regards,
 Shalin Shekhar Mangar.







-- 
Regards,
Shalin Shekhar Mangar.


RE: documentCache not used in 4.3.1?

2013-06-29 Thread Vaillancourt, Tim
Yes, we are softCommit'ing every 1000ms, but that should be enough time to see 
metrics, right? For example, I still get non-cumulative metrics from the 
other caches (which are also thrown away). I've also curl/sampled enough that I 
probably should have seen a value by now.

If anyone else can reproduce this on 4.3.1 I will feel less crazy :).

Cheers,

Tim

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Saturday, June 29, 2013 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: documentCache not used in 4.3.1?

It's especially weird that the hit ratio is so high and you're not seeing 
anything in the cache. Are you perhaps soft committing frequently? Soft commits 
throw away all the top-level caches including documentCache I think

Erick


On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt t...@elementspace.comwrote:

 Thanks Otis,

 Yeah I realized after sending my e-mail that doc cache does not warm, 
 however I'm still lost on why there are no other metrics.

 Thanks!

 Tim


 On 28 June 2013 16:22, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

  Hi Tim,
 
  Not sure about the zeros in 4.3.1, but in SPM we see all these 
  numbers are non-0, though I haven't had the chance to confirm with Solr 
  4.3.1.
 
  Note that you can't really autowarm document cache...
 
  Otis
  --
  Solr & ElasticSearch Support -- http://sematext.com/ Performance 
  Monitoring -- http://sematext.com/spm
 
 
 
  On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt 
  t...@elementspace.com
  wrote:
   Hey guys,
  
   This has to be a stupid question/I must be doing something wrong, 
   but
  after
   frequent load testing with documentCache enabled under Solr 4.3.1 
   with autoWarmCount=150, I'm noticing that my documentCache metrics 
   are
 always
   zero for non-cumlative.
  
   At first I thought my commit rate is fast enough I just never see 
   the non-cumlative result, but after 100s of samples I still always 
   get zero values.
  
   Here is the current output of my documentCache from Solr's admin 
   for 1
  core:
  
   
  
 - documentCache
   - class: org.apache.solr.search.LRUCache
   - version: 1.0
   - description: LRU Cache(maxSize=512, initialSize=512, autowarmCount=150, regenerator=null)
   - src: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java
   - stats:
     - lookups: 0
     - hits: 0
     - hitratio: 0.00
     - inserts: 0
     - evictions: 0
     - size: 0
     - warmupTime: 0
     - cumulative_lookups: 65198986
     - cumulative_hits: 63075669
     - cumulative_hitratio: 0.96
     - cumulative_inserts: 2123317
     - cumulative_evictions: 1010262
 
  
   The cumulative values seem to rise, suggesting doc cache is 
   working,
 but
  at
   the same time it seems I never see non-cumlative metrics, most
  importantly
   warmupTime.
  
   Am I doing something wrong, is this normal/by-design, or is there 
   an
  issue
   here?
  
   Thanks for helping with my silly question! Have a good weekend,
  
   Tim
 



Re: broken links returned from solr search

2013-06-29 Thread Erick Erickson
There's nothing built into the indexing process that stores URLs allowing
you to fetch the document, you have to do that yourself. I'm not sure how
the link is getting into the search results, you're assigning doc1 as the
ID of the doc, and I think the browse request handler, aka Solritas, is
constructing the link as best it can. But that is only demo code, not
intended to fetch the document.

In a typical app, you'll construct a URL for display that has meaning in
_your_ environment, typically some way for the app server to know where the
document is and how to fetch it. the browse request handler is showing you
how you'd do this, but isn't meant to actually fetch the doc.
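
One common approach is to store the document's real location in the index at
indexing time, e.g. (assuming the schema has a stored "url" field; the file
host is a placeholder):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.url=http://files.example.com/myworddoc.doc&commit=true" \
     -F myfile=@myworddoc.doc

The front end can then render that stored url field as the result link
instead of the synthesized collection1/doc1 path.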

Best
Erick


On Sat, Jun 29, 2013 at 1:29 PM, gilawem mewa...@gmail.com wrote:

 Sorry, i thought it was obvious. The links that are broken are the links
 that are returned in the search results. Using the example in the
 documentation I mentioned below, to load a word doc via
 curl 
  "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F
 myfile=@myworddoc.doc

 the broken link that shows up in the search results is
 http://localhost:8983/solr/collection1/doc1

 so I just need to know where in the solr config to be able to handle
 requests when the URL points to collection/some_doc


 On Jun 29, 2013, at 1:08 PM, Erick Erickson wrote:

  What links? You haven't shown us what link you're clicking on
  that generates the 404 error.
 
  You might want to review:
  http://wiki.apache.org/solr/UsingMailingLists
 
  Best
  Erick
 
 
  On Fri, Jun 28, 2013 at 2:04 PM, MA LIG mewa...@gmail.com wrote:
 
  Hello,
 
  I ran the solr example as described in
  http://lucene.apache.org/solr/4_3_1/tutorial.html and then loaded some
 doc
  files to solr as described in
  http://wiki.apache.org/solr/ExtractingRequestHandler. The commands I
 used
  to load the files were of the form
 
   curl 
   "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true"
 -F
  myfile=@test.doc
 
  I can successfully see search results in
   http://localhost:8983/solr/collection1/browse?q=test .
 
  However, when I click on a link, I get a 404 not found error. How can I
  make these links work properly?
 
  Thanks in advance
 
  -gw
 




Re: documentCache not used in 4.3.1?

2013-06-29 Thread Erick Erickson
Tim:

Yeah, this doesn't make much sense to me either since,
as you say, you should be seeing some metrics upon
occasion. But do note that the underlying cache only gets
filled when getting documents to return in query results,
since there's no autowarming going on it may come and
go.

But you can test this pretty quickly by lengthening your
autocommit interval or just not indexing anything
for a while, then run a bunch of queries and look at your
cache stats. That'll at least tell you whether it works at all.
You'll have to have hard commits turned off (or openSearcher
set to 'false') for that check too.
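
For instance, a hypothetical solrconfig.xml setting for such a test (the
values are illustrative):

<autoCommit>
  <maxTime>600000</maxTime>          <!-- lengthen the interval, e.g. 10 min -->
  <openSearcher>false</openSearcher> <!-- hard commits won't replace the searcher -->
</autoCommit>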

Best
Erick


On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim tvaillanco...@ea.comwrote:

 Yes, we are softCommit'ing every 1000ms, but that should be enough time to
 see metrics though, right? For example, I still get non-cumulative metrics
 from the other caches (which are also throw away). I've also curl/sampled
 enough that I probably should have seen a value by now.

 If anyone else can reproduce this on 4.3.1 I will feel less crazy :).

 Cheers,

 Tim

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Saturday, June 29, 2013 10:13 AM
 To: solr-user@lucene.apache.org
 Subject: Re: documentCache not used in 4.3.1?

 It's especially weird that the hit ratio is so high and you're not seeing
 anything in the cache. Are you perhaps soft committing frequently? Soft
 commits throw away all the top-level caches including documentCache I
 think

 Erick


 On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt t...@elementspace.com
 wrote:

  Thanks Otis,
 
  Yeah I realized after sending my e-mail that doc cache does not warm,
  however I'm still lost on why there are no other metrics.
 
  Thanks!
 
  Tim
 
 
  On 28 June 2013 16:22, Otis Gospodnetic otis.gospodne...@gmail.com
  wrote:
 
   Hi Tim,
  
   Not sure about the zeros in 4.3.1, but in SPM we see all these
   numbers are non-0, though I haven't had the chance to confirm with
 Solr 4.3.1.
  
   Note that you can't really autowarm document cache...
  
   Otis
   --
    Solr & ElasticSearch Support -- http://sematext.com/ Performance
   Monitoring -- http://sematext.com/spm
  
  
  
   On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt
   t...@elementspace.com
   wrote:
Hey guys,
   
This has to be a stupid question/I must be doing something wrong,
but
   after
frequent load testing with documentCache enabled under Solr 4.3.1
with autoWarmCount=150, I'm noticing that my documentCache metrics
are
  always
zero for non-cumlative.
   
At first I thought my commit rate is fast enough I just never see
the non-cumlative result, but after 100s of samples I still always
get zero values.
   
Here is the current output of my documentCache from Solr's admin
for 1
   core:
   

   
   - documentCache
     - class: org.apache.solr.search.LRUCache
     - version: 1.0
     - description: LRU Cache(maxSize=512, initialSize=512, autowarmCount=150, regenerator=null)
     - src: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java
     - stats:
       - lookups: 0
       - hits: 0
       - hitratio: 0.00
       - inserts: 0
       - evictions: 0
       - size: 0
       - warmupTime: 0
       - cumulative_lookups: 65198986
       - cumulative_hits: 63075669
       - cumulative_hitratio: 0.96
       - cumulative_inserts: 2123317
       - cumulative_evictions: 1010262
  
   
The cumulative values seem to rise, suggesting doc cache is
working,
  but
   at
the same time it seems I never see non-cumlative metrics, most
   importantly
warmupTime.
   
Am I doing something wrong, is this normal/by-design, or is there
an
   issue
here?
   
Thanks for helping with my silly question! Have a good weekend,
   
Tim
  
 



Re: Solr 4.3.0 DIH problem with MySQL datetime being imported with time as 00:00:00

2013-06-29 Thread Bill Au
https://issues.apache.org/jira/browse/SOLR-4978


On Sat, Jun 29, 2013 at 2:33 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Yes we need to use getTimestamp instead of getDate. Please create an issue.

 On Sat, Jun 29, 2013 at 11:48 PM, Bill Au bill.w...@gmail.com wrote:
  So disabling convertType does provide a workaround for my problem with
  datetime column.  But the problem still exists when convertType is
 enabled
  because DIH is not doing the conversion correctly for a solr date field.
   Solr date field does have a time portion but java.sql.Date does not.  So
  DIH should not be calling ResultSet.getDate() for a solr date field.  It
  should really be calling ResultSet.getTimestamp() instead.  Is the fix
 this
  simple?  Am I missing anything?
 
  If the fix is this simple I can submit and commit a patch to DIH.
 
  Bill
 
 
  On Sat, Jun 29, 2013 at 12:13 PM, Bill Au bill.w...@gmail.com wrote:
 
  Setting convertType=false does solve the datetime issue.  But there are
  now other columns that were working before but not working now.  Since I
  have already done some research into the datetime to date issue and not
  been able to find a solution, I think I will have to keep convertType
 set
  to false and deal with the other column type that are not working now.
 
  Thanks for your help.
 
  Bill
 
 
  On Sat, Jun 29, 2013 at 10:24 AM, Bill Au bill.w...@gmail.com wrote:
 
  I just double check my config.  We are using convertType=true.  Someone
  else came up with the config so I am not sure why we are using it.  I
 will
  try with it set to false to see if something else will break.  Thanks
 for
  pointing that out.
 
  This is my first time using DIH.  I really like what I have seen so
 far.
 
  Bill
 
 
  On Sat, Jun 29, 2013 at 1:45 AM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
  The default in JdbcDataSource is to use ResultSet.getObject which
  returns the underlying database's type. The type specific methods in
  ResultSet are not invoked unless you are using convertType=true.
 
  Is MySQL actually returning java.sql.Timestamp objects?
 
  On Sat, Jun 29, 2013 at 5:22 AM, Bill Au bill.w...@gmail.com wrote:
   I am running Solr 4.3.0, using DIH to import data from MySQL.  I am
  running
   into a very strange problem where data from a datetime column being
   imported with the right date but the time is 00:00:00.  I tried
 using
  SQL
   DATE_FORMAT() and also DIH DateFormatTransformer but nothing works.
   The
    raw debug response of DIH, it looks like the time portion of the
  datetime
   data is already 00:00:00 in Solr jdbc query result.
  
   So I looked at the source code of DIH JdbcDataSource class.  It is
  using
   java.sql.ResultSet and its getDate() method to handle date column.
  The
   getDate() method returns java.sql.Date.  The java api doc for
  java.sql.Date
  
   http://docs.oracle.com/javase/6/docs/api/java/sql/Date.html
  
   states that:
  
   To conform with the definition of SQL DATE, the millisecond values
  wrapped
   by a java.sql.Date instance must be 'normalized' by setting the
 hours,
   minutes, seconds, and milliseconds to zero in the particular time
 zone
  with
   which the instance is associated.
  
   This seems to be describing exactly my problem.  Has anyone else
 notice
   this problem?  Has anyone use DIH to index SQL datetime
 successfully?
   If
   so can you send me the relevant portion of the DIH config?
  
   Bill
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 
 
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Improving performance to return 2000+ documents

2013-06-29 Thread Peter Sturge
Hello Utkarsh,
This may or may not be relevant for your use-case, but the way we deal with
this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time
(user selectable). We can then page the results, changing the start
parameter to return the next set. This allows us to 'retrieve' millions of
documents - we just do it at the user's leisure, rather than make them wait
for the whole lot in one go.
This works well because users very rarely want to see ALL 2000 (or whatever
number) documents at once - it's simply too much to take in at one time.
If your use-case involves an automated or offline procedure (e.g. running a
report or some data-mining op), then presumably it doesn't matter so much
if it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client-side? This will hugely
speed up your search time.
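
For example, using start/rows paging against the collection from this
thread (the host is a placeholder):

# page 1 (docs 0-99)
curl "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies&rows=100&start=0&wt=json"
# page 2 (docs 100-199)
curl "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies&rows=100&start=100&wt=json"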
HTH
Peter



On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.comwrote:

 Well, depending on how many docs get served
 from the cache the time will vary. But this is
 just ugly, if you can avoid this use-case it would
 be a Good Thing.

 Problem here is that each and every shard must
 assemble the list of 2,000 documents (just ID and
 sort criteria, usually score).

 Then the node serving the original request merges
 the sub-lists to pick the top 2,000. Then the node
 sends another request to each shard to get
 the full document. Then the node merges this
 into the full list to return to the user.

 Solr really isn't built for this use-case, is it actually
 a compelling situation?

 And having your document cache set at 1M is kinda
 high if you have very big documents.

 FWIW,
 Erick


 On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Also, I don't see a consistent response time from solr, I ran ab again
 and
  I get this:
 
  ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 \
  "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
  
 
 
  Benchmarking x.amazonaws.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Finished 500 requests
 
 
  Server Software:
  Server Hostname:   x.amazonaws.com
  Server Port:8983
 
  Document Path:
 
 
 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  Document Length:1538537 bytes
 
  Concurrency Level:  10
  Time taken for tests:   10.858 seconds
  Complete requests:  500
  Failed requests:8
 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
  Write errors:   0
  Total transferred:  769297992 bytes
  HTML transferred:   769268492 bytes
  Requests per second:46.05 [#/sec] (mean)
  Time per request:   217.167 [ms] (mean)
  Time per request:   21.717 [ms] (mean, across all concurrent
 requests)
  Transfer rate:  69187.90 [Kbytes/sec] received
 
  Connection Times (ms)
min  mean[+/-sd] median   max
  Connect:00   0.3  0   2
  Processing:   110  215  72.0190 497
  Waiting:   91  180  70.5152 473
  Total:112  216  72.0191 497
 
  Percentage of the requests served within a certain time (ms)
50%191
66%225
75%252
80%272
90%319
95%364
98%420
99%453
   100%497 (longest request)
 
 
  Sometimes it takes a lot of time, sometimes its pretty quick.
 
  Thanks,
  -Utkarsh
 
 
  On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Hello,
  
   I have a usecase where I need to retrive top 2000 documents matching a
   query.
   What are the parameters (in query, solrconfig, schema) I shoud look at
 to
   improve this?
  
   I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
   RAM, 8vCPU and 7GB JVM heap size.
  
   I have documentCache:
  <documentCache class="solr.LRUCache" size="100" initialSize="100" autowarmCount="0"/>
  
   allText is a copyField.
  
   This is the result I get:
   ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 \
   "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
   
  
   Benchmarking x.amazonaws.com (be patient)
   Completed 100 requests
   Completed 200 requests
   Completed 300 requests
   Completed 400 requests
   Completed 500 requests
   Finished 500 requests
  
  
   Server Software:
   Server Hostname:x.amazonaws.com
   Server Port:8983
  
   Document Path:
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   Document Length:1538537 bytes
  
   Concurrency Level:  10
   Time taken for tests:   35.999 seconds
   Complete requests:  500
   Failed requests:21
  (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
   Write errors:   0
   Non-2xx responses:  2
   Total 

Re: Varnish

2013-06-29 Thread William Bell
OK.

Here is the answer for us - a sample default.vcl. We validate that the
Last-Modified header ( if (!beresp.http.last-modified) )
changes when the core is reindexed and the version of the index changes.

This gives 10 minutes of caching and a 1hr grace period (if solr is down, it
will keep delivering results for up to 1 hr).

This uses the URL for caching.

You can also do:

http://localhost?PURGEME

To clear varnish if your IP is in the ACL list.


backend server1 {
    .host = "XXX.domain.com";
    .port = "8983";
    .probe = {
        .url = "/solr/pingall/select/?q=*%3A*";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}
backend server2 {
    .host = "XXX1.domain.com";
    .port = "8983";
    .probe = {
        .url = "/solr/pingall/select/?q=*%3A*";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}
backend server3 {
    .host = "XXX2.domain.com";
    .port = "8983";
    .probe = {
        .url = "/solr/pingall/select/?q=*%3A*";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}
backend server4 {
    .host = "XXX3.domain.com";
    .port = "8983";
    .probe = {
        .url = "/solr/pingall/select/?q=*%3A*";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}

director default round-robin {
  {
    .backend = server1;
  }
  {
    .backend = server2;
  }
  {
    .backend = server3;
  }
  {
    .backend = server4;
  }
}

acl purge {
    "localhost";
    "10.0.1.0"/24;
    "10.0.3.0"/24;
}


sub vcl_recv {
    if (req.url ~ "\?PURGEME$") {
        if (!client.ip ~ purge) {
            error 405 "Not allowed. " + client.ip;
        }
        ban("req.url ~ /");
        error 200 "Cached Cleared";
    }
    remove req.http.Cookie;
    if (req.backend.healthy) {
        set req.grace = 15s;
    } else {
        set req.grace = 1h;
    }
    return (lookup);
}

sub vcl_fetch {
    set beresp.grace = 1h;
    if (!beresp.http.last-modified) {
        set beresp.ttl = 600s;
    }
    if (beresp.ttl < 600s) {
        set beresp.ttl = 600s;
    }
    unset beresp.http.Set-Cookie;
}

sub vcl_deliver {
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT";
    } else {
        set resp.http.X-Cache = "MISS";
    }
}

sub vcl_hash {
    hash_data(req.url);
    return (hash);
}






On Tue, Jun 25, 2013 at 4:44 PM, Learner bbar...@gmail.com wrote:

 Check this link..
 http://lucene.472066.n3.nabble.com/SolrJ-HTTP-caching-td490063.html







-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Varnish

2013-06-29 Thread William Bell
On a large website, by putting 1 varnish in front of all 4 SOLR boxes we
were able to trim 25% off the load time (TTFB) of the page.

Our hit ratio was between 55 and 75%. We gave varnish 24GB of RAM, and were
not able to fill it under full load with a 10 minute cache timeout.

We get about 2.4M SOLR calls every 15 to 20 minutes.

One varnish was able to handle it with almost no lingering connections, and
load average of < 1.

Varnish is very optimized and worth trying.



On Sat, Jun 29, 2013 at 6:47 PM, William Bell billnb...@gmail.com wrote:





-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Http status 503 Error in solr cloud setup

2013-06-29 Thread Lance Norskog
I do not know what causes the error, but this setup will not work. You need 
one or three zookeepers. SolrCloud demands that a majority of the ZK 
servers agree, and with two ZKs a majority is both of them: losing either 
one breaks quorum, so two nodes buy you nothing over one.
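
If you go to three, point every Solr node at the full ensemble. A minimal 
sketch for the 4.x example jetty (the zk hostnames are placeholders for 
your own machines):

java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar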


On 06/29/2013 05:47 AM, Sagar Chaturvedi wrote:


Hi,

I set up 2 solr instances on 2 different machines and also configured 2 
zookeeper servers on those machines. When I start solr on both 
machines and try to access the solr web-admin, I get the following 
error in the browser --


Http status 503 -- server is shutting down

When I set up a single standalone solr without zookeeper, I do not get 
this error.


Any insights ?

Thanks and Regards,

Sagar Chaturvedi

Member Of Technical Staff

NEC Technologies India, Noida

09711931646





Re: Varnish

2013-06-29 Thread Lance Norskog
Solr HTTP caching also supports e-tags. These are unique keys for the 
output of a query. If you send a query twice, and the index has not 
changed, the return will be the same. The e-tag is generated from the 
query string and the index generation number.


If Varnish supports e-tags, you can keep some queries cached longer than 
your timeout.
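
For anyone who wants to turn this on: it is configured in solrconfig.xml 
under requestDispatcher. A minimal sketch -- the max-age value is just an 
example, adjust to taste:

<requestDispatcher>
  <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
    <cacheControl>max-age=600, public</cacheControl>
  </httpCaching>
</requestDispatcher>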


Lance

On 06/29/2013 05:51 PM, William Bell wrote:

On a large website, by putting 1 varnish in front of all 4 SOLR boxes we
were able to trim 25% off the load time (TTFB) of the page.

Our hit ratio was between 55 and 75%. We gave varnish 24GB of RAM, and were
not able to fill it under full load with a 10 minute cache timeout.

We get about 2.4M SOLR calls every 15 to 20 minutes.

One varnish was able to handle it with almost no lingering connections, and
load average of < 1.

Varnish is very optimized and worth trying.











Re: documentCache not used in 4.3.1?

2013-06-29 Thread Tim Vaillancourt

That's a good idea, I'll try that next week.

Thanks!

Tim

On 29/06/13 12:39 PM, Erick Erickson wrote:

Tim:

Yeah, this doesn't make much sense to me either since,
as you say, you should be seeing some metrics upon
occasion. But do note that the underlying cache only gets
filled when getting documents to return in query results;
since there's no autowarming going on, it may come and
go.

But you can test this pretty quickly by lengthening your
autocommit interval or just not indexing anything
for a while, then run a bunch of queries and look at your
cache stats. That'll at least tell you whether it works at all.
You'll have to have hard commits turned off (or openSearcher
set to 'false') for that check too.
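
In solrconfig.xml that looks roughly like this (the maxTime here is just
an example value):

<autoCommit>
  <maxTime>600000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>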

Best
Erick


On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote:


Yes, we are softCommit'ing every 1000ms, but that should be enough time to
see metrics though, right? For example, I still get non-cumulative metrics
from the other caches (which are also throw away). I've also curl/sampled
enough that I probably should have seen a value by now.

If anyone else can reproduce this on 4.3.1 I will feel less crazy :).

Cheers,

Tim

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, June 29, 2013 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: documentCache not used in 4.3.1?

It's especially weird that the hit ratio is so high and you're not seeing
anything in the cache. Are you perhaps soft committing frequently? Soft
commits throw away all the top-level caches including documentCache I
think

Erick


On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt t...@elementspace.com wrote:
Thanks Otis,

Yeah I realized after sending my e-mail that doc cache does not warm,
however I'm still lost on why there are no other metrics.

Thanks!

Tim


On 28 June 2013 16:22, Otis Gospodnetic otis.gospodne...@gmail.com wrote:


Hi Tim,

Not sure about the zeros in 4.3.1, but in SPM we see all these
numbers are non-0, though I haven't had the chance to confirm with
Solr 4.3.1.

Note that you can't really autowarm document cache...

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey guys,

This has to be a stupid question/I must be doing something wrong, but
after frequent load testing with documentCache enabled under Solr 4.3.1
with autoWarmCount=150, I'm noticing that my documentCache metrics are
always zero for non-cumulative.

At first I thought my commit rate is fast enough I just never see
the non-cumulative result, but after 100s of samples I still always
get zero values.

Here is the current output of my documentCache from Solr's admin
for 1 core:



- documentCache
  (http://localhost:8983/solr/#/channels_shard1_replica2/plugins/cache?entry=documentCache)
   - class: org.apache.solr.search.LRUCache
   - version: 1.0
   - description: LRU Cache(maxSize=512, initialSize=512,
     autowarmCount=150, regenerator=null)
   - src: $URL:
     https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java $
   - stats:
  - lookups: 0
  - hits: 0
  - hitratio: 0.00
  - inserts: 0
  - evictions: 0
  - size: 0
  - warmupTime: 0
  - cumulative_lookups: 65198986
  - cumulative_hits: 63075669
  - cumulative_hitratio: 0.96
  - cumulative_inserts: 2123317
  - cumulative_evictions: 1010262

The cumulative values seem to rise, suggesting doc cache is working, but
at the same time it seems I never see non-cumulative metrics, most
importantly warmupTime.

Am I doing something wrong, is this normal/by-design, or is there an
issue here?

Thanks for helping with my silly question! Have a good weekend,

Tim


Re: broken links returned from solr search

2013-06-29 Thread gilawem
OK thanks. So I guess I will set up my own normal webserver and treat the solr 
server as a sort of private web-based API (or possibly a front-end that, when a 
user clicks on a search result link, just redirects the user to my normal web 
server that has the related file). That's easy enough. If that's not how solr 
is supposed to be used, please feel free to let me know. Thanks!
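
One way to wire up the redirect idea (a sketch only; the url field is an 
assumption and would need a matching stored field in schema.xml, and the 
literal.url value should be URL-encoded in practice) is to pass the real 
file location as a literal at index time, then have the result page link 
to that stored value:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.url=http://mywebserver/files/myworddoc.doc&commit=true" -F myfile=@myworddoc.doc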

On Jun 29, 2013, at 3:34 PM, Erick Erickson wrote:

 There's nothing built into the indexing process that stores URLs allowing
 you to fetch the document, you have to do that yourself. I'm not sure how
 the link is getting into the search results, you're assigning doc1 as the
 ID of the doc, and I think the browse request handler, aka Solritas, is
 constructing the link as best it can. But that is only demo code, not
 intended to fetch the document.
 
 In a typical app, you'll construct a URL for display that has meaning in
 _your_ environment, typically some way for the app server to know where the
 document is and how to fetch it. the browse request handler is showing you
 how you'd do this, but isn't meant to actually fetch the doc.
 
 Best
 Erick
 
 
 On Sat, Jun 29, 2013 at 1:29 PM, gilawem mewa...@gmail.com wrote:
 
 Sorry, i thought it was obvious. The links that are broken are the links
 that are returned in the search results. Using the example in the
 documentation I mentioned below, to load a word doc via
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F
 myfile=@myworddoc.doc
 
 the broken link that shows up in the search results is
 http://localhost:8983/solr/collection1/doc1
 
 so I just need to know where in the solr config to be able to handle
 requests when the URL points to collection/some_doc
 
 
 On Jun 29, 2013, at 1:08 PM, Erick Erickson wrote:
 
 What links? You haven't shown us what link you're clicking on
 that generates the 404 error.
 
 You might want to review:
 http://wiki.apache.org/solr/UsingMailingLists
 
 Best
 Erick
 
 
 On Fri, Jun 28, 2013 at 2:04 PM, MA LIG mewa...@gmail.com wrote:
 
 Hello,
 
 I ran the solr example as described in
 http://lucene.apache.org/solr/4_3_1/tutorial.html and then loaded some
 doc
 files to solr as described in
 http://wiki.apache.org/solr/ExtractingRequestHandler. The commands I
 used
 to load the files were of the form
 
  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true"
  -F myfile=@test.doc
 
 I can successfully see search results in
  http://localhost:8983/solr/collection1/browse .
 
 However, when I click on a link, I get a 404 not found error. How can I
 make these links work properly?
 
 Thanks in advance
 
 -gw