DIH - LastModifiedDate - Format

2014-02-16 Thread PeriS
Hi,

I am using MySQL as the datastore and java.util.Date for the
last_modified_date. I'm seeing that the DIH doesn't seem to pick up records.
Is there a date format that I should use for DIH to compare properly and pick
up the records for indexing?
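For reference, DIH delta imports compare the column against
${dataimporter.last_index_time}, which Solr writes in 'yyyy-MM-dd HH:mm:ss'
format, directly comparable to a MySQL DATETIME. A sketch of the entity
configuration (table and column names hypothetical):

```xml
<!-- data-config.xml: delta import keyed on a DATETIME column -->
<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified_date &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE id='${dih.delta.id}'"/>
```

If the column is stored in a different timezone than the Solr server's clock,
the comparison can silently miss rows, which is a common reason delta imports
pick up nothing.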

Thanks
-Peri.S





Re: update in SolrCloud through C++ client

2014-02-16 Thread Ramkumar R. Aiyengar
If only availability is your concern, you can always keep a list of servers
to which your C++ clients will send requests, and round-robin amongst them.
If one of the servers goes down, you will either not be able to reach it or
will get a 5xx error in the HTTP response; you can then take it out of
circulation (and probably retry in the background, with some kind of ping
every minute or so to these down servers, to ascertain whether they have come
back, and then add them back to the list). This is something SolrJ does
currently. This doesn't technically need any ZooKeeper interaction.
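The client here is C++, but the bookkeeping is small either way; a minimal
sketch of the round-robin-with-blacklist logic in Python (class and method
names are mine, not SolrJ's):

```python
class RoundRobinBalancer:
    """Round-robin over a static server list, dropping servers that fail
    and restoring them when a background ping finds them alive again."""

    def __init__(self, servers):
        self.alive = list(servers)
        self.dead = []
        self._next = 0

    def pick(self):
        """Return the next live server to try, or None if all are down."""
        if not self.alive:
            return None
        server = self.alive[self._next % len(self.alive)]
        self._next += 1
        return server

    def mark_down(self, server):
        """Take a server out of circulation (unreachable host or HTTP 5xx)."""
        if server in self.alive:
            self.alive.remove(server)
            self.dead.append(server)

    def mark_up(self, server):
        """Put a server back once the periodic ping sees it respond again."""
        if server in self.dead:
            self.dead.remove(server)
            self.alive.append(server)
```

A background thread would ping each entry in `dead` every minute or so and
call `mark_up` on success, as described above.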

The biggest benefit that SolrJ provides (since 4.6, I think), though, is that
it finds the shard leader to send an update to using ZK, which saves a hop.
You can technically do this yourself by retrieving and listening to
cluster-state updates with a C++ ZK client (these are available) and doing
what SolrJ currently does. This would be good; the only drawback, apart from
the effort, is that improvements are still happening in the area of managing
clusters and how their state is saved in ZK. These changes might not break
your code, but at the same time you might not be able to take advantage of
them without additional effort.

An alternative approach is to link SolrJ into your C++ client using JNI.
This has the added benefit of using the Javabin format for requests, which
would bring some performance benefit.

In short, it comes down to what your performance requirements are. If
indexing speed and throughput are not that big a deal, just go with having a
list of servers and load balancing amongst the active ones. I would suggest
you try this anyway before concluding that you need the optimization.

If that is not enough, I would probably try the JNI route, and if that
fails, use a C ZK client to read the cluster state and use that knowledge to
decide where to send requests.
On 14 Feb 2014 10:58, "neerajp"  wrote:

> Hello All,
> I am using Solr for indexing my data. My client is in C++, so I make cURL
> requests to the Solr server for indexing.
> Now, I want to do indexing in SolrCloud mode using ZooKeeper for HA. I
> read the SolrCloud wiki page (http://wiki.apache.org/solr/SolrCloud).
>
> What I understand from the wiki is that we should always check the Solr
> instance status (up & running) in SolrCloud before making an update request.
> Can I not send the update request to ZooKeeper and let ZooKeeper forward it
> to the appropriate replica/leader? In the latter case I need not worry about
> which servers are up and running before making an indexing request.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/update-in-SolrCloud-through-C-client-tp4117340.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Luke 4.6.1 released

2014-02-16 Thread Dmitry Kan
Hello!

Luke 4.6.1 has just been released. Grab it here:

https://github.com/DmitryKey/luke/releases/tag/4.6.1

fixes:
loading the jar from the command line is now working fine.

-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan


Re: Solr Hot CPU and high load

2014-02-16 Thread Nitin Sharma
Thanks Tri


*a. Are you docs distributed evenly across shards: number of docs and size
of the shards*
>> Yes, the size of all the shards is equal (an ignorable delta on the order
of KB), and so is the # of docs

*b. Is your test client querying all nodes, or all the queries go to those
2 busy nodes?*
*>> *Yes, all nodes are receiving exactly the same number of queries


I have one more question. Do stored fields have a significant impact on the
performance of Solr queries? Is having 50% of the fields stored (out of 100
fields) significantly worse than having 20% of the fields stored?
(significantly == on the order of 100s of milliseconds, assuming all fields
are of the same size and type)

How are stored fields retrieved in general (always from disk, or loaded into
memory on the first query and read from memory thereafter)?

Thanks
Nitin



On Fri, Feb 14, 2014 at 11:45 AM, Tri Cao  wrote:

> 1. Yes, that's the right way to go, well, in theory at least :)
> 2. Yes, queries are always fanned out to all shards and will be as slow as
> the slowest shard. When I looked into Solr's distributed querying
> implementation a few months back, the support for graceful degradation for
> things like network failures and slow shards was not there yet.
> 3. I doubt mmap settings would impact your read-only load, and it seems
> you can easily fit your index in RAM. You could try to warm the file cache
> to make sure, with "cat $solr_dir > /dev/null".
>
> It's odd that only 2 nodes are at 100% in your setup. I would check a
> couple of things:
> a. Are you docs distributed evenly across shards: number of docs and size
> of the shards
> b. Is your test client querying all nodes, or all the queries go to those
> 2 busy nodes?
>
> Regards,
> Tri
>
> On Feb 14, 2014, at 10:52 AM, Nitin Sharma 
> wrote:
>
> Hello folks
>
> We are currently using SolrCloud 4.3.1. We have an 8-node SolrCloud cluster
> with 32 cores, 60GB of RAM, and SSDs. We are using ZK to manage the
> solrconfig used by our collections.
>
> We have many collections, and some of them are very large relative to the
> others. The shards of these big collections are on the order of gigabytes.
> We decided to split the bigger collections evenly across all nodes (8
> shards and 2 replicas) with maxNumShards > 1.
>
> We ran a read-only load test on one big collection, and we still see only
> 2 nodes running at 100% CPU while the rest blaze through the queries much
> faster (under 30% CPU), despite the collection being sharded across all
> nodes.
>
> I checked the JVM usage and found that none of the pools have high
> utilization (except survivor space, which is 100%). The GC cycles are on
> the order of ms and mostly scavenges; mark-and-sweep occurs once every
> 30 minutes.
>
> A few questions:
>
> 1. Sharding all collections (small and large) across all nodes evenly
> distributes the load and makes the system characteristics of all machines
> similar. Is this a recommended way to do it?
> 2. SolrCloud does a distributed query by default. So if a node is at
> 100% CPU, does it slow down the response time for the other nodes waiting
> for this query? (Or does it time out if it cannot get a response from a
> node within x seconds?)
> 3. Our collections use MMapDirectory, but I specifically haven't enabled
> anything related to mmap (locked pages under ulimit). Does this adversely
> affect performance? Or can it lock pages even without this?
>
> Thanks a lot in advance.
> Nitin
>
>


-- 
- N


Re: SolrCloud Zookeeper disconnection/reconnection

2014-02-16 Thread lboutros
Thanks a lot for your answer.

Is there a web page, on the wiki for instance, where we could find some JVM
settings or recommendations that we should use for Solr with particular index
configurations?

Ludovic.





-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Zookeeper-disconnection-reconnection-tp4117101p4117653.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud Zookeeper disconnection/reconnection

2014-02-16 Thread Ramkumar R. Aiyengar
Start with http://wiki.apache.org/solr/SolrPerformanceProblems. It has a
section on GC tuning and a link to some example settings.
On 16 Feb 2014 21:19, "lboutros"  wrote:

> Thanks a lot for your answer.
>
> Is there a web page, on the wiki for instance, where we could find some JVM
> settings or recommandations that we should used for Solr with some index
> configurations?
>
> Ludovic.
>


Re: Luke 4.6.1 released

2014-02-16 Thread Alexandre Rafalovitch
Does it work with Solr? I couldn't tell from this repo's description
what its Solr relevance is.

I am sure all the long-timers know, but for more recent Solr people,
the additional information would be useful.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, Feb 17, 2014 at 3:02 AM, Dmitry Kan  wrote:
> Hello!
>
> Luke 4.6.1 has been just released. Grab it here:
>
> https://github.com/DmitryKey/luke/releases/tag/4.6.1
>
> fixes:
> loading the jar from command line is now working fine.
>
> --
> Dmitry Kan
> Blog: http://dmitrykan.blogspot.com
> Twitter: twitter.com/dmitrykan


Re: Luke 4.6.1 released

2014-02-16 Thread Bill Bell
Yes, it works with Solr.

Bill Bell
Sent from mobile


> On Feb 16, 2014, at 3:38 PM, Alexandre Rafalovitch  wrote:
> 
> Does it work with Solr? I couldn't tell what the description was from
> this repo and it's Solr relevance.


Re: Solr Hot CPU and high load

2014-02-16 Thread Erick Erickson
Stored fields are what the Solr documentCache in solrconfig.xml
is all about.

My general feeling is that stored fields are mostly irrelevant to
search speed, especially if lazy loading is enabled. The only time
stored fields come into play is when assembling the final result
list, i.e. the 10 or 20 documents that you return. That does imply
disk I/O, and if you have massive fields there's also decompression
to add to the CPU load.
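Both knobs Erick mentions live in the <query> section of solrconfig.xml; the
sizes below are illustrative, not recommendations:

```xml
<query>
  <!-- caches the stored fields of recently returned documents -->
  <documentCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
  <!-- read only the requested stored fields; fetch the rest on demand -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```

Note that the documentCache cannot be usefully autowarmed, since internal
document IDs change from searcher to searcher, hence autowarmCount="0".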

So, as usual, "it depends". Try measuring where you restrict the returned
fields to just your uniqueKey field for one set of tests, then
try returning _everything_ for another.

Best,
Erick


On Sun, Feb 16, 2014 at 12:18 PM, Nitin Sharma
wrote:

> I have one more question. Do stored fields have significant impact on
> performance of solr queries? Having 50% of the fields stored ( out of 100
> fields) significantly worse that having 20% of the fields stored?
> (signficantly == orders of 100s of milliseconds assuming all fields are of
> the same size and type)
>
> How are stored fields retrieved in general (always from disk or loaded into
> memory in the first query and then going forward read from memory?)
>
> [...]


Solr index filename doesn't match with Solr version

2014-02-16 Thread Nguyen Manh Tien
Hello,

I upgraded recently from Solr 4.0 to Solr 4.6.
I checked the Solr index folder and found these files:

_aars_Lucene41_0.doc
_aars_Lucene41_0.pos
_aars_Lucene41_0.tim
_aars_Lucene41_0.tip

I don't know why they don't have Lucene46 in the file name.

Is there something wrong?

Thanks,
Tien


query parameters

2014-02-16 Thread Andreas Owen

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I
would like to use fq to force the following conditions:
   1: organisations is empty and roles is empty
   2: organisations contains one of the comma-delimited list in variable $org
   3: roles contains one of the comma-delimited list in variable $r
   4: rules 2 and 3

Snippet of what I've got (I haven't checked whether there is an "in" operator
like in SQL for the list value):


<lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <bool name="...">true</bool>
    <str name="qf">plain_text^10 editorschoice^200
        title^20 h_*^14
        tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
        contentmanager^5 links^5
        last_modified^5 url^5</str>
    <str name="fq">(organisations='' roles='') or (organisations=$org
        roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')</str>
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
    <str name="bf">div(clicks,max(displays,1))^8</str>
</lst>
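For what it's worth, Solr has no SQL-style "in" operator and no = syntax; the
usual way to express the four conditions is field queries, with the
comma-delimited lists expanded client-side into space-separated values
(values below are hypothetical). A pure-negative clause needs a leading *:*
so there is something to subtract from:

```
fq=(*:* -organisations:[* TO *] -roles:[* TO *])
   OR (organisations:(finance hr) AND roles:(editor admin))
   OR ((*:* -organisations:[* TO *]) AND roles:(editor admin))
   OR (organisations:(finance hr) AND (*:* -roles:[* TO *]))
```

Here -field:[* TO *] matches documents with no value in that field at all.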
   






Re: Solr index filename doesn't match with Solr version

2014-02-16 Thread Shawn Heisey
On 2/16/2014 7:25 PM, Nguyen Manh Tien wrote:
> I upgraded recently from solr 4.0 to solr 4.6,
> I check solr index folder and found there file
> 
> _aars_*Lucene41*_0.doc
> _aars_*Lucene41*_0.pos
> _aars_*Lucene41*_0.tim
> _aars_*Lucene41*_0.tip
> 
> I don't know why it don't have *Lucene46* in file name.

This is an indication that this part of the index is using a file format
introduced in Lucene 4.1.

Here's what I have for one of my index segments on a Solr 4.6.1 server:

_5s7_2h.del
_5s7.fdt
_5s7.fdx
_5s7.fnm
_5s7_Lucene41_0.doc
_5s7_Lucene41_0.pos
_5s7_Lucene41_0.tim
_5s7_Lucene41_0.tip
_5s7_Lucene45_0.dvd
_5s7_Lucene45_0.dvm
_5s7.nvd
_5s7.nvm
_5s7.si
_5s7.tvd
_5s7.tvx

It shows the same pieces as your list, but I am also using docValues in
my index, and those files indicate that they are using the format from
Lucene 4.5.  I'm not sure why there are not version numbers in *all* of
the file extensions -- that happens in the Lucene layer, which is a bit
of a mystery to me.

Thanks,
Shawn
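The codec markers in a listing like Shawn's can be pulled out mechanically; a
quick sketch (the helper name is mine):

```python
import re
from collections import defaultdict

def codec_versions(filenames):
    """Group index file extensions by the LuceneNN codec marker in the
    file name; files without a marker go under 'unversioned'."""
    groups = defaultdict(set)
    for name in filenames:
        m = re.search(r'_(Lucene\d+)_\d+\.(\w+)$', name)
        if m:
            groups[m.group(1)].add(m.group(2))
        else:
            groups['unversioned'].add(name.rsplit('.', 1)[-1])
    return {codec: sorted(exts) for codec, exts in groups.items()}
```

Run against the listing above, it makes the per-codec split (4.1 postings,
4.5 docValues, unversioned extensions) visible at a glance.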



Increasing number of SolrIndexSearcher (Leakage)?

2014-02-16 Thread Nguyen Manh Tien
Hello,

My Solr hit an OOM recently after I upgraded from Solr 4.0 to 4.6.1.
I checked the heap dump and found that it has many SolrIndexSearcher (SIS)
objects (24); I expected only 1 SIS because we have 1 core.

I ran some experiments:
- Right after starting Solr, there is only 1 SolrIndexSearcher.
- *But after I index some docs and run a softCommit, or a hardCommit with
openSearcher=false, the number of SolrIndexSearchers increases by 1.*
- With a hard commit with openSearcher=true, the number of SolrIndexSearchers
doesn't increase, but I found in the log that it opens a new searcher, so I
guess the old SIS is closed.

I don't know why the number of SIS objects increases like this and finally
causes OutOfMemory. Can SolrIndexSearcher leak?

Regards,
Tien
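For context, the commit settings in play live in the <updateHandler> section
of solrconfig.xml (times below are illustrative). A soft commit opens a new
searcher; a hard commit with openSearcher=false does not open a new
registered searcher; and each newly opened searcher should close the previous
one once warming finishes:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>           <!-- hard commit: flush segments to disk -->
    <openSearcher>false</openSearcher> <!-- without opening a new searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>            <!-- soft commit: opens a searcher for visibility -->
  </autoSoftCommit>
</updateHandler>
```

Note that even with openSearcher=false, Solr may open an internal realtime
searcher for NRT gets, so a heap dump can show more than one
SolrIndexSearcher; the question is whether the old ones are being released.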


Re: Solr index filename doesn't match with Solr version

2014-02-16 Thread Tri Cao
Lucene main file formats actually don't change a lot in 4.x (or even 5.x),
and the newer codecs just delegate to previous versions for most file types.
The newer file types don't typically include Lucene's version in file names.

For example, the Lucene 4.6 codec basically delegates the stored fields and
term vector file formats to 4.1, the doc format to 4.0, etc., and only
implements the new segment info / field info formats (the .si and .fnm
files):

https://github.com/apache/lucene-solr/blob/lucene_solr_4_6/lucene/core/src/java/org/apache/lucene/codecs/lucene46/Lucene46Codec.java#L50

Hope this helps,
Tri

On Feb 16, 2014, at 08:52 PM, Shawn Heisey wrote:
> [...]

Re: Increasing number of SolrIndexSearcher (Leakage)?

2014-02-16 Thread Shawn Heisey
On 2/16/2014 11:34 PM, Nguyen Manh Tien wrote:
> My solr got OOM recently after i upgraded from solr 4.0 to 4.6.1.
> I check heap dump and found that it has many SolrIndexSearcher (SIS)
> objects (24), i expect only 1 SIS because we have 1 core.
>
> [...]
>
> I don't know why number of SIS increase like this and finally cause
> OutOfMemory, can SolrIndexSearcher be leak?

It's always possible that you've hit a bug that results in a memory
leak, but it is not likely.  I'm running version 4.6.1 in production
without any problems.  A lot of other people are doing so as well.  I
suspect that there's a misconfiguration, a buggy JVM, or something else
that's out of the ordinary.

We'll need answers to a bunch of questions: What filesystem and
operating system are you running on?  What vendor and version is your
JVM?  Can you use a file sharing site or a paste website to share your
full solrconfig.xml file?  What servlet container are you using to run
Solr?  Depending on what we learn from these answers, more questions
might be coming.

Are there any messages at WARN or ERROR in your Solr logfile?  Note that
I am not referring to the logging tab in the admin UI here - you'll need
to look at the actual logfile.

Thanks,
Shawn