Re: Help required with SolrJ

2018-02-21 Thread Aakanksha Gupta
Thanks Shawn! That was just a small fix from my side. Thanks for your help!

On Tue, Feb 20, 2018 at 1:43 AM, Shawn Heisey  wrote:

> On 2/19/2018 8:49 AM, Aakanksha Gupta wrote:
> > Thanks for the quick solution. It works. I just had to replace %20 with
> > a space in query.addFilterQuery("timestamp:[151890840 TO 151891200]");
> >
> > Thanks a ton! :)
>
> Right, I didn't even really look closely at what was in the fq
> parameter, I just copied it. :)  Sorry about that -- if I'd looked
> more closely, I would have seen that what I was sending wouldn't work.
>
> SolrJ will handle the URL encoding for you, so it would have URL encoded
> the URL encoding, and Solr would receive the fq parameter with the %20
> intact.
>
> Glad you figured it out even with my mistake!
>
> Thanks,
> Shawn
>


Re: Help required with SolrJ

2018-02-21 Thread Aakanksha Gupta
Thanks Erick.

On Tue, Feb 20, 2018 at 1:11 AM, Erick Erickson 
wrote:

> Aakanksha:
>
> Be a little careful here, filter queries with timestamps can be
> tricky. The example you have is fine, but for end points with finer
> granularity it may be best if you don't cache them; see:
> https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
>
> Best,
> Erick
>
> On Mon, Feb 19, 2018 at 7:49 AM, Aakanksha Gupta
>  wrote:
> > Hi Shawn,
> > Thanks for the quick solution. It works. I just had to replace %20 with
> > a space in query.addFilterQuery("timestamp:[151890840 TO 151891200]");
> >
> > Thanks a ton! :)
> >
> > On Mon, Feb 19, 2018 at 11:43 PM, Shawn Heisey 
> > wrote:
> >
> >> On 2/19/2018 6:44 AM, Aakanksha Gupta wrote:
> >>
> >>> http://localhost:8983/solr/geoloc/select/?q=*:*&fq={!geofilt}&sfield=latlong&pt=-6.08165,145.8612430&d=100&wt=json&fq=timestamp:[151890840%20TO%20151891200]&fl=*,_dist_:geodist()
> >>>
> >>> But I'm not sure how to build the SolrJ equivalent of this query using
> >>> SolrQuery.
> >>>
> >>
> >> I haven't done anything with spatial yet.  But I do know how to
> translate
> >> Solr URLs into SolrJ code.  The code below constructs a query object
> >> equivalent to that URL.  If that URL works as-is, this code should do
> the
> >> same.
> >>
> >> I did not include the "wt" parameter, which controls the format of the
> >> response.  With SolrJ, the transfer format defaults to binary and should
> >> not be changed.  It CAN be changed, but any other choice would be less
> >> efficient, and the programmer doesn't need to worry about it.
> >>
> >>   query.setQuery("*:*");
> >>   query.addFilterQuery("{!geofilt}");
> >>   query.addFilterQuery("timestamp:[151890840%20TO%20151891200]");
> >>   query.set("sfield", "latlong");
> >>   query.set("pt", "-6.08165,145.8612430");
> >>   query.set("d", "100");
> >>   query.setFields("*", "_dist_:geodist()");
> >>
> >> I couldn't actually test this code, as I don't have any indexes with
> >> spatial data.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Help required with SolrJ

2018-02-19 Thread Shawn Heisey
On 2/19/2018 8:49 AM, Aakanksha Gupta wrote:
> Thanks for the quick solution. It works. I just had to replace %20 with a
> space in query.addFilterQuery("timestamp:[151890840 TO 151891200]");
> 
> Thanks a ton! :)

Right, I didn't even really look closely at what was in the fq
parameter, I just copied it. :)  Sorry about that -- if I'd looked
more closely, I would have seen that what I was sending wouldn't work.

SolrJ will handle the URL encoding for you, so it would have URL encoded
the URL encoding, and Solr would receive the fq parameter with the %20
intact.
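
In other words, a minimal sketch of the two variants (assuming the same
SolrQuery object as in the earlier code):

  // Buggy: SolrJ URL-encodes the string again, so Solr receives a literal %20
  query.addFilterQuery("timestamp:[151890840%20TO%20151891200]");

  // Fixed: pass a literal space; SolrJ encodes it exactly once
  query.addFilterQuery("timestamp:[151890840 TO 151891200]");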

Glad you figured it out even with my mistake!

Thanks,
Shawn


Re: Help required with SolrJ

2018-02-19 Thread Erick Erickson
Aakanksha:

Be a little careful here, filter queries with timestamps can be
tricky. The example you have is fine, but for end points with finer
granularity it may be best if you don't cache them; see:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/
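
As a sketch, the non-cached variant just adds the cache=false local param
to the filter (values as in this thread):

  query.addFilterQuery("{!cache=false}timestamp:[151890840 TO 151891200]");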

Best,
Erick

On Mon, Feb 19, 2018 at 7:49 AM, Aakanksha Gupta
 wrote:
> Hi Shawn,
> Thanks for the quick solution. It works. I just had to replace %20 with a
> space in query.addFilterQuery("timestamp:[151890840 TO 151891200]");
>
> Thanks a ton! :)
>
> On Mon, Feb 19, 2018 at 11:43 PM, Shawn Heisey 
> wrote:
>
>> On 2/19/2018 6:44 AM, Aakanksha Gupta wrote:
>>
>>> http://localhost:8983/solr/geoloc/select/?q=*:*&fq={!geofilt}&sfield=latlong&pt=-6.08165,145.8612430&d=100&wt=json&fq=timestamp:[151890840%20TO%20151891200]&fl=*,_dist_:geodist()
>>>
>>> But I'm not sure how to build the SolrJ equivalent of this query using
>>> SolrQuery.
>>>
>>
>> I haven't done anything with spatial yet.  But I do know how to translate
>> Solr URLs into SolrJ code.  The code below constructs a query object
>> equivalent to that URL.  If that URL works as-is, this code should do the
>> same.
>>
>> I did not include the "wt" parameter, which controls the format of the
>> response.  With SolrJ, the transfer format defaults to binary and should
>> not be changed.  It CAN be changed, but any other choice would be less
>> efficient, and the programmer doesn't need to worry about it.
>>
>>   query.setQuery("*:*");
>>   query.addFilterQuery("{!geofilt}");
>>   query.addFilterQuery("timestamp:[151890840%20TO%20151891200]");
>>   query.set("sfield", "latlong");
>>   query.set("pt", "-6.08165,145.8612430");
>>   query.set("d", "100");
>>   query.setFields("*", "_dist_:geodist()");
>>
>> I couldn't actually test this code, as I don't have any indexes with
>> spatial data.
>>
>> Thanks,
>> Shawn
>>
>>


Re: Help required with SolrJ

2018-02-19 Thread Aakanksha Gupta
Hi Shawn,
Thanks for the quick solution. It works. I just had to replace %20 with a
space in query.addFilterQuery("timestamp:[151890840 TO 151891200]");

Thanks a ton! :)

On Mon, Feb 19, 2018 at 11:43 PM, Shawn Heisey 
wrote:

> On 2/19/2018 6:44 AM, Aakanksha Gupta wrote:
>
>> http://localhost:8983/solr/geoloc/select/?q=*:*&fq={!geofilt}&sfield=latlong&pt=-6.08165,145.8612430&d=100&wt=json&fq=timestamp:[151890840%20TO%20151891200]&fl=*,_dist_:geodist()
>>
>> But I'm not sure how to build the SolrJ equivalent of this query using
>> SolrQuery.
>>
>
> I haven't done anything with spatial yet.  But I do know how to translate
> Solr URLs into SolrJ code.  The code below constructs a query object
> equivalent to that URL.  If that URL works as-is, this code should do the
> same.
>
> I did not include the "wt" parameter, which controls the format of the
> response.  With SolrJ, the transfer format defaults to binary and should
> not be changed.  It CAN be changed, but any other choice would be less
> efficient, and the programmer doesn't need to worry about it.
>
>   query.setQuery("*:*");
>   query.addFilterQuery("{!geofilt}");
>   query.addFilterQuery("timestamp:[151890840%20TO%20151891200]");
>   query.set("sfield", "latlong");
>   query.set("pt", "-6.08165,145.8612430");
>   query.set("d", "100");
>   query.setFields("*", "_dist_:geodist()");
>
> I couldn't actually test this code, as I don't have any indexes with
> spatial data.
>
> Thanks,
> Shawn
>
>


Re: Help required with SolrJ

2018-02-19 Thread Shawn Heisey

On 2/19/2018 6:44 AM, Aakanksha Gupta wrote:

http://localhost:8983/solr/geoloc/select/?q=*:*&fq={!geofilt}&sfield=latlong&pt=-6.08165,145.8612430&d=100&wt=json&fq=timestamp:[151890840%20TO%20151891200]&fl=*,_dist_:geodist()

But I'm not sure how to build the SolrJ equivalent of this query using
SolrQuery.


I haven't done anything with spatial yet.  But I do know how to 
translate Solr URLs into SolrJ code.  The code below constructs a query 
object equivalent to that URL.  If that URL works as-is, this code 
should do the same.


I did not include the "wt" parameter, which controls the format of the 
response.  With SolrJ, the transfer format defaults to binary and should 
not be changed.  It CAN be changed, but any other choice would be less 
efficient, and the programmer doesn't need to worry about it.


  query.setQuery("*:*");
  query.addFilterQuery("{!geofilt}");
  query.addFilterQuery("timestamp:[151890840%20TO%20151891200]");
  query.set("sfield", "latlong");
  query.set("pt", "-6.08165,145.8612430");
  query.set("d", "100");
  query.setFields("*", "_dist_:geodist()");

I couldn't actually test this code, as I don't have any indexes with 
spatial data.
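
For completeness, here is a self-contained sketch of running this end to
end; the client construction, core name, and result handling are
illustrative assumptions, and the timestamp filter uses the literal-space
form the thread later settles on:

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class GeoQueryExample {
      public static void main(String[] args) throws Exception {
          // Point the client at the core from the URL in the question
          SolrClient client =
              new HttpSolrClient.Builder("http://localhost:8983/solr/geoloc").build();

          SolrQuery query = new SolrQuery();
          query.setQuery("*:*");
          query.addFilterQuery("{!geofilt}");
          query.addFilterQuery("timestamp:[151890840 TO 151891200]");
          query.set("sfield", "latlong");
          query.set("pt", "-6.08165,145.8612430");
          query.set("d", "100");
          query.setFields("*", "_dist_:geodist()");

          // Execute and print each matching document with its distance
          QueryResponse response = client.query(query);
          for (SolrDocument doc : response.getResults()) {
              System.out.println(doc);
          }
          client.close();
      }
  }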


Thanks,
Shawn



Re: Help Required

2014-08-14 Thread Dmitry Kan
Thanks a lot, Shawn!

Dmitry


On Wed, Aug 13, 2014 at 4:22 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/13/2014 5:11 AM, Dmitry Kan wrote:
  OK, thanks. Can you please add my user name to the Contributor group?
 
  username: DmitryKan

 You are added.  Edit away!

 Thanks,
 Shawn




-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Help Required

2014-08-13 Thread Dmitry Kan
OK, thanks. Can you please add my user name to the Contributor group?

username: DmitryKan



On Tue, Aug 12, 2014 at 5:41 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/12/2014 3:57 AM, Dmitry Kan wrote:
  Hi,
 
  is http://wiki.apache.org/solr/Support page immutable?

 All pages on that wiki are changeable by end users.  You just need to
 create an account on the wiki and then ask on this list to have your
 wiki username added to the Contributor group.

 Thanks,
 Shawn




-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Help Required

2014-08-13 Thread Shawn Heisey
On 8/13/2014 5:11 AM, Dmitry Kan wrote:
 OK, thanks. Can you please add my user name to the Contributor group?
 
 username: DmitryKan

You are added.  Edit away!

Thanks,
Shawn



Re: Help Required

2014-08-12 Thread Dmitry Kan
Hi,

is http://wiki.apache.org/solr/Support page immutable?

Dmitry


On Fri, Aug 8, 2014 at 4:24 PM, Jack Krupansky j...@basetechnology.com
wrote:

 And the Solr Support list is where people register their available
 consulting services:
 http://wiki.apache.org/solr/Support

 -- Jack Krupansky

 -Original Message- From: Alexandre Rafalovitch
 Sent: Friday, August 8, 2014 9:12 AM
 To: solr-user
 Subject: Re: Help Required


 We don't mediate job offers/positions on this list. We help people
 learn how to make these kinds of things themselves. If you are a
 developer, you may find that it takes only several days to get a
 strong feel for Solr, especially if you start from tutorials/the right
 books.

 To find developers, using the normal job boards would probably be more
 efficient. That way you can list location, salary, timelines, etc.

 Regards,
   Alex.
 P.s. CityPantry does not actually seem to do what you are asking. They
 are starting from postcode, though possibly use the geodistance
 sorting afterwards.
 P.p.s. Yes, Solr can help with distance-based sorting.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On Fri, Aug 8, 2014 at 11:36 AM, INGRID MARSH
 ingridma...@btinternet.com wrote:

 Dear Sirs,

 I wonder if you can help me?

 I'm looking for a developer who uses Solr to build for me a faceted search
 facility using location. In a nutshell, I need this functionality as in here:

 www.citypantry.com
 wwwdinein.

 Here the vendor, via Google Maps, enters the area/radius they cover, which
 enables the user to enter their postcode and be presented with the users who
 serve/cover their area. Is this what Solr does?

 can you put me in touch with small developers who can help?

 Thanks so much.


 Ingrid Marsh





-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Help Required

2014-08-12 Thread Shawn Heisey
On 8/12/2014 3:57 AM, Dmitry Kan wrote:
 Hi,

 is http://wiki.apache.org/solr/Support page immutable?

All pages on that wiki are changeable by end users.  You just need to
create an account on the wiki and then ask on this list to have your
wiki username added to the Contributor group.

Thanks,
Shawn



Re: Help Required

2014-08-08 Thread Alexandre Rafalovitch
We don't mediate job offers/positions on this list. We help people
learn how to make these kinds of things themselves. If you are a
developer, you may find that it takes only several days to get a
strong feel for Solr, especially if you start from tutorials/the right
books.

To find developers, using the normal job boards would probably be more
efficient. That way you can list location, salary, timelines, etc.

Regards,
   Alex.
P.s. CityPantry does not actually seem to do what you are asking. They
are starting from postcode, though possibly use the geodistance
sorting afterwards.
P.p.s. Yes, Solr can help with distance-based sorting.
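P.p.p.s. As a sketch, a distance-sorted query looks roughly like
http://localhost:8983/solr/select?q=*:*&sfield=latlong&pt=51.5074,-0.1278&d=10&fq={!geofilt}&sort=geodist()+asc
(the field name and coordinates are made up for illustration).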
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Fri, Aug 8, 2014 at 11:36 AM, INGRID MARSH
ingridma...@btinternet.com wrote:
 Dear Sirs,

 I wonder if you can help me?

 I'm looking for a developer who uses Solr to build for me a faceted search 
 facility using location. In a nutshell, I need this functionality as in here:

 www.citypantry.com
 wwwdinein.

 Here the vendor, via Google Maps, enters the area/radius they cover, which 
 enables the user to enter their postcode and be presented with the users who 
 serve/cover their area. Is this what Solr does?

 can you put me in touch with small developers who can help?

 Thanks so much.


 Ingrid Marsh


Re: Help Required

2014-08-08 Thread Jack Krupansky
And the Solr Support list is where people register their available 
consulting services:

http://wiki.apache.org/solr/Support

-- Jack Krupansky

-Original Message- 
From: Alexandre Rafalovitch

Sent: Friday, August 8, 2014 9:12 AM
To: solr-user
Subject: Re: Help Required

We don't mediate job offers/positions on this list. We help people
learn how to make these kinds of things themselves. If you are a
developer, you may find that it takes only several days to get a
strong feel for Solr, especially if you start from tutorials/the right
books.

To find developers, using the normal job boards would probably be more
efficient. That way you can list location, salary, timelines, etc.

Regards,
  Alex.
P.s. CityPantry does not actually seem to do what you are asking. They
are starting from postcode, though possibly use the geodistance
sorting afterwards.
P.p.s. Yes, Solr can help with distance-based sorting.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Fri, Aug 8, 2014 at 11:36 AM, INGRID MARSH
ingridma...@btinternet.com wrote:

Dear Sirs,

I wonder if you can help me?

I'm looking for a developer who uses Solr to build for me a faceted search 
facility using location. In a nutshell, I need this functionality as in 
here:


www.citypantry.com
wwwdinein.

Here the vendor, via Google Maps, enters the area/radius they cover, which 
enables the user to enter their postcode and be presented with the users 
who serve/cover their area. Is this what Solr does?


can you put me in touch with small developers who can help?

Thanks so much.


Ingrid Marsh 




Re: Help required with fq syntax

2013-06-09 Thread Kamal Palei
Hi Otis
Your suggestion worked fine.

Thanks
kamal


On Sun, Jun 9, 2013 at 7:58 AM, Kamal Palei palei.ka...@gmail.com wrote:

  Though the syntax looks fine, I get all the records. As per the example
  given above I get all the documents, meaning filtering did not work. I am
  curious to know if my indexing went fine or not. I will check and revert
  back.


 On Sun, Jun 9, 2013 at 7:21 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

 Try:

 ...q=*:*&fq=-blocked_company_ids:5

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Sat, Jun 8, 2013 at 9:37 PM, Kamal Palei palei.ka...@gmail.com
 wrote:
  Dear All
  I have a multi-valued field blocked_company_ids in index.
 
  You can think like
 
  1. document1 , blocked_company_ids: 1, 5, 7
  2. document2 , blocked_company_ids: 2, 6, 7
  3. document3 , blocked_company_ids: 4, 5, 6
 
  and so on .
 
  If I want to retrieve all the documents where blocked_company_ids does
  not contain one particular company id, say 5.
 
  So my search result should give me only document2, as document1 and
  document3 both contain 5.
 
  To achieve this, what should the fq syntax look like? Is it something
  like below?
 
  fq=blocked_company_ids:-5
 
  I tried the above syntax, but it gives me 0 records.
 
  Can somebody help me with the syntax, please, and point me to where all
  the syntax details are given?
 
  Thanks
  Kamal
  Net Cloud Systems





Re: Help required with fq syntax

2013-06-08 Thread Otis Gospodnetic
Try:

...q=*:*&fq=-blocked_company_ids:5
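
As a full request, that might look like (collection name illustrative):

http://localhost:8983/solr/collection1/select?q=*:*&fq=-blocked_company_ids:5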

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Sat, Jun 8, 2013 at 9:37 PM, Kamal Palei palei.ka...@gmail.com wrote:
 Dear All
 I have a multi-valued field blocked_company_ids in index.

 You can think like

 1. document1 , blocked_company_ids: 1, 5, 7
 2. document2 , blocked_company_ids: 2, 6, 7
 3. document3 , blocked_company_ids: 4, 5, 6

 and so on .

 If I want to retrieve all the documents where blocked_company_ids does
 not contain one particular company id, say 5.

 So my search result should give me only document2, as document1 and
 document3 both contain 5.

 To achieve this, what should the fq syntax look like? Is it something
 like below?

 fq=blocked_company_ids:-5

 I tried the above syntax, but it gives me 0 records.

 Can somebody help me with the syntax, please, and point me to where all
 the syntax details are given?

 Thanks
 Kamal
 Net Cloud Systems


Re: Help required with fq syntax

2013-06-08 Thread Kamal Palei
Also please note that for some documents, blocked_company_ids may not be
present at all. In such cases that document should be present in the search
result as well.
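
(A note on that: a pure negative filter like fq=-blocked_company_ids:5
already behaves this way, because Solr evaluates a top-level negative
clause against all documents, so documents with no blocked_company_ids
value pass the filter. An equivalent explicit form, as a sketch, is
fq=(*:* -blocked_company_ids:5).)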

BR,
Kamal


On Sun, Jun 9, 2013 at 7:07 AM, Kamal Palei palei.ka...@gmail.com wrote:

 Dear All
 I have a multi-valued field blocked_company_ids in index.

 You can think like

 1. document1 , blocked_company_ids: 1, 5, 7
 2. document2 , blocked_company_ids: 2, 6, 7
 3. document3 , blocked_company_ids: 4, 5, 6

 and so on .

 If I want to retrieve all the documents where blocked_company_ids does
 not contain one particular company id, say 5.

 So my search result should give me only document2, as document1 and
 document3 both contain 5.

 To achieve this, what should the fq syntax look like? Is it something
 like below?

 fq=blocked_company_ids:-5

 I tried the above syntax, but it gives me 0 records.

 Can somebody help me with the syntax, please, and point me to where all
 the syntax details are given?

 Thanks
 Kamal
 Net Cloud Systems




Re: Help required with fq syntax

2013-06-08 Thread Kamal Palei
Though the syntax looks fine, I get all the records. As per the example
given above I get all the documents, meaning filtering did not work. I am
curious to know if my indexing went fine or not. I will check and revert
back.


On Sun, Jun 9, 2013 at 7:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Try:

  ...q=*:*&fq=-blocked_company_ids:5

 Otis
 --
  Solr & ElasticSearch Support
 http://sematext.com/





 On Sat, Jun 8, 2013 at 9:37 PM, Kamal Palei palei.ka...@gmail.com wrote:
  Dear All
  I have a multi-valued field blocked_company_ids in index.
 
  You can think like
 
  1. document1 , blocked_company_ids: 1, 5, 7
  2. document2 , blocked_company_ids: 2, 6, 7
  3. document3 , blocked_company_ids: 4, 5, 6
 
  and so on .
 
  If I want to retrieve all the documents where blocked_company_ids does
  not contain one particular company id, say 5.
 
  So my search result should give me only document2, as document1 and
  document3 both contain 5.
 
  To achieve this, what should the fq syntax look like? Is it something
  like below?
 
  fq=blocked_company_ids:-5
 
  I tried the above syntax, but it gives me 0 records.
 
  Can somebody help me with the syntax, please, and point me to where all
  the syntax details are given?
 
  Thanks
  Kamal
  Net Cloud Systems



Re: help required: how to design a large scale solr system

2008-09-24 Thread Mark Miller

From my limited experience:

I think you might have a bit of trouble getting 60 mil docs on a single 
machine. Cached queries will probably still be *very* fast, but non 
cached queries are going to be very slow in many cases. Is that 5 
seconds for all queries? You will never meet that on first run queries 
with 60mil docs on that machine. The light query load might make things 
workable...but you're near the limits of a single machine (4 core or not) 
with 60 mil. You want to use a very good stopword list...common term 
queries will be killer. The docs being so small will be your only 
possible savior if you go the one machine route - that and cached hits. 
You don't have enough ram to get as much of the filesystem into RAM as 
youd like for 60 mil docs either.


I think you might try two machines with 30, 3 with 20, or 4 with 15. The 
more you spread, even with slower machines, the faster you're likely to 
index, which, as you say, will take a long time for 60 mil docs (start 
today <g>). Multiple machines will help the indexing speed the most for 
sure - it's still going to take a long time.


I don't think you will get much advantage using more than one solr 
install on a single machine - if you do, that should be addressed in the 
code, even with RAID.


So I say, spread if you can. Faster indexing, faster search, easy to 
expand later. Distributed search is so easy with solr 1.3, you won't 
regret it. I think there is a bug to be addressed if you're needing this 
in a week though - in my experience, with distributed search, for every 
million docs on a machine beyond the first, you lose a doc in a search 
across all machines (i.e. 1 mil on machine 1, 1 million on machine 2, a 
*:* search will be missing 1 doc; 10 mil each on 3 machines, a *:* 
search will be missing 30). Not a big deal, but could be a concern for 
some with picky, "look at everything" customers.
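
For reference, distributed search in 1.3 is just an extra request
parameter on the query; a sketch with made-up hosts:

http://host1:8983/solr/select?q=*:*&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr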


- Mark

Ben Shlomo, Yatir wrote:

Hi!

I am already using solr 1.2 and happy with it.

In a new project with a very tight deadline (10 development days from
today) I need to setup a more ambitious system in terms of scale
Here is the spec:

 


* I need to index about 60,000,000
documents 


* Each document has 11 textual fields to be indexed & stored
and 4 more fields to be stored only 


* Most fields are short (2-14 characters) however 2 indexed
fields can be up to 1KB and another stored field is up to 1KB 


* On average every document is about 0.5 KB to be stored and
0.4KB to be indexed 


* The SLA for data freshness is a full nightly re-index (I
cannot obtain incremental update/delete lists of the modified
documents) 

* The SLA for query time is 5 seconds 

* the number of expected queries is 2-3 queries per second 


* the queries are simply a combination of Boolean operations and
name searches (no fancy fuzzy searches and Levenshtein distances, no
faceting, etc) 


* I have a 64 bit Dell 2950 4-cpu machine  (2 dual cores ) with
RAID 10, 200 GB HD space, and 8GB RAM memory 


* The documents are not given to me explicitly - I am given
raw documents in RAM - one by one, from which I create my document in
RAM,
and then I can either http-post it to index it directly or append it to
a tsv file for later indexing 


* Each document has a unique ID

 


I have a few directions I am thinking about

 


The simple approach

* Have one solr instance that will index
the entire document set (from files). I am afraid this will take too
much time

 


Direction 1

* Create TSV files from all the
documents - this will take around 3-4 hours 


* Have all the documents partitioned
into several subsets (how many should I choose? ) 


* Have multiple solr instances on the
same machine 


* Let each solr instance concurrently
index the appropriate subset 


* At the end merge all the indices using
the IndexMergeTool - (how much time will it take ?)

 


Direction 2

* Like  the previous but instead of
using the IndexMergeTool , use distributed search with shards (upgrading
to solr 1.3)

 


Direction 3,4

* Like previous directions only avoid
using TSV files at all and directly index the documents from RAM

Questions:

* Which direction do you recommend in order to meet the SLAs in
the fastest way? 


* Since I have RAID on the machine, can I gain performance by
using multiple solr instances on the same machine, or will only multiple
machines help me? 


* What's the minimal number of machines I should require (I
might get more, weaker machines) 

* How many concurrent indexers are recommended? 


* Do you agree that the bottleneck is the indexing time?

Any help is appreciated 

RE: help required: how to design a large scale solr system

2008-09-24 Thread Ben Shlomo, Yatir
Thanks Mark!.
Do you have any comment regarding the performance differences between
indexing TSV files as opposed to directly indexing each document via
http post?

-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: help required: how to design a large scale solr system


 From my limited experience:

I think you might have a bit of trouble getting 60 mil docs on a single 
machine. Cached queries will probably still be *very* fast, but non 
cached queries are going to be very slow in many cases. Is that 5 
seconds for all queries? You will never meet that on first run queries 
with 60mil docs on that machine. The light query load might make things 
workable...but you're near the limits of a single machine (4 core or not) 
with 60 mil. You want to use a very good stopword list...common term 
queries will be killer. The docs being so small will be your only 
possible savior if you go the one machine route - that and cached hits. 
You don't have enough ram to get as much of the filesystem into RAM as 
youd like for 60 mil docs either.

I think you might try two machines with 30, 3 with 20, or 4 with 15. The

more you spread, even with slower machines, the faster you're likely to 
index, which, as you say, will take a long time for 60 mil docs (start 
today <g>). Multiple machines will help the indexing speed the most for 
sure - it's still going to take a long time.

I don't think you will get much advantage using more than one solr 
install on a single machine - if you do, that should be addressed in the

code, even with RAID.

So I say, spread if you can. Faster indexing, faster search, easy to 
expand later. Distributed search is so easy with solr 1.3, you won't 
regret it. I think there is a bug to be addressed if you're needing this 
in a week though - in my experience, with distributed search, for every 
million docs on a machine beyond the first, you lose a doc in a search 
across all machines (i.e. 1 mil on machine 1, 1 million on machine 2, a 
*:* search will be missing 1 doc; 10 mil each on 3 machines, a *:* 
search will be missing 30). Not a big deal, but could be a concern for 
some with picky, "look at everything" customers.

- Mark

Ben Shlomo, Yatir wrote:
 Hi!

 I am already using solr 1.2 and happy with it.

 In a new project with a very tight deadline (10 development days from
 today) I need to setup a more ambitious system in terms of scale
 Here is the spec:

  

 * I need to index about 60,000,000
 documents 

 * Each document has 11 textual fields to be indexed & stored
 and 4 more fields to be stored only 

 * Most fields are short (2-14 characters) however 2 indexed
 fields can be up to 1KB and another stored field is up to 1KB 

 * On average every document is about 0.5 KB to be stored and
 0.4KB to be indexed 

 * The SLA for data freshness is a full nightly re-index (I
 cannot obtain incremental update/delete lists of the modified
 documents) 

 * The SLA for query time is 5 seconds 

 * the number of expected queries is 2-3 queries per second 

 * the queries are simply a combination of Boolean operations and
 name searches (no fancy fuzzy searches and Levenshtein distances, no
 faceting, etc) 

 * I have a 64 bit Dell 2950 4-cpu machine  (2 dual cores )
with
 RAID 10, 200 GB HD space, and 8GB RAM memory 

 * The documents are not given to me explicitly - I am given
 raw documents in RAM - one by one, from which I create my document in
 RAM,
 and then I can either http-post it to index it directly or append it to
 a tsv file for later indexing 

 * Each document has a unique ID

  

 I have a few directions I am thinking about

  

 The simple approach

 * Have one solr instance that will
index
 the entire document set (from files). I am afraid this will take too
 much time

  

 Direction 1

 * Create TSV files from all the
 documents - this will take around 3-4 hours 

 * Have all the documents partitioned
 into several subsets (how many should I choose? ) 

 * Have multiple solr instances on the
 same machine 

 * Let each solr instance concurrently
 index the appropriate subset 

 * At the end merge all the indices
using
 the IndexMergeTool - (how much time will it take ?)

  

 Direction 2

 * Like  the previous but instead of
 using the IndexMergeTool , use distributed search with shards
(upgrading
 to solr 1.3)

  

 Direction 3,4

 * Like previous directions only avoid
 using TSV files at all and directly index the documents from RAM

 Questions:

 * Which direction do you recommend in order to meet the SLAs in
 the fastest way?

Re: help required: how to design a large scale solr system

2008-09-24 Thread Martin Iwanowski

Hi,

I'm very new to search engines in general.
I've been using the Zend_Search_Lucene class before to try Lucene in  
general, and though it surely works it's not what I'm looking for  
performance-wise.


I recently installed Solr on a newly installed Ubuntu (Hardy Heron)  
machine.


I have about 207k docs (currently, and I'm getting about 100k each  
month from now on) and that's why I decided to throw myself into  
something real for once.


As I'm learning from today, I was wondering two main things.
I'm using Jetty as the Java container, and PHP5 to handle the search  
requests from an agent.


If I start Solr using java -jar start.jar in the example directory,  
everything works fine. I even manage to populate the index with the  
example data as documented in the tutorials.


How can I set up Solr to run as a service, so I don't need to have an  
SSH connection open?

Sorry for being stupid here btw.

I'm working to have a multi-lingual search. So a company (doc) exists  
in, say, Poland; what schema design should I read/work on to be able  
to write Poland/Polen/Polska (Poland in different languages) and still  
hit the same results? I have the data from geonames.org for this, but  
I can't really grasp how I should be working the schema.xml. The  
easiest solution would be to populate each document with each possible  
hit word, but this would give me a bunch of duplicates.


Yours,
Martin Iwanowski


Re: help required: how to design a large scale solr system

2008-09-24 Thread Mark Miller
Yes. You will def see a speed increase by avoiding http (especially 
doc-at-a-time http) and using the direct csv loader.


http://wiki.apache.org/solr/UpdateCSV
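
As a sketch, a TSV load through that handler is a single request (this
assumes remote streaming is enabled in solrconfig.xml; the file path is
illustrative, and %09 is the tab separator):

http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/data/docs.tsv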

- Mark

Ben Shlomo, Yatir wrote:

Thanks Mark!.
Do you have any comment regarding the performance differences between
indexing TSV files as opposed to directly indexing each document via
http post?

-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 2:12 PM

To: solr-user@lucene.apache.org
Subject: Re: help required: how to design a large scale solr system


 From my limited experience:

I think you might have a bit of trouble getting 60 mil docs on a single 
machine. Cached queries will probably still be *very* fast, but non 
cached queries are going to be very slow in many cases. Is that 5 
seconds for all queries? You will never meet that on first run queries 
with 60mil docs on that machine. The light query load might make things 
workable...but you're near the limits of a single machine (4 core or not) 
with 60 mil. You want to use a very good stopword list...common term 
queries will be killer. The docs being so small will be your only 
possible savior if you go the one machine route - that and cached hits. 
You don't have enough ram to get as much of the filesystem into RAM as 
youd like for 60 mil docs either.


I think you might try two machines with 30, 3 with 20, or 4 with 15. The

more you spread, even with slower machines, the faster you're likely to 
index, which, as you say, will take a long time for 60 mil docs (start 
today <g>). Multiple machines will help the indexing speed the most for 
sure - it's still going to take a long time.


I don't think you will get much advantage using more than one solr 
install on a single machine - if you do, that should be addressed in the


code, even with RAID.

So I say, spread if you can. Faster indexing, faster search, easy to 
expand later. Distributed search is so easy with solr 1.3, you won't 
regret it. I think there is a bug to be addressed if you're needing this 
in a week though - in my experience, with distributed search, for every 
million docs on a machine beyond the first, you lose a doc in a search 
across all machines (i.e. 1 mil on machine 1, 1 million on machine 2, a 
*:* search will be missing 1 doc; 10 mil each on 3 machines, a *:* 
search will be missing 30). Not a big deal, but could be a concern for 
some with picky, "look at everything" customers.


- Mark

Ben Shlomo, Yatir wrote:

Hi!

I am already using solr 1.2 and happy with it.

In a new project with a very tight deadline (10 development days from
today) I need to setup a more ambitious system in terms of scale
Here is the spec:

* I need to index about 60,000,000 documents

* Each document has 11 textual fields to be indexed & stored
and 4 more fields to be stored only

* Most fields are short (2-14 characters) however 2 indexed
fields can be up to 1KB and another stored field is up to 1KB

* On average every document is about 0.5 KB to be stored and
0.4KB to be indexed

* The SLA for data freshness is a full nightly re-index (I
cannot obtain incremental update/delete lists of the modified
documents)

* The SLA for query time is 5 seconds

* the number of expected queries is 2-3 queries per second

* the queries are simply a combination of Boolean operations and
name searches (no fancy fuzzy searches and Levenshtein distances, no
faceting, etc)

* I have a 64 bit Dell 2950 4-cpu machine (2 dual cores) with
RAID 10, 200 GB HD space, and 8GB RAM memory

* The documents are not given to me explicitly - I am given
raw documents in RAM - one by one, from which I create my document in
RAM,
and then I can either http-post it to index it directly or append it to
a tsv file for later indexing

* Each document has a unique ID

I have a few directions I am thinking about

The simple approach

* Have one solr instance that will index
the entire document set (from files). I am afraid this will take too
much time

Direction 1

* Create TSV files from all the
documents - this will take around 3-4 hours

* Have all the documents partitioned
into several subsets (how many should I choose?)

* Have multiple solr instances on the
same machine

* Let each solr instance concurrently
index the appropriate subset

* At the end merge all the indices using
the IndexMergeTool - (how much time will it take?)

Direction 2

* Like the previous but instead of
using the IndexMergeTool, use distributed search with shards (upgrading
to solr 1.3)

Re: help required: how to design a large scale solr system

2008-09-24 Thread Norberto Meijome
On Wed, 24 Sep 2008 07:46:57 -0400
Mark Miller [EMAIL PROTECTED] wrote:

 Yes. You will def see a speed increase by avoiding http (especially 
 doc-at-a-time http) and using the direct csv loader.
 
 http://wiki.apache.org/solr/UpdateCSV

and there's the obvious reason that if, for whatever reason, something breaks
while you are indexing directly from memory, can you restart the import? It
may be just easier to keep it on disk and keep track of where you are up to
in adding to the index...
B
_
{Beto|Norberto|Numard} Meijome

Sysadmins can't be sued for malpractice, but surgeons don't have to deal with
patients who install new versions of their own innards.

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: help required: how to design a large scale solr system

2008-09-24 Thread Mark Miller

Norberto Meijome wrote:

On Wed, 24 Sep 2008 07:46:57 -0400
Mark Miller [EMAIL PROTECTED] wrote:

  
Yes. You will def see a speed increase by avoiding http (especially 
doc-at-a-time http) and using the direct csv loader.


http://wiki.apache.org/solr/UpdateCSV



and there's the obvious reason that if, for whatever reason, something breaks
while you are indexing directly from memory, can you restart the import? It
may be just easier to keep it on disk and keep track of where you are up to
in adding to the index...
B
_
{Beto|Norberto|Numard} Meijome

Sysadmins can't be sued for malpractice, but surgeons don't have to deal with
patients who install new versions of their own innards.

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.
  
Nothing to stop you from breaking up the tsv/csv files into multiple 
tsv/csv files.


Re: help required: how to design a large scale solr system

2008-09-24 Thread Norberto Meijome
On Wed, 24 Sep 2008 11:45:34 -0400
Mark Miller [EMAIL PROTECTED] wrote:

 Nothing to stop you from breaking up the tsv/csv files into multiple 
 tsv/csv files.

Absolutely agreeing with you ... in one system where I implemented SOLR, I
have a process run through the file system and lazily pick up new files as they
come in... if something breaks (and it will, as the files are user generated in
many cases...), report it / leave it for later... move on. 

b

_
{Beto|Norberto|Numard} Meijome

I used to hate weddings; all the Grandmas would poke me and
say, You're next sonny! They stopped doing that when i
started to do it to them at funerals.

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: help required: how to design a large scale solr system

2008-09-24 Thread Otis Gospodnetic
Yatir,

I actually think you may be OK with a single machine for 60M docs, though.
You should be able to quickly do a test where you use SolrJ to post to Solr and 
get docs/second.

Use Solr 1.3.  Use 2-3 indexing threads going against a single Solr instance.  
Increase the buffer size param and increase mergeFactor slightly.  Then 
determine docs/sec and see if that's high enough for you.  I'll bet it will be 
enough, unless you have some crazy analyzers.
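
Those knobs live in solrconfig.xml; as a sketch, with illustrative values 
rather than recommendations:

 <indexDefaults>
   <ramBufferSizeMB>128</ramBufferSizeMB>
   <mergeFactor>20</mergeFactor>
 </indexDefaults>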

TSVs will be faster, but if it takes you 3-4 hours to assemble them every 
night, the overall time may not be shorter.

But this is just indexing.  You may want to copy the index to a different 
box(es) for searching, as you don't want the high indexing IO to affect 
searching.

Your QPS is low and 5 sec for query latency should give you plenty of room.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Ben Shlomo, Yatir [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Wednesday, September 24, 2008 2:50:54 AM
 Subject: help required: how to design a large scale solr system
 
 Hi!
 
 I am already using solr 1.2 and happy with it.
 
 In a new project with a very tight deadline (10 development days from
 today) I need to setup a more ambitious system in terms of scale
 Here is the spec:
 
 
 
 * I need to index about 60,000,000
 documents 
 
 * Each document has 11 textual fields to be indexed & stored
 and 4 more fields to be stored only 
 
 * Most fields are short (2-14 characters) however 2 indexed
 fields can be up to 1KB and another stored field is up to 1KB 
 
 * On average every document is about 0.5 KB to be stored and
 0.4KB to be indexed 
 
 * The SLA for data freshness is a full nightly re-index (I
 cannot obtain incremental update/delete lists of the modified
 documents) 
 
 * The SLA for query time is 5 seconds 
 
 * the number of expected queries is 2-3 queries per second 
 
 * the queries are simply a combination of Boolean operations and
 name searches (no fancy fuzzy searches and Levenshtein distances, no
 faceting, etc) 
 
 * I have a 64 bit Dell 2950 4-cpu machine  (2 dual cores ) with
 RAID 10, 200 GB HD space, and 8GB RAM memory 
 
 * The documents are not given to me explicitly - I am given
 raw documents in RAM - one by one, from which I create my document in
 RAM,
 and then I can either http-post it to index it directly or append it to
 a tsv file for later indexing 
 
 * Each document has a unique ID
 
 
 
 I have a few directions I am thinking about
 
 
 
 The simple approach
 
 * Have one solr instance that will index
 the entire document set (from files). I am afraid this will take too
 much time
 
 
 
 Direction 1
 
 * Create TSV files from all the
 documents - this will take around 3-4 hours 
 
 * Have all the documents partitioned
 into several subsets (how many should I choose? ) 
 
 * Have multiple solr instances on the
 same machine 
 
 * Let each solr instance concurrently
 index the appropriate subset 
 
 * At the end merge all the indices using
 the IndexMergeTool - (how much time will it take ?)
 
 
 
 Direction 2
 
 * Like  the previous but instead of
 using the IndexMergeTool , use distributed search with shards (upgrading
 to solr 1.3)
 
 
 
 Direction 3,4
 
 * Like previous directions only avoid
 using TSV files at all and directly index the documents from RAM
 
 Questions:
 
 * Which direction do you recommend in order to meet the SLAs in
 the fastest way? 
 
 * Since I have RAID on the machine, can I gain performance by
 using multiple solr instances on the same machine, or will only multiple
 machines help me? 
 
 * What's the minimal number of machines I should require (I
 might get more, weaker machines) 
 
 * How many concurrent indexers are recommended? 
 
 * Do you agree that the bottleneck is the indexing time?
 
 Any help is appreciated 
 
 Thanks in advance
 
 yatir



Re: help required: how to design a large scale solr system

2008-09-24 Thread Jon Drukman

Martin Iwanowski wrote:
How can I set up Solr to run as a service, so I don't need to have an SSH 
connection open?


The advice that I was given on this very list was to use daemontools.  I 
set it up and it is really great - starts when the machine boots, 
auto-restart on failures, easy to bring up/down on demand.  Search the 
archive for my post on the subject, I explained how to set it up in detail.


(I've also had success using launchd to manage Solr on Mac OS X in case 
anyone wants to try running it on their desktop.)


-jsd-



Re: Help required with external value source SOLR-351

2008-04-27 Thread Koji Sekiguchi

Howard,

I can think of two things:

1. double check the external_cpc file is in D:/solr1/data
   and post <commit/> to let solr read it.
2. DisMax query doesn't support job_id:4901708 _val_:cpc format
   for query string. Just try q=cpc and see explain.
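
(For item 1, a commit can be issued by posting the XML message <commit/>
to the update handler, e.g. http://localhost:8983/solr/update -- host and
port here are just the defaults, for illustration.)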

Thank you,

Koji

Howard Lee wrote:

Help required with external value source SOLR-351

I'm trying to get this new feature to work without much success. I've
completed the following steps.

1) downloaded latest nightly build
2) added the following to schema.xml
<fieldtype name="file" keyField="job_id" defVal="1" stored="false"
indexed="false" class="solr.ExternalFileField" valType="float"/>

and

<field name="cpc" type="file"/>

3) Created a file in the solr index folder - external_cpc with the
following entries
4901708=10
4901715=20

The ids correspond to job_id ids in the index.

when I run a query _val_:cpc the max score just corresponds to the defval 1.
It doesn't seem to be picking up anything from the external file.

from a query

job_id:4901708  _val_:cpc

In the explain I get

FunctionQuery(FileFloatSource(field=cpc,keyField=job_id,defVal=1.0,dataDir=D:/solr1/data/)),
product of:
1.0 = float(cpc{type=file,properties=})=1.0
1.0 = boost

what am I doing wrong?

Thanks

Howard