Taxonomy in SOLR

2011-01-24 Thread Damien Fontaine

Hi,

I am trying out Solr and I have one question. In the schema that I set up,
there are 10 fields that always hold the same data (hierarchical taxonomies),
but with 4 million documents the disk space and indexing time must be large.
I need these fields for autocomplete. Is there another way to do this kind of
operation?


Damien


Re: Taxonomy in SOLR

2011-01-24 Thread Em

Hi Damien,

Can you provide a schema sample plus example data?
Since your information is really general, I don't think anyone can give you
situation-specific advice.

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Taxonomy-in-SOLR-tp2317955p2318200.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Taxonomy in SOLR

2011-01-24 Thread Damien Fontaine

My schema:

<field name="id" type="string" indexed="true" stored="true" required="true" />

<!-- Document -->
<field name="lead" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="text" type="string" indexed="true" stored="true" required="true" />

<!-- taxo -->
<dynamicField name="*_taxon_label" type="string" indexed="true" stored="true" />
<dynamicField name="*_taxon_type" type="string" indexed="true" stored="true" />
<dynamicField name="*_taxon_hierarchy" type="string" indexed="true" stored="true"
              multiValued="true" />

<field name="type" type="string" indexed="true" stored="true" required="true" />


Re: Taxonomy in SOLR

2011-01-24 Thread Em

Hi Damien,

Why are you storing the taxonomies?
Faceting depends only on the indexed values. If there is a meaningful
difference between the indexed and the stored value, I would prefer to use an
RDBMS or something like that to reduce redundancy.

Does this help?

Regards


Re: Taxonomy in SOLR

2011-01-24 Thread Damien Fontaine

Yes, I am not obliged to store the taxonomies.

My taxonomies look like this:

english_taxon_label = Berlin
english_taxon_type = location
english_taxon_hierarchy = 0/world
  1/world/europe
  2/world/europe/germany
  3/world/europe/germany/berlin
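
For illustration, the depth-prefixed hierarchy values above can be generated
from a plain path. This is a minimal Python sketch; the function name
hierarchy_tokens is hypothetical and not part of Solr.

```python
def hierarchy_tokens(path):
    """Return one depth-prefixed token per level of the path.

    path is a list of level names, e.g. ["world", "europe"].
    """
    tokens = []
    for depth in range(len(path)):
        # Token format: "<depth>/<level0>/<level1>/..." as in the example data.
        tokens.append(str(depth) + "/" + "/".join(path[: depth + 1]))
    return tokens


if __name__ == "__main__":
    for token in hierarchy_tokens(["world", "europe", "germany", "berlin"]):
        print(token)
```

Each token would go into the multiValued *_taxon_hierarchy field, so a
document is reachable from every level of its path.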

I need *_taxon_hierarchy for faceting and *_taxon_label for autocomplete.

With an RDBMS, I have at most 100 entries per taxonomy, but with Solr and 4
million documents the redundancy is huge, no?

And I have 10 different taxonomies per document.

Damien


Re: Taxonomy in SOLR

2011-01-24 Thread Em

100 entries per taxon?
Well, with Solr you get 100 taxon entries * 4 million docs * 10 taxons.
If the indexed versions of your taxons look okay, you could skip the
DB overhead and do everything in Solr.


Re: Taxonomy in SOLR

2011-01-24 Thread Damien Fontaine

Thanks Em,

How can I calculate the indexing time, update time and disk space used by one
taxonomy?


Re: Taxonomy in SOLR

2011-01-24 Thread Em

Hi Damien,

well, the formula I wrote was no definitive guide, just some numbers I
combined to visualize the amount of data - perhaps not even a complete
formula.

If you can use your taxonomy as indexed-only, you do not double the disk
space used when you index two identical documents.

Lucene - and therefore Solr - works with an inverted index: this means
every indexed term is mapped to the documents that contain it.
So your index size will depend on the number of unique taxonomy terms and
the pointers from these terms to the documents. That's it. Usually the disk
space used by an index is much smaller than the size of the original data.

I hope what I tried to explain was easy to understand.

Regards


Re: Taxonomy in SOLR

2011-01-24 Thread Damien Fontaine



On 24/01/2011 13:10, Em wrote:

> If you can use your taxonomy as indexed-only, you do not double the disk
> space used when you index two identical documents.

So five documents, or 4 million, with the same taxonomy use the same disk
space as one?

> I hope what I tried to explain was easy to understand.

Thanks, it's very helpful!

Where can I find more explanation of the internal structure of the Lucene
indexer?


Damien


Re: Taxonomy in SOLR

2011-01-24 Thread Em

Just for illustration:

This is your original data:

doc1: hello world
doc2: hello damien
doc3: hello pal

Now, Lucene produces something like this from the input:
hello: id_doc1,id_doc2,id_doc3
world: id_doc1
damien: id_doc2
pal: id_doc3

Well, it's more complex, but enough for illustration.
As you can see, the representation of a document is completely different.
A document costs only a few bytes for a Lucene-internal id per word.

If words occur more than one time per document AND you do not store
termVectors, Lucene just adds the number of occurrences per word per doc to
its index:

hello: id_doc1[1],id_doc2[1],id_doc3[1]
world: id_doc1[1]
damien: id_doc2[1]
pal: id_doc3[1]

Imagine what happens to longer texts where especially stopwords or important
words occur more than one time.
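
The same idea as a minimal Python sketch. This is a toy model for
illustration only, not how Lucene actually lays out its index on disk; the
function name build_inverted_index is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; a real analyzer does much more.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


if __name__ == "__main__":
    docs = {"doc1": "hello world", "doc2": "hello damien", "doc3": "hello pal"}
    index = build_inverted_index(docs)
    for term in sorted(index):
        print(term, index[term])
```

Each unique term appears once as a key, however many documents contain it;
only the list of document ids grows.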

I would suggest starting with the Lucene wiki if you want to learn more
about Lucene.

Regards,
Em


Re: Taxonomy in SOLR

2011-01-24 Thread Erick Erickson

First, the redundancy is certainly there, but that's what Solr does: it handles
large amounts of data. 4 million documents is actually a pretty small corpus by
Solr standards, so you may well be able to do exactly what you propose with
acceptable performance/size. I'd advise just trying it with, say, 200,000 docs.

Why 200K? Because index growth is non-linear, with the first bunch of documents
taking up more space than the second. So index 100K, examine your indexes
and index 100K more. Now use the delta to extrapolate to 4M.
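
The extrapolation step could look like the following Python sketch. It
assumes growth is roughly linear after the first batch; the function name and
the sizes in the example (120 MB and 200 MB) are made up for illustration,
not measurements.

```python
def extrapolate_index_size(size_after_first, size_after_second, batch_docs, target_docs):
    """Estimate the final index size from two measurements.

    size_after_first:  index size after indexing batch_docs documents
    size_after_second: index size after indexing 2 * batch_docs documents
    The delta between the two is taken as the cost of every further batch.
    """
    delta = size_after_second - size_after_first
    remaining_batches = (target_docs - 2 * batch_docs) / batch_docs
    return size_after_second + delta * remaining_batches


if __name__ == "__main__":
    # Hypothetical sizes in MB after 100K and after 200K documents.
    estimate = extrapolate_index_size(120, 200, 100_000, 4_000_000)
    print(f"estimated index size at 4M docs: {estimate:.0f} MB")
```

Using the delta rather than the total size of the first batch avoids counting
the one-off cost of the first documents again for every later batch.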

You don't need to store the taxonomy in each doc for auto-complete; you can
get your auto-completion from a different index. Or you can index your
taxonomies in a special document in Solr and query the (unique) field in that
document for autocomplete.

For faceting, you do need taxonomies. But remember that the nature of the
inverted index is that unique terms are only stored once, and the document
ID for each document that that term appears in is recorded. So if you have
3/world/europe/germany/berlin stored in 1M documents, your index space is really
string length + overhead + space for 1M ids.
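
As a back-of-the-envelope Python sketch of that estimate: the 4 bytes per
document id and the 16 bytes of per-term overhead are assumptions chosen for
illustration, not Lucene's real figures. Lucene compresses its postings, so
the actual cost per id is typically lower.

```python
def estimate_term_bytes(term, doc_count, bytes_per_doc_id=4, overhead_bytes=16):
    """Rough cost of one unique term: string length + overhead + doc ids."""
    return len(term.encode("utf-8")) + overhead_bytes + doc_count * bytes_per_doc_id


def naive_copy_bytes(term, doc_count):
    """Cost if the full string were written once per document instead."""
    return len(term.encode("utf-8")) * doc_count


if __name__ == "__main__":
    term = "3/world/europe/germany/berlin"
    print("inverted index estimate:", estimate_term_bytes(term, 1_000_000), "bytes")
    print("one copy per document:  ", naive_copy_bytes(term, 1_000_000), "bytes")
```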

Best
Erick


Re: Taxonomy in SOLR

2011-01-24 Thread Damien Fontaine

Thanks Em and Erick for your answers,

Now I understand better how Solr works.

Damien


Re: Taxonomy in SOLR

2011-01-24 Thread Em

Hi Erick,

In some use cases I really think that your suggestion of using some
unique documents for meta-information is a good approach to solve some
issues.
However, there is a hurdle for me and maybe you can help me to clear it:

What is the best way to get such meta-data?
I see three possible approaches:
1st: get it in another request
2nd: get it with a RequestHandler
3rd: get it with a SearchComponent

I think the 2nd and 3rd are the cleanest ways.
But to decide between them I run into two problems:
RequestHandler: Should I extend the StandardRequestHandler to do what I
need? If so, I could just query my index for the needed information and add
it to the request before I pass it on to the SearchComponents.

SearchComponent: The problem with the SearchComponent is distributed search
and how to test it. However, if this is the cleanest way to go,
one should take it.

What would you do if you wanted to add some meta-information to your request
that was not given by the user?

Regards,
Em


Re: Taxonomy in SOLR

2011-01-24 Thread Erick Erickson

I wasn't thinking about this for adding information to the *request*. Rather,
in this case the autocomplete uses an Ajax call that just uses the
TermsComponent to get the autocomplete data and display it. This is just
textual, so adding it to the
request is client-side magic.
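
A minimal Python sketch of building such a TermsComponent request. The host,
port and the /terms handler path are assumptions (the handler has to be
registered in solrconfig.xml), and terms_autocomplete_url is a hypothetical
helper; terms.fl, terms.prefix and terms.limit are the TermsComponent
parameters.

```python
from urllib.parse import urlencode


def terms_autocomplete_url(base_url, field, prefix, limit=10):
    """Build a TermsComponent URL returning indexed terms that start with prefix."""
    params = {
        "terms": "true",
        "terms.fl": field,        # field whose indexed terms are listed
        "terms.prefix": prefix,   # what the user has typed so far
        "terms.limit": limit,     # maximum number of suggestions
        "wt": "json",
    }
    return base_url + "/terms?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance.
    print(terms_autocomplete_url("http://localhost:8983/solr", "english_taxon_label", "Ber"))
```

Because english_taxon_label is a string field, the prefix match is
case-sensitive and applies to the whole label.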

If you want your app to have access to the meta-data for other purposes, you'd
just query and cache it from the app. You could use that to build up the links
you embed in the page for new queries if you chose, no custom handlers
necessary.

Otherwise, I guess you'd create a custom request handler; that seems like a
reasonable place.

Best
Erick


Re: Taxonomy in SOLR

2011-01-24 Thread Em

Thank you for the advice, Erick!

I will take a look at extending the StandardRequestHandler for such
use cases.


Re: Taxonomy in SOLR

2011-01-24 Thread Jonathan Rochkind

There aren't any great general-purpose, out-of-the-box ways to handle
hierarchical data in Solr.  Solr isn't an rdbms.


There may be some particular advice on how to set up a particular Solr
index to answer particular questions with regard to hierarchical data.


I saw a great point made recently comparing rdbms to NoSQL stores, which 
applied to Solr too even though Solr is NOT a NoSQL store.  In rdbms, 
you set up your schema thinking only about your _data_, and modelling 
your data as flexibly as possible. Then once you've done that, you can 
ask pretty much any well-specified question you want of your data, and 
get a correct and reasonably performant answer.


In Solr, on the other hand, we set up our schemas to answer particular 
questions. You have to first figure out what kinds of questions you will 
want to ask Solr, what kinds of queries you'll want to make, and then 
you can figure out how to structure your data to ask those questions.  
Some questions are actually very hard to set up Solr to answer -- in 
general Solr is about setting up your data so whatever question you have 
can be reduced to asking "is token X in field Y?".
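
For the taxonomy discussed earlier in this thread, that reduction could look
like the following Python sketch. It assumes the depth-prefixed tokens from
Damien's example (such as 1/world/europe) are indexed in a multiValued string
field; drilldown_params is a hypothetical helper, while fq, facet.field and
facet.prefix are standard Solr request parameters.

```python
from urllib.parse import urlencode


def drilldown_params(field, selected_token):
    """Parameters to restrict results to one node and facet on its children.

    selected_token is a depth-prefixed token such as "1/world/europe".
    """
    depth_text, path = selected_token.split("/", 1)
    # Children of the node live one level deeper under the same path.
    child_prefix = str(int(depth_text) + 1) + "/" + path + "/"
    return [
        ("q", "*:*"),
        ("fq", field + ':"' + selected_token + '"'),  # is token X in field Y?
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", child_prefix),
    ]


if __name__ == "__main__":
    params = drilldown_params("english_taxon_hierarchy", "1/world/europe")
    print(urlencode(params))
```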


This can be especially tricky in cases where you want to use a single 
Solr index to answer multiple questions, where the questions are such 
that you really need to set up your data _differently_ to get Solr to 
optimally answer each question.


Solr is not a general purpose store like an rdbms, where you can set up 
your schema once in terms of your data and use it to answer nearly any 
conceivable well-specified question after that.  Instead, Solr does 
things that rdbms can't do quickly or can't do at all.  But you lose 
some things too.

