Re: Category Hierarchy on Dynamic Fields - Solr 4.10

2015-07-06 Thread Mike L.

Hey Erick,
Thanks for the response. I haven't used PathHierarchyTokenizerFactory before...
Seems with this approach, I'd flatten all the categories pertaining to a 
product into one Solr field... I guess using facet.prefix I won't lose the
granularity I need to facet on individual categories and can maintain the
hierarchy.
I'll take a look at it.
Thanks!

   From: Erick Erickson 
 To: solr-user@lucene.apache.org; Mike L.  
 Sent: Monday, July 6, 2015 12:42 PM
 Subject: Re: Category Hierarchy on Dynamic Fields - Solr 4.10
   
Hmmm, probably missing something here, but have you looked at
PathHierarchyTokenizerFactory?

In essence, it indexes each sub-path as a token, which makes lots
of faceting tasks easier. For example,
lev1/lev2/lev3
gets tokenized as three tokens:
lev1/lev2/lev3
lev1/lev2
lev1
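
As a rough sketch (the names and attributes below are made up and would need
adjusting for your schema), such a field might be declared along these lines:

  <fieldType name="category_path" class="solr.TextField">
    <analyzer type="index">
      <!-- emits lev1, lev1/lev2, lev1/lev2/lev3 for an input of lev1/lev2/lev3 -->
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
    <analyzer type="query">
      <!-- keep the query string as one token so prefix-style matching works -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="categories" type="category_path" indexed="true" stored="true" multiValued="true"/>

Faceting on it with facet.field=categories (optionally facet.prefix=lev1/lev2 to
drill into one branch) would then count each level separately.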

If that works, I'm not quite sure why you need dynamic fields here, but
then I only skimmed.

Best,
Erick



On Mon, Jul 6, 2015 at 10:33 AM, Mike L.  wrote:
>
> Solr User Group -
>    Was wondering if anybody had any suggestions/best practices around a
> requirement for storing a dynamic category structure that needs to support
> faceting on individual categories while maintaining its hierarchy.
> Some context:
>
> A product could belong to an undetermined number of product categories that
> form a logical hierarchy. In other words, depending on the vendor being
> used, each product could have anywhere from 1 to N levels of product
> categories. These categories have a hierarchy that needs to be maintained
> so that the website can drill down through those category facets to find
> the appropriate product.
>
> Examples:
> Product A is associated to: Category, Subcategory 1, Subcategory 2, Subcategory 3, Subcategory 4
> Product B is associated to: Category, Subcategory 1, Subcategory 2
> Product C is associated to: Category, Subcategory 1
> etc.
> If the category structure was a bit more predictable, it would be easy to 
> load the data into Solr and understand from a UI perspective how best to 
> create a faceted hierarchy.
>
> However, because this category structure is dynamic and different for each
> product, I'm trying to plan the best course of action for how to manage this
> data in Solr, provide category facets, and guide the UI on how best to query
> the data in the appropriate hierarchy.
> I'm thinking of loading the category structure using dynamic fields, but are
> there any good approaches for faceting and deriving a hierarchy on those
> dynamic fields? Or other thoughts around this?
>
> Hope that makes sense.
> Thanks,

  

Category Hierarchy on Dynamic Fields - Solr 4.10

2015-07-06 Thread Mike L.

Solr User Group -
    Was wondering if anybody had any suggestions/best practices around a
requirement for storing a dynamic category structure that needs to support
faceting on individual categories while maintaining its hierarchy.
Some context:

A product could belong to an undetermined number of product categories that
form a logical hierarchy. In other words, depending on the vendor being used,
each product could have anywhere from 1 to N levels of product categories.
These categories have a hierarchy that needs to be maintained so that the
website can drill down through those category facets to find the appropriate
product.

Examples:
Product A is associated to: Category, Subcategory 1, Subcategory 2, Subcategory 3, Subcategory 4
Product B is associated to: Category, Subcategory 1, Subcategory 2
Product C is associated to: Category, Subcategory 1
etc.
If the category structure was a bit more predictable, it would be easy to load 
the data into Solr and understand from a UI perspective how best to create a 
faceted hierarchy. 

However, because this category structure is dynamic and different for each
product, I'm trying to plan the best course of action for how to manage this
data in Solr, provide category facets, and guide the UI on how best to query
the data in the appropriate hierarchy.
I'm thinking of loading the category structure using dynamic fields, but are
there any good approaches for faceting and deriving a hierarchy on those
dynamic fields? Or other thoughts around this?

Hope that makes sense.
Thanks,


Re: Bq Question - Solr 4.10

2015-04-11 Thread Mike L.

Thanks Jack. I'll give that a whirl.
  From: Jack Krupansky 
 To: solr-user@lucene.apache.org; Mike L.  
 Sent: Saturday, April 11, 2015 12:04 PM
 Subject: Re: Bq Question - Solr 4.10
   
It all depends on what you want your scores to look like. Or do you care at all 
what the scores look like?
Here's one strategy... Divide the score space into two halves, the upper half
for a preferred manufacturer and the lower half for non-preferred
manufacturers. Step 1: Add 1.0 to the raw Lucene score (bf parameter) if the
document is a preferred manufacturer. Step 2: Divide the resulting score by 2
(boost parameter).
So if two documents had the same score, say 0.7, the preferred manufacturer 
would get a score of (1+0.7)/2 = 1.7/2 = 0.85, while the non-preferred 
manufacturer would get a score of 0.7/2 = 0.35.
IOW, apply an additive boost of 1.0 and then a multiplicative boost of 0.5.
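
As a sketch with edismax (the handler, field, and manufacturer name below are
made up):

  q=some+keywords&defType=edismax&qf=name^2 description
    &bf=if(termfreq(manufacturer,'AcmeCorp'),1,0)
    &boost=0.5

bf adds the function value (1 for the preferred manufacturer, 0 otherwise) to the
raw score, and boost then multiplies the whole thing by 0.5.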

-- Jack Krupansky


On Sat, Apr 11, 2015 at 12:28 PM, Mike L.  wrote:

 Hello -
    I have qf boosting setup and that works well and balanced across different 
fields.
However, I have a requirement that if a particular manufacturer is part of the
returned matched documents (say the top 20 results), all those matched docs from
that manufacturer should be bumped to the top of the result list.
   From a scoring perspective - when I look at those manufacturer docs vs the 
ones scored higher - there is a big difference there, because the keywords 
searched are much more relevant on other docs.
I'm a bit concerned with the idea of applying an enormous bq boost for that
particular manufacturer to bump up those docs - but I suspect it would work.
On the flip side, I considered using elevate, but there are thousands of
documents I would have to account for and hard-code those doc ids.
Is using bq the best approach or is there a better solution to this?
Thanks,
Mike



  

Bq Question - Solr 4.10

2015-04-11 Thread Mike L.
 Hello -
    I have qf boosting setup and that works well and balanced across different 
fields. 
However, I have a requirement that if a particular manufacturer is part of the
returned matched documents (say the top 20 results), all those matched docs from
that manufacturer should be bumped to the top of the result list.
   From a scoring perspective - when I look at those manufacturer docs vs the 
ones scored higher - there is a big difference there, because the keywords 
searched are much more relevant on other docs. 
I'm a bit concerned with the idea of applying an enormous bq boost for that
particular manufacturer to bump up those docs - but I suspect it would work.
On the flip side, I considered using elevate, but there are thousands of
documents I would have to account for and hard-code those doc ids.
Is using bq the best approach or is there a better solution to this?
Thanks,
Mike

Re: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File

2015-04-07 Thread Mike L.

Typo:   *even when the user delimits with a space. (e.g. base ball should find 
baseball). 

Thanks,
  From: Mike L. 
 To: "solr-user@lucene.apache.org"  
 Sent: Tuesday, April 7, 2015 9:05 AM
 Subject: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words 
File
   

Solr User Group -

   I have a case where I need to be able to search against compound words, even 
when the user delimits with a space. (e.g. baseball => base ball).  I think 
I've solved this by creating a compound-words dictionary file containing the 
split words that I would want DictionaryCompoundWordTokenFilterFactory to split.
  base
  ball
I also applied the following rule in the synonym file: baseball => base ball
(to allow baseball to also get a hit)
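
As a rough sketch of how those two pieces might sit together in one analyzer
chain (the file names and size limits here are placeholders, not a tested config):

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- splits dictionary words out of compounds: "baseball" also yields "base" and "ball" -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="compound-words.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="false"/>
    <!-- applies rules like "baseball => base ball" from the synonym file -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>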
      
  
Two questions - If I could figure out in advance all the compound words I would
want to split, would it be better (more reliable results) for me to maintain
this compound-words file, or would it be better to throw one of those OpenOffice
dictionaries at the filter?
Also - Any better suggestions for dealing with this problem versus the approach I
described using both the dictionary filter and the synonym rule?
Thanks in advance!
Mike



  

DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File

2015-04-07 Thread Mike L.

Solr User Group -

   I have a case where I need to be able to search against compound words, even 
when the user delimits with a space. (e.g. baseball => base ball).  I think 
I've solved this by creating a compound-words dictionary file containing the 
split words that I would want DictionaryCompoundWordTokenFilterFactory to split.
  base
  ball
I also applied the following rule in the synonym file: baseball => base ball
(to allow baseball to also get a hit)
      
  
Two questions - If I could figure out in advance all the compound words I would
want to split, would it be better (more reliable results) for me to maintain
this compound-words file, or would it be better to throw one of those OpenOffice
dictionaries at the filter?
Also - Any better suggestions for dealing with this problem versus the approach I
described using both the dictionary filter and the synonym rule?
Thanks in advance!
Mike



Re: WordDelimiterFilterFactory - tokenizer question

2015-04-05 Thread Mike L.

Thanks Jack! That was an oversight on my end - I had assumed that
splitOnNumerics="1" and the LowerCaseFilterFactory would be breaking out the
tokens. I tried again with generateWordParts="1" and generateNumberParts="1" and it
seemed to work. Appreciate it.
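
For reference, a sketch of that filter line with those flags turned on (the other
attributes shown are just one plausible combination, not the exact config from
this thread):

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          preserveOriginal="1"/>

With that, a value like US100A is indexed as the parts "US", "100", "A" plus the
original token, so queries for "US100" or "US 100" can match.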

Mike

  From: Jack Krupansky 
 To: solr-user@lucene.apache.org; Mike L.  
 Sent: Sunday, April 5, 2015 8:23 AM
 Subject: Re: WordDelimiterFilterFactory - tokenizer question
   
You have to tell the filter what types of tokens to generate - words, numbers. 
You told it to generate... nothing. You did tell it to preserve the original, 
unfiltered token though, which is fine.
-- Jack Krupansky


On Sun, Apr 5, 2015 at 3:39 AM, Mike L.  wrote:

Solr User Group,
    I have a non-multivalued field which contains stored values similar to this:

US100A, US100B, US100C, US100-D, US100BBA
My assumption is - if I tokenized with the below fieldType definition,
specifically the WDF splitOnNumerics and the LowerCaseFilterFactory would have
provided me Solr matches on the following query words:
?q=US 100
?q=US100
across those field values. In other words, all of US100A, US100B, US100C, US100-D
would have matched and scored against my qf weights. However - I'm not seeing
that sort of behavior, have tried various combinations, and am starting to
question my assumptions about the tokenizer.

Ideally - I would like to return all values (US100A, US100B, US100C, US100-D) 
when for example, q=US100A is searched on this field.

I know I should probably provide the debugQuery results, but was hoping this
was a quick hit for somebody, and also I'm reindexing.
WordDelimiterFilterFactory doesn't seem to be working as expected. Hoping to
get some clarification, or to see if something sticks out here.

Below is the field type definition being used:
 
   
    
  
 
 
    
   

  
    
  
 
 
 
 
    


Thanks in advance.
Mike








  

WordDelimiterFilterFactory - tokenizer question

2015-04-05 Thread Mike L.
Solr User Group,
    I have a non-multivalued field which contains stored values similar to this:

US100A, US100B, US100C, US100-D, US100BBA
My assumption is - if I tokenized with the below fieldType definition,
specifically the WDF splitOnNumerics and the LowerCaseFilterFactory would have
provided me Solr matches on the following query words:
?q=US 100
?q=US100
across those field values. In other words, all of US100A, US100B, US100C, US100-D
would have matched and scored against my qf weights. However - I'm not seeing
that sort of behavior, have tried various combinations, and am starting to
question my assumptions about the tokenizer.

Ideally - I would like to return all values (US100A, US100B, US100C, US100-D) 
when for example, q=US100A is searched on this field. 

I know I should probably provide the debugQuery results, but was hoping this
was a quick hit for somebody, and also I'm reindexing.
WordDelimiterFilterFactory doesn't seem to be working as expected. Hoping to
get some clarification, or to see if something sticks out here.

Below is the field type definition being used:
 
   
    
  
 
 
    
   
 
  
    
  
 
 
 
 
    


Thanks in advance.
Mike






Re: Max Limit to Schema Fields - Solr 4.X

2014-02-09 Thread Mike L.

Appreciate all the support and I'll give it a whirl. Cheers!

Sent from my iPhone

> On Feb 8, 2014, at 4:25 PM, Shawn Heisey  wrote:
> 
>> On 2/8/2014 12:12 PM, Mike L. wrote:
>> I'm going to try loading all 3000 fields in the schema and see how that goes.
>> Only concern is doing boolean searches and whether or not I'll run into URL
>> length issues, but I guess I'll find out soon.
> 
> It will likely work without a problem.  As already mentioned, you may
> need to increase maxBooleanClauses in solrconfig.xml beyond the default
> of 1024.
> 
> The max URL size is configurable with any decent servlet container,
> including the jetty that comes with the Solr example.  In the part of
> the jetty config that adds the connector, this increases the max HTTP
> header size to 32K, and the size for the entire HTTP buffer to 64K.
> These may not be big enough with 3000 fields, but it gives you the
> general idea:
> 
>32768
>65536
> 
> Another option is to use a POST request instead of a GET request with
> the parameters in the posted body.  The default POST buffer size in
> Jetty is 200K.  In newer versions of Solr, the limit is actually set by
> Solr, not the servlet container, and defaults to 2MB.  I believe that if
> you are using SolrJ, it uses POST requests by default.
> 
> Thanks,
> Shawn
> 
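
For reference, that setting sits in the <query> section of solrconfig.xml; a
minimal sketch (the value here is only an illustration):

  <maxBooleanClauses>4096</maxBooleanClauses>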


Re: Max Limit to Schema Fields - Solr 4.X

2014-02-08 Thread Mike L.
That was the original plan. 

However it's important to preserve the originating field that loaded the value.
The data is very fine and granular, and each field stores a particular value.
When searching the data against Solr, it would be important to know which docs
contain that particular data from that particular field (fielda:value,
fieldb:value), whereas searching field:value would hide which originating field
loaded the value.

I'm going to try loading all 3000 fields in the schema and see how that goes.
Only concern is doing boolean searches and whether or not I'll run into URL
length issues, but I guess I'll find out soon.

Thanks again!

Sent from my iPhone

> On Feb 6, 2014, at 1:02 PM, Erick Erickson  wrote:
> 
> Sometimes you can spoof the many-fields problem by using prefixes on the
> data. Rather than fielda, fieldb... have one field and index values like
> fielda_value, fieldb_value into that single field. Then do the right thing
> when searching. Watch tokenization though.
> 
> Best
> Erick
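
As an illustration of that prefixing idea (the field and value names below are
made up):

  instead of separate fields:   color:red          size:large
  one combined field:           attrs:color_red    attrs:size_large
  query:                        fq=attrs:color_red

using a string or otherwise un-split field type so the prefixed tokens stay intact.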
>> On Feb 5, 2014 4:59 AM, "Mike L."  wrote:
>> 
>> 
>> Thanks Shawn. This is good to know.
>> 
>> 
>> Sent from my iPhone
>> 
>>>> On Feb 5, 2014, at 12:53 AM, Shawn Heisey  wrote:
>>>> 
>>>> On 2/4/2014 8:00 PM, Mike L. wrote:
>>>> I'm just wondering here if there is any defined limit to how many
>> fields can be created within a schema? I'm sure the configuration
>> maintenance of a schema like this would be a nightmare, but would like to
>> know if it's at all possible in the first place before it is attempted.
>>> 
>>> There are no hard limits on the number of fields, whether they are
>>> dynamically defined or not. Several thousand fields should be no
>>> problem.  If you have enough system resources and you don't run into an
>>> unlikely bug, there's no reason it won't work.  As you've already been
>>> told, there are potential performance concerns.  Depending on the exact
>>> nature of your queries, you might need to increase maxBooleanClauses.
>>> 
>>> The only hard limitation that Lucene really has (and by extension, Solr
>>> also has that limitation) is that a single index cannot have more than
>>> about two billion documents in it - the inherent limitation on a Java
>>> "int" type.  Solr can use indexes larger than this through sharding.
>>> 
>>> See the very end of this page:
>> https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/codecs/lucene46/package-summary.html#Limitations
>>> 
>>> Thanks,
>>> Shawn
>> 


Re: Max Limit to Schema Fields - Solr 4.X

2014-02-05 Thread Mike L.

Thanks Shawn. This is good to know. 


Sent from my iPhone

> On Feb 5, 2014, at 12:53 AM, Shawn Heisey  wrote:
> 
>> On 2/4/2014 8:00 PM, Mike L. wrote:
>> I'm just wondering here if there is any defined limit to how many fields can 
>> be created within a schema? I'm sure the configuration maintenance of a 
>> schema like this would be a nightmare, but would like to know if it's at all
>> possible in the first place before it is attempted.
> 
> There are no hard limits on the number of fields, whether they are
> dynamically defined or not. Several thousand fields should be no
> problem.  If you have enough system resources and you don't run into an
> unlikely bug, there's no reason it won't work.  As you've already been
> told, there are potential performance concerns.  Depending on the exact
> nature of your queries, you might need to increase maxBooleanClauses.
> 
> The only hard limitation that Lucene really has (and by extension, Solr
> also has that limitation) is that a single index cannot have more than
> about two billion documents in it - the inherent limitation on a Java
> "int" type.  Solr can use indexes larger than this through sharding.
> 
> See the very end of this page:
> 
> https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/codecs/lucene46/package-summary.html#Limitations
> 
> Thanks,
> Shawn
> 


Re: Max Limit to Schema Fields - Solr 4.X

2014-02-04 Thread Mike L.

Hey Jack - 

Two types of queries:

A) Return all docs that have a match for a particular value from a particular
field (fq=fieldname:value). Because of this I feel I'm tied to defining all the
fields. No particular field matters more than another; it depends on the search
context, so it's hard to predict common searches.

B) Return all docs that have a particular value in one or more fields
(a small subset of the 3000).

I've been a bit spoiled with Solr, being used to response times of less than 50ms,
but in this case search does not have to be fast. Also, the total index size would
be less than 1GB, with less than 1M total docs.
 
-Mike

Sent from my iPhone

> On Feb 4, 2014, at 10:38 PM, "Jack Krupansky"  wrote:
> 
> What will your queries be like? Will it be okay if they are relatively slow? 
> I mean, how many of those 100 fields will you need to use in a typical (95th 
> percentile) query?
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Mike L.
> Sent: Tuesday, February 4, 2014 10:00 PM
> To: solr-user@lucene.apache.org
> Subject: Max Limit to Schema Fields - Solr 4.X
> 
> 
> solr user group -
> 
>   I'm afraid I may have a scenario where I might need to define a few 
> thousand fields in Solr. The context here is, this type of data is extremely 
> granular and unfortunately cannot be grouped into logical groupings or 
> aggregate fields because there is a need to know which granular field 
> contains the data, and that field needs to be searchable.
> 
> With that said, I expect each doc to contain no more than 100 fields with
> loaded data at a given time. It's just not clear, of the few thousand fields
> created, which ones will have the data pertaining to that doc.
> 
> I'm just wondering here if there is any defined limit to how many fields can 
> be created within a schema? I'm sure the configuration maintenance of a 
> schema like this would be a nightmare, but would like to know if it's at all
> possible in the first place before it is attempted.
> 
> Thanks in advance -
> Mike 


Max Limit to Schema Fields - Solr 4.X

2014-02-04 Thread Mike L.
 
solr user group -
 
    I'm afraid I may have a scenario where I might need to define a few 
thousand fields in Solr. The context here is, this type of data is extremely 
granular and unfortunately cannot be grouped into logical groupings or 
aggregate fields because there is a need to know which granular field contains 
the data, and that field needs to be searchable.
 
 With that said, I expect each doc to contain no more than 100 fields with
loaded data at a given time. It's just not clear, of the few thousand fields
created, which ones will have the data pertaining to that doc.
 
I'm just wondering here if there is any defined limit to how many fields can be 
created within a schema? I'm sure the configuration maintenance of a schema 
like this would be a nightmare, but would like to know if it's at all possible
in the first place before it is attempted.
 
Thanks in advance -
Mike

Re: ContributorsGroup

2013-09-26 Thread Mike L.
 
Ah, sorry! It's: mikelabib
 
thanks!

From: Stefan Matheis 
To: solr-user@lucene.apache.org 
Sent: Thursday, September 26, 2013 12:05 PM
Subject: Re: ContributorsGroup


Mike

To add you as a Contributor, I'd need to know your username. :)

Stefan 


On Thursday, September 26, 2013 at 6:50 PM, Mike L. wrote:

>  
> Solr Admins,
>  
>      I've been using Solr for the last couple of years and would like to
> contribute to this awesome project. Can I be added to the ContributorsGroup,
> with access to update the Wiki as well?
>  
> Thanks in advance.
>  
> Mike L.
> 
> 

ContributorsGroup

2013-09-26 Thread Mike L.
 
Solr Admins,
 
 I've been using Solr for the last couple of years and would like to
contribute to this awesome project. Can I be added to the ContributorsGroup,
with access to update the Wiki as well?
 
Thanks in advance.
 
Mike L.

Re: Solr 4.4 Import from CSV to Multi-value field - Adds quote on last value

2013-09-25 Thread Mike L.
 
Never mind, I figured it out. Excel was applying a hidden quote to the data.
Thanks anyway.

From: Mike L. 
To: "solr-user@lucene.apache.org"  
Sent: Wednesday, September 25, 2013 11:32 AM
Subject: Solr 4.4 Import from CSV to Multi-value field - Adds quote on last 
value


 
Solr Family,
 
    I'm a Solr 3.6 user who just pulled down 4.4 yesterday and noticed 
something a bit odd when importing into a multi-valued field. I wouldn't be 
surprised if there's a user-error on my end but hopefully there isn't a bug. 
Here's the situation.
 
I created some test data to import and one field needs to be split into a 
multi-valued field. This data resides within a .csv file and is structured like 
the following: 
 
(below are replacement field names. Also note - there are no quotes " within 
the data.)
 
field1|field2|field3|field4_valueA,field4_valueB,field4_valueC
 http://[myserver]/solr/[my 
corename]/update?commit=true&separator=|&escape=\&stream.file=[location of 
file]&fieldnames=field1,field2,field3,field4&optimize=true&stream.contentType=application/csv&f.field4.split=true&f.field4.separator=%2C
 
After importing the data, I see results similar to the below for the
multi-valued field, field4:
 

field4_valueA
field4_valueB
field4_valueC"  (Why is there a trailing quote here?) 

 
I also noticed that if only 1 value is being inserted into this multi-valued
field, there is no issue. It always happens on the last value.
 
Thanks in advance,
Cheers!
Mike

Solr 4.4 Import from CSV to Multi-value field - Adds quote on last value

2013-09-25 Thread Mike L.
 
Solr Family,
 
    I'm a Solr 3.6 user who just pulled down 4.4 yesterday and noticed 
something a bit odd when importing into a multi-valued field. I wouldn't be 
surprised if there's a user-error on my end but hopefully there isn't a bug. 
Here's the situation.
 
I created some test data to import and one field needs to be split into a 
multi-valued field. This data resides within a .csv file and is structured like 
the following: 
 
(below are replacement field names. Also note - there are no quotes " within 
the data.)
 
field1|field2|field3|field4_valueA,field4_valueB,field4_valueC
 http://[myserver]/solr/[my 
corename]/update?commit=true&separator=|&escape=\&stream.file=[location of 
file]&fieldnames=field1,field2,field3,field4&optimize=true&stream.contentType=application/csv&f.field4.split=true&f.field4.separator=%2C
 
After importing the data, I see results similar to the below for the
multi-valued field, field4:
 

field4_valueA
field4_valueB
field4_valueC"  (Why is there a trailing quote here?) 

 
I also noticed that if only 1 value is being inserted into this multi-valued
field, there is no issue. It always happens on the last value.
 
Thanks in advance,
Cheers!
Mike

Returning Hierarchical / Data Relationships In Solr 3.6 (or Solr 4 via Solr Join)

2013-07-17 Thread Mike L.
 
Solr User Group,
 
 I would like to return a hierarchical data relationship when somebody
queries for a parent doc in Solr. This sort of relationship doesn't currently
exist in our core, as the use case has been to search for a specific document
only. However, here's roughly an example of what's being asked (not the same
kind of relationship, but a similar concept; there will always be only 1 parent
to many children):
 
A user searches for a parent value and also gets child docs as part of the
response (not child names as multi-valued fields).

For example, say: select?qt=parentvaluesearch&q=[parentValue]
 
 
    
  1 
  [parentValue]  
  parent 
  John 
  Doe 
  M 
  
    
  2 
  child 
  Chris 
  Doe 
  M 
  
    
  3 
  child 
  Stacy 
  Doe 
  F 
  

 
At first I was thinking I could just add a field within each child doc to
represent the parentValue; however, this family relationship is a bit more
complex, as children can be associated with many different parents (parent
docs), so I don't want to tie the relationship off the child. On the flip side,
it seems I could have a multi-valued field with all the child names within the
parent doc and then re-query the core for the child docs and append them to the
response... the caveat there is this parent may have a few hundred children,
and I'm not sure a multi-valued field would make sense to store the child
references... also, this approach would dramatically increase the response time
from 20ms on average to ~4sec, assuming a parent has 200 children.
 
Has anybody solved a similar issue, or have thoughts on the best way to tackle
this with version 3.6? Also, could the Solr Joins introduced in 4.X address this
issue? (Not too familiar with them, but they seem to be related.)
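
As a rough sketch of the 4.x join idea (the field names here are hypothetical, and
the join only returns the joined-to side, not both parent and children at once):

  q={!join from=id to=parent_id}id:1

which would return the child docs whose parent_id matches the id of the matched
parent; the parent doc itself would still need a separate query.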
 
Thanks in advance!
Mike

Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5

2013-07-03 Thread Mike L.
Hey Shawn / Solr User Group,
 
This makes perfect sense to me. Thanks for the thorough answer.  
 "The CSV update handler works at a lower level than the DataImport 
handler, and doesn't 
have "clean" or "full-import" options, which defaults to clean=true. The DIH is 
like a full application embedded inside Solr, one that uses 
an update handler -- it is not itself an update handler.  When clean=true or 
using full-import without a clean option, DIH itself sends 
a "delete all documents" update request."
 
And similarly, my assumption is that in the event of a non-syntactical
failure/interruption (such as a server crash) during the CSV update, a rollback
(stream.body=<rollback/>) would also need to be manually requested (or automated,
but outside of Solr), whereas the DIH automates this request on my behalf as
well...? Is there any way to detect this failure or interruption?... A real
example is, I was in the process of indexing data via the CSV update and
somebody bounced the server before it completed. No actual errors were produced,
but it appeared that the CSV update process stopped at the point of the reboot.
My assumption is, if I had passed in a rollback, I'd get the previously indexed
data, given I didn't request a delete beforehand (haven't yet tested this). But
I'm wondering how I could automatically detect this? This, I guess, is where DIH
starts gaining some merit. Also - the response that the DIH produces when the
indexing process is complete appears to be a lot more mature, in that it
explicitly suggests the index completed and that the information can be
re-queried. It would be nice if the CSV update provided a similar response... my
assumption is it would first need to know how many lines exist in the file in
order to know whether or not the job actually completed...
 
 Also - outside of Solr initiating a delete due to encountering the same
UniqueKey, is there anything else that could cause a delete to be initiated by
Solr?

Lastly, is there any concern with running multiple CSV update requests on
different data files containing different data?

Thanks in advance. This was very helpful.

Mike
 


From: Shawn Heisey 
To: solr-user@lucene.apache.org 
Sent: Monday, July 1, 2013 2:30 PM
 Subject: Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5


On 7/1/2013 12:56 PM, Mike L. wrote:
>  Hey Ahmet / Solr User Group,
>
>    I tried using the built-in UpdateCSV and it runs A LOT faster than a
> FileDataSource DIH, as illustrated below. However, I am a bit confused about
> the numDocs/maxDoc values when doing an import this way. Here's my GET command
> against a tab-delimited file: (I removed server info and additional fields..
> everything else is the same)
>
> http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields
>
>
> My response from solr:
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">591</int>
>   </lst>
> </response>
> I am experimenting with 2 csv files (1 with 10 records, the other with 1000)
> to see if I can get this to run correctly before running my entire collection
> of data. I initially loaded the first 1000 records to an empty core and that
> seemed to work; however, when running the above with a csv file that has
> 10 records, I would like to see only 10 active records in my core. What I get
> instead, when looking at my stats page:
>
> numDocs 1000
> maxDoc 1010
>
> If I run the same url above while appending an 'optimize=true', I get:
>
> numDocs 1000,
> maxDoc 1000.

A discrepancy between numDocs and maxDoc indicates that there are 
deleted documents in your index.  You might already know this, so here's 
an answer to what I think might be your actual question:

If you want to delete the 1000 existing documents before adding the 10 
documents, then you have to actually do that deletion.  The CSV update 
handler works at a lower level than the DataImport handler, and doesn't 
have "clean" or "full-import" options, which defaults to clean=true. 
The DIH is like a full application embedded inside Solr, one that uses 
an update handler -- it is not itself an update handler.  When 
clean=true or using full-import without a clean option, DIH itself sends 
a "delete all documents" update request.

If you didn't already know the bit about the deleted documents, then 
read this:

It can be normal for indexing "new" documents to cause deleted 
documents.  This happens when you have the same value in your UniqueKey 
field as documents that are already in your index.  Solr knows by the 
config you gave it that they are the same document, so it deletes the 
old one before adding the new one.  Solr has no way to know whether the 
docu

Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5

2013-07-01 Thread Mike L.
 Hey Ahmet / Solr User Group,
 
   I tried using the built-in UpdateCSV and it runs A LOT faster than a
FileDataSource DIH, as illustrated below. However, I am a bit confused about the
numDocs/maxDoc values when doing an import this way. Here's my GET command
against a tab-delimited file: (I removed server info and additional fields..
everything else is the same)

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


My response from solr:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">591</int>
  </lst>
</response>
 
I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to
see if I can get this to run correctly before running my entire collection of
data. I initially loaded the first 1000 records to an empty core and that
seemed to work; however, when running the above with a csv file that has 10
records, I would like to see only 10 active records in my core. What I get
instead, when looking at my stats page:

numDocs 1000 
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000, 
maxDoc 1000.

Perhaps the commit=true is not doing what it's supposed to, or am I missing
something? I also tried passing a commit afterward like this:
http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (didn't
seem to do anything either)
 

From: Ahmet Arslan 
To: "solr-user@lucene.apache.org" ; Mike L. 
 
Sent: Saturday, June 29, 2013 7:20 AM
 Subject: Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5


Hi Mike,


You could try http://wiki.apache.org/solr/UpdateCSV 

And make sure you commit at the very end.





From: Mike L. 
To: "solr-user@lucene.apache.org"  
Sent: Saturday, June 29, 2013 3:15 AM
 Subject: FileDataSource vs JdbcDataSource (speed) Solr 3.5


 
I've been working on improving index time with a JdbcDataSource DIH based
config and found it not to be as performant as I'd hoped, for various
reasons, not specifically due to Solr. With that said, I decided to switch
gears a bit and test out a FileDataSource setup... I assumed that by eliminating
network latency, I should see drastic improvements in import time... but
I'm a bit surprised that this process seems to run much slower, at least the
way I've initially coded it. (below)
 
The below is a barebones file import that I wrote which consumes a tab-delimited
file. Nothing fancy here. The regex just separates out the fields... Is there a
faster approach to doing this? If so, what is it?
 
Also, what is the "recommended" approach in terms of index/importing data? I 
know thats may come across as a vague question as there are various options 
available, but which one would be considered the "standard" approach within a 
production enterprise environment.
 
 
(below has been cleansed)
 

 
   
 
 
 
   

 
Thanks in advance,
Mike


FileDataSource vs JdbcDataSource (speed) Solr 3.5

2013-06-28 Thread Mike L.
 
I've been working on improving index time with a JdbcDataSource DIH based
config and found it not to be as performant as I'd hoped, for various
reasons, not specifically due to Solr. With that said, I decided to switch
gears a bit and test out a FileDataSource setup... I assumed that by eliminating
network latency, I should see drastic improvements in import time... but
I'm a bit surprised that this process seems to run much slower, at least the
way I've initially coded it. (below)
 
The below is a barebones file import that I wrote which consumes a tab-delimited
file. Nothing fancy here. The regex just separates out the fields... Is there a
faster approach to doing this? If so, what is it?
 
Also, what is the "recommended" approach in terms of index/importing data? I 
know thats may come across as a vague question as there are various options 
available, but which one would be considered the "standard" approach within a 
production enterprise environment.
 
 
(below has been cleansed)
 

 
   
 
 
 
   

 
Thanks in advance,
Mike

Re: Parallel Import Process on same core. Solr 3.5

2013-06-26 Thread Mike L.
Thanks for the response.
 
Here's the scrubbed version of my DIH: http://apaste.info/6uGH 
 
It contains everything I'm more or less doing... pretty straightforward. One
thing to note, and I don't know if this is a bug or not, but the batchSize="-1"
streaming feature doesn't seem to work, at least with Informix JDBC drivers. I
set the batchSize to "500", but have tested it with various numbers including
5000, 1. I'm aware that behind the scenes this should just be setting the
fetchSize, but it's a bit puzzling why I don't see a difference regardless of
what value I actually use. I was told by one of our DBAs that our value is set
as a global DB param and can't be modified (which I haven't looked into
afterward.)
 
As far as heap patterns, I watch the process via Wily and notice GC occurs
every 15 minutes or so, but it becomes infrequent and not as significant as the
previous one. It's almost as if some memory is never released until it
eventually catches up to the max heap size.
 
I did assume that perhaps there could have been some locking issues, which is 
why I made the following modifications:
 
readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED"
 
What do you recommend for the mergeFactor, ramBufferSize, and autoCommit options?
My general understanding is the higher the mergeFactor, the less frequent the
merges, which should improve index time but slow down query response time. I
also read somewhere that an increase in ramBufferSize should help prevent
frequent merges... but I'm confused about why I didn't really see an
improvement... perhaps my combination of these values wasn't right in relation
to my total fetch size.
 
Also - my impression is the lower the autoCommit maxDocs/maxTime numbers (i.e.,
the defaults), the better for memory management, but at a cost to index time, as
you pay for the overhead of committing. Those are numbers I've been experimenting
with as well, and I have seen some variations in heap trends, but unfortunately
have not completed the job quite yet with any config... I did get very close.
I'd hate to throw additional memory at the problem if there is something else I
can tweak.
 
Thanks!
Mike
 

From: Shawn Heisey 
To: solr-user@lucene.apache.org 
Sent: Wednesday, June 26, 2013 12:13 PM
 Subject: Re: Parallel Import Process on same core. Solr 3.5


On 6/26/2013 10:58 AM, Mike L. wrote:
>  
> Hello,
>  
>        I'm trying to execute a parallel DIH process and running into heap 
>related issues, hoping somebody has experienced this and can recommend some 
>options..
>  
>        Using Solr 3.5 on CentOS.
>        Currently have JVM heap 4GB min , 8GB max
>  
>      When executing the entities in a sequential process (entities executing 
>in sequence by default), my heap never exceeds 3GB. When executing the 
>parallel process, everything runs fine for roughly an hour, then I reach the 
>8GB max heap size and the process stalls/fails.
>  
>      More specifically, here's how I'm executing the parallel import process: 
>I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME 
>VALUE') within my entity queries. And within Solrconfig.xml, I've created 
>corresponding data import handlers, one for each of these entities.
>  
> My total rows fetch/count is 9M records.
>  
> And when I initiate the import, I call each one, similar to the below
> (obviously I've stripped out my server & naming conventions).
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
>  
>  
> I assume that when doing this, only the first import request needs to contain 
> the clean=true param. 
>  
> I've divided each import query to target roughly the same amount of data, and 
> in solrconfig, I've tried various things in hopes to reduce heap size.

Thanks for including some solrconfig snippets, but I think what we
really need is your DIH configuration(s).  Use a pastebin site and
choose the proper document type.  http://apaste.info/ is available and
the proper type there would be (X)HTML.  If you need to sanitize these
to remove host/user/pass, please replace the values with something else
rather than deleting them entirely.

With full-import, clean defaults to true, so including it doesn't change
anything.  What I would actually do is have clean=true on the first
import you run, then after waiting a few seconds to be sure it is
running, start the others with clean=false so that they don't do ANOTHER
clean.
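
For example, reusing the placeholder URLs from the quoted message above:

http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]&clean=false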

I suspect that you might be running into JDBC driver behavior where the
entire result set is being buffered into RAM.

Thanks,
Shawn

Parallel Import Process on same core. Solr 3.5

2013-06-26 Thread Mike L.
 
Hello,
 
   I'm trying to execute a parallel DIH process and running into heap 
related issues, hoping somebody has experienced this and can recommend some 
options..
 
   Using Solr 3.5 on CentOS.
   Currently have JVM heap 4GB min , 8GB max
 
 When executing the entities in a sequential process (entities executing in 
sequence by default), my heap never exceeds 3GB. When executing the parallel 
process, everything runs fine for roughly an hour, then I reach the 8GB max 
heap size and the process stalls/fails.
 
 More specifically, here's how I'm executing the parallel import process: I 
target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME 
VALUE') within my entity queries. And within Solrconfig.xml, I've created 
corresponding data import handlers, one for each of these entities.
 
My total rows fetch/count is 9M records.
 
And when I initiate the import, I call each one, similar to the below
(obviously I've stripped out my server & naming conventions).
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
 
 
I assume that when doing this, only the first import request needs to contain 
the clean=true param. 
 
I've divided each import query to target roughly the same amount of data, and 
in solrconfig, I've tried various things in hopes to reduce heap size.
 
Here's my current config: 
 
 false
    15    
    100 
    2147483647
    1
    1000
    1
    single
  
  
    false
    100   
    15
    2147483647
    1
    false
  

 

   
  6  
  25000  
    
    10
 

 
What gets tricky is finding the sweet spot with these parameters, but I'm
wondering if anybody has any recommendations for an optimal config. Also,
regarding autoCommit, I've even turned that feature off, but my heap size reaches
its max sooner. I am wondering, though, what the difference would be between
autoCommit and passing in the commit=true param on each import query.
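
For reference, a sketch of an explicit autoCommit block in solrconfig.xml's
<updateHandler> section (the values are only illustrative):

  <autoCommit>
    <maxDocs>25000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>

whereas commit=true on the import request triggers a single commit when that
import finishes, rather than periodic commits during the run.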
 
Thanks in advance!
Mike