Re: Category Hierarchy on Dynamic Fields - Solr 4.10
Hey Erick, Thanks for the response. I haven't used PathHierarchyTokenizerFactory before... Seems with this approach, I'd flatten all the categories pertaining to a product into one Solr field... I guess using facet.prefix I won't lose the granularity I need to facet on individual categories, and can maintain the hierarchy. I'll take a look at it. Thanks!

From: Erick Erickson
To: solr-user@lucene.apache.org; Mike L.
Sent: Monday, July 6, 2015 12:42 PM
Subject: Re: Category Hierarchy on Dynamic Fields - Solr 4.10

Hmmm, probably missing something here, but have you looked at PathHierarchyTokenizerFactory? In essence, it indexes each sub-path as a token, which makes lots of faceting tasks easier. I.e. lev1/lev2/lev3 gets tokenized as three tokens:

lev1/lev2/lev3
lev1/lev2
lev1

If that works, I'm not quite sure why you need dynamic fields here, but then I only skimmed.

Best,
Erick

On Mon, Jul 6, 2015 at 10:33 AM, Mike L. wrote:
>
> Solr User Group -
> Was wondering if anybody had any suggestions/best practices around a
> requirement for storing a dynamic category structure that needs to have the
> ability to facet on and maintain its hierarchy.
> Some context:
>
> A product could belong to an undetermined number of product categories that
> contain a logical hierarchy. In other words, depending on the vendor being
> used, each product could contain anywhere from 1 to N levels of product
> categories. These categories do have a hierarchy that would need to be
> maintained, such that the website could drill down those category facets to
> find the appropriate product.
>
> Examples:
> Product A is associated to: Category, Subcategory 1, Subcategory 2, Subcategory 3, Subcategory 4
> Product B is associated to: Category, Subcategory 1, Subcategory 2
> Product C is associated to: Category, Subcategory 1
> etc.
>
> If the category structure were a bit more predictable, it would be easy to
> load the data into Solr and understand from a UI perspective how best to
> create a faceted hierarchy.
>
> However, because this category structure is dynamic and different for each
> product, I'm trying to plan the best course of action in terms of how to
> manage this data in Solr, provide category facets, and guide
> the UI on how best to query the data in the appropriate hierarchy.
> I'm thinking of loading the category structure using dynamic fields, but are
> there any good approaches for faceting and deriving a hierarchy on those
> dynamic fields? Or other thoughts around this.
>
> Hope that makes sense.
> Thanks,
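[Editor's sketch] Erick's suggestion can be expressed as a schema fieldType along these lines. This is a minimal sketch; the type name and delimiter are assumptions, not from the thread:

```xml
<!-- index-time tokenizer emits one token per ancestor path:
     lev1/lev2/lev3 -> lev1, lev1/lev2, lev1/lev2/lev3 -->
<fieldType name="text_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <!-- query side left untokenized so fq values match whole paths exactly -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Faceting on such a field with facet.prefix=lev1/lev2 then returns only the descendants of that node, which is the "don't lose granularity" point Mike raises above.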
Category Hierarchy on Dynamic Fields - Solr 4.10
Solr User Group -

Was wondering if anybody had any suggestions/best practices around a requirement for storing a dynamic category structure that needs to have the ability to facet on and maintain its hierarchy.

Some context: A product could belong to an undetermined number of product categories that contain a logical hierarchy. In other words, depending on the vendor being used, each product could contain anywhere from 1 to N levels of product categories. These categories do have a hierarchy that would need to be maintained, such that the website could drill down those category facets to find the appropriate product.

Examples:
Product A is associated to: Category, Subcategory 1, Subcategory 2, Subcategory 3, Subcategory 4
Product B is associated to: Category, Subcategory 1, Subcategory 2
Product C is associated to: Category, Subcategory 1
etc.

If the category structure were a bit more predictable, it would be easy to load the data into Solr and understand from a UI perspective how best to create a faceted hierarchy. However, because this category structure is dynamic and different for each product, I'm trying to plan the best course of action in terms of how to manage this data in Solr, provide category facets, and guide the UI on how best to query the data in the appropriate hierarchy. I'm thinking of loading the category structure using dynamic fields, but are there any good approaches for faceting and deriving a hierarchy on those dynamic fields? Or other thoughts around this.

Hope that makes sense.
Thanks,
Re: Bq Question - Solr 4.10
Thanks Jack. I'll give that a whirl.

From: Jack Krupansky
To: solr-user@lucene.apache.org; Mike L.
Sent: Saturday, April 11, 2015 12:04 PM
Subject: Re: Bq Question - Solr 4.10

It all depends on what you want your scores to look like. Or do you care at all what the scores look like? Here's one strategy... Divide the score space into two halves, the upper half for a preferred manufacturer and the lower half for non-preferred manufacturers.

Step 1: Add 1.0 to the raw Lucene score (bf parameter) if the document is a preferred manufacturer.
Step 2: Divide the resulting score by 2 (boost parameter).

So if two documents had the same score, say 0.7, the preferred manufacturer would get a score of (1 + 0.7)/2 = 1.7/2 = 0.85, while the non-preferred manufacturer would get a score of 0.7/2 = 0.35. IOW, apply an additive boost of 1.0 and then a multiplicative boost of 0.5.

-- Jack Krupansky

On Sat, Apr 11, 2015 at 12:28 PM, Mike L. wrote:

Hello - I have qf boosting set up, and that works well and is balanced across different fields. However, I have a requirement that if a particular manufacturer is part of the returned matched documents (say the top 20 results), all the matched docs from that manufacturer should be bumped to the top of the result list. From a scoring perspective, when I look at those manufacturer docs vs. the ones scored higher, there is a big difference, because the keywords searched are much more relevant on the other docs. I'm a bit concerned with the idea of applying an enormous bq boost for that particular manufacturer to bump up those docs - but I suspect it would work. On the flip side, I considered using elevate, but there are thousands of documents I would have to account for and hardcode those doc ids. Is using bq the best approach, or is there a better solution to this?

Thanks,
Mike
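[Editor's sketch] Jack's two steps map onto edismax's additive bf and multiplicative boost parameters. A hedged sketch of the request-handler defaults; the manufacturer field and value are hypothetical, map(query(...)) is just one way to turn a match into a 0/1 flag, and the exact order in which boost and bf combine is worth verifying with debugQuery=true:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- step 1: add 1.0 to the score when the (hypothetical) preferred
         manufacturer matches; map(x,0,0,0,1) yields 0 for a 0 score, else 1 -->
    <str name="bf">map(query($pref),0,0,0,1)</str>
    <str name="pref">manufacturer:"Acme"</str>
    <!-- step 2: halve the score (multiplicative boost) -->
    <str name="boost">0.5</str>
  </lst>
</requestHandler>
```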
Bq Question - Solr 4.10
Hello - I have qf boosting set up, and that works well and is balanced across different fields. However, I have a requirement that if a particular manufacturer is part of the returned matched documents (say the top 20 results), all the matched docs from that manufacturer should be bumped to the top of the result list. From a scoring perspective, when I look at those manufacturer docs vs. the ones scored higher, there is a big difference, because the keywords searched are much more relevant on the other docs. I'm a bit concerned with the idea of applying an enormous bq boost for that particular manufacturer to bump up those docs - but I suspect it would work. On the flip side, I considered using elevate, but there are thousands of documents I would have to account for and hardcode those doc ids. Is using bq the best approach, or is there a better solution to this?

Thanks,
Mike
Re: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File
Typo: *even when the user delimits with a space (e.g. "base ball" should find "baseball").

Thanks,

From: Mike L.
To: "solr-user@lucene.apache.org"
Sent: Tuesday, April 7, 2015 9:05 AM
Subject: DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File

Solr User Group -

I have a case where I need to be able to search against compound words, even when the user delimits with a space (e.g. baseball => base ball). I think I've solved this by creating a compound-words dictionary file containing the split words that I would want DictionaryCompoundWordTokenFilterFactory to split:

base \n ball

I also applied the following rule in the synonym file: baseball => base ball (to allow baseball to also get a hit).

Two questions - If I could figure out in advance all the compound words I would want to split, would it be better (more reliable results) for me to maintain this compound-words file, or would it be better to throw one of those open office dictionaries at the filter? Also - any better suggestions for dealing with this problem vs. the approach I described using both the dictionary filter and the synonym rule?

Thanks in advance!
Mike
DictionaryCompoundWordTokenFilterFactory - Dictionary/Compound-Words File
Solr User Group -

I have a case where I need to be able to search against compound words, even when the user delimits with a space (e.g. baseball => base ball). I think I've solved this by creating a compound-words dictionary file containing the split words that I would want DictionaryCompoundWordTokenFilterFactory to split:

base \n ball

I also applied the following rule in the synonym file: baseball => base ball (to allow baseball to also get a hit).

Two questions - If I could figure out in advance all the compound words I would want to split, would it be better (more reliable results) for me to maintain this compound-words file, or would it be better to throw one of those open office dictionaries at the filter? Also - any better suggestions for dealing with this problem vs. the approach I described using both the dictionary filter and the synonym rule?

Thanks in advance!
Mike
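[Editor's sketch] The analysis chain described above could look roughly like this. The file names and size limits are assumptions; note the compound filter keeps the original token in addition to the subwords, so exact matches on "baseball" still work:

```xml
<fieldType name="text_compound" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- splits "baseball" into base + ball when both appear in the dictionary -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compound-words.txt"
            minWordSize="5" minSubwordSize="3" maxSubwordSize="15"
            onlyLongestMatch="false"/>
    <!-- carries the "baseball => base ball" rule from the thread -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```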
Re: WordDelimiterFilterFactory - tokenizer question
Thanks Jack! That was an oversight on my end - I had also assumed the splitOnNumerics="1" and the LowerCaseFilterFactory would be breaking out the tokens. I tried again with generateWordParts="1" and generateNumberParts="1" and it worked. Appreciate it.

Mike

From: Jack Krupansky
To: solr-user@lucene.apache.org; Mike L.
Sent: Sunday, April 5, 2015 8:23 AM
Subject: Re: WordDelimiterFilterFactory - tokenizer question

You have to tell the filter what types of tokens to generate - words, numbers. You told it to generate... nothing. You did tell it to preserve the original, unfiltered token though, which is fine.

-- Jack Krupansky

On Sun, Apr 5, 2015 at 3:39 AM, Mike L. wrote:

Solr User Group,

I have a non-multivalued field which contains stored values similar to this: US100A, US100B, US100C, US100-D, US100BBA. My assumption is - if I tokenized with the below fieldType definition, specifically the WDF splitOnNumerics and the LowerCaseFilterFactory would have provided me Solr matches for the query words ?q=US 100 or ?q=US100 across those field values. In other words, US100A, US100B, US100C and US100-D would all have matched and scored against my qf weights. However, I'm not seeing that sort of behavior; I've tried various combinations and am starting to question my assumptions about the tokenizer. Ideally, I would like to return all values (US100A, US100B, US100C, US100-D) when, for example, q=US100A is searched on this field. I know I should probably provide the debugQuery results, but was hoping this was a quick hit for somebody, and I'm also reindexing. WordDelimiterFilterFactory doesn't seem to be working as expected. Hoping to get some clarification, or to hear if something sticks out here. Below is the field type definition being used:

Thanks in advance.
Mike
WordDelimiterFilterFactory - tokenizer question
Solr User Group,

I have a non-multivalued field which contains stored values similar to this: US100A, US100B, US100C, US100-D, US100BBA. My assumption is - if I tokenized with the below fieldType definition, specifically the WDF splitOnNumerics and the LowerCaseFilterFactory would have provided me Solr matches for the query words ?q=US 100 or ?q=US100 across those field values. In other words, US100A, US100B, US100C and US100-D would all have matched and scored against my qf weights. However, I'm not seeing that sort of behavior; I've tried various combinations and am starting to question my assumptions about the tokenizer. Ideally, I would like to return all values (US100A, US100B, US100C, US100-D) when, for example, q=US100A is searched on this field. I know I should probably provide the debugQuery results, but was hoping this was a quick hit for somebody, and I'm also reindexing. WordDelimiterFilterFactory doesn't seem to be working as expected. Hoping to get some clarification, or to hear if something sticks out here. Below is the field type definition being used:

Thanks in advance.
Mike
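[Editor's sketch] Jack's point, concretely: splitOnNumerics only controls where splits may occur, while generateWordParts/generateNumberParts control whether the split tokens are emitted at all. A fieldType along these lines (the type name is made up) would produce the parts Mike expected:

```xml
<fieldType name="text_part" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- US100A -> US, 100, A (the original token is also preserved) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place, q=US100, q=US 100 and q=US100A all share the tokens us and 100, so all four variants match and score against the qf weights.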
Re: Max Limit to Schema Fields - Solr 4.X
Appreciate all the support and I'll give it a whirl. Cheers!

Sent from my iPhone

> On Feb 8, 2014, at 4:25 PM, Shawn Heisey wrote:
>
>> On 2/8/2014 12:12 PM, Mike L. wrote:
>> I'm going to try loading all 3000 fields in the schema and see how that goes.
>> Only concern is doing boolean searches and whether or not I'll run into URL
>> length issues, but I guess I'll find out soon.
>
> It will likely work without a problem. As already mentioned, you may
> need to increase maxBooleanClauses in solrconfig.xml beyond the default
> of 1024.
>
> The max URL size is configurable with any decent servlet container,
> including the jetty that comes with the Solr example. In the part of
> the jetty config that adds the connector, this increases the max HTTP
> header size to 32K, and the size for the entire HTTP buffer to 64K.
> These may not be big enough with 3000 fields, but it gives you the
> general idea:
>
> 32768
> 65536
>
> Another option is to use a POST request instead of a GET request with
> the parameters in the posted body. The default POST buffer size in
> Jetty is 200K. In newer versions of Solr, the limit is actually set by
> Solr, not the servlet container, and defaults to 2MB. I believe that if
> you are using SolrJ, it uses POST requests by default.
>
> Thanks,
> Shawn
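[Editor's sketch] The two bare numbers in Shawn's reply (32768 and 65536) were originally wrapped in Jetty connector settings that the archive stripped. A reconstruction, under the assumption of the Jetty 8 SelectChannelConnector that shipped with the Solr 4 example (the element names are assumed, not from the thread):

```xml
<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <!-- max size of the HTTP request line plus headers (32K) -->
      <Set name="requestHeaderSize">32768</Set>
      <!-- size of the whole HTTP request buffer (64K) -->
      <Set name="requestBufferSize">65536</Set>
    </New>
  </Arg>
</Call>
```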
Re: Max Limit to Schema Fields - Solr 4.X
That was the original plan. However, it's important to preserve the originating field that loaded the value. The data is very fine-grained and each field stores a particular value. When searching the data in Solr, it's important to know which docs contain that particular data from that particular field (fielda:value, fieldb:value), whereas searching field:value would hide which originating field loaded the value. I'm going to try loading all 3000 fields in the schema and see how that goes. Only concern is doing boolean searches and whether or not I'll run into URL length issues, but I guess I'll find out soon. Thanks again!

Sent from my iPhone

> On Feb 6, 2014, at 1:02 PM, Erick Erickson wrote:
>
> Sometimes you can spoof the many-fields problem by using prefixes on the
> data. Rather than fielda, fieldb... have one field and index values like
> fielda_value, fieldb_value into that single field. Then do the right thing
> when searching. Watch tokenization though.
>
> Best
> Erick
>
>> On Feb 5, 2014 4:59 AM, "Mike L." wrote:
>>
>> Thanks Shawn. This is good to know.
>>
>> Sent from my iPhone
>>
>>> On Feb 5, 2014, at 12:53 AM, Shawn Heisey wrote:
>>>
>>>> On 2/4/2014 8:00 PM, Mike L. wrote:
>>>> I'm just wondering here if there is any defined limit to how many
>>>> fields can be created within a schema? I'm sure the configuration
>>>> maintenance of a schema like this would be a nightmare, but would like to
>>>> know if it's at all possible in the first place before it's attempted.
>>>
>>> There are no hard limits on the number of fields, whether they are
>>> dynamically defined or not. Several thousand fields should be no
>>> problem. If you have enough system resources and you don't run into an
>>> unlikely bug, there's no reason it won't work. As you've already been
>>> told, there are potential performance concerns. Depending on the exact
>>> nature of your queries, you might need to increase maxBooleanClauses.
>>>
>>> The only hard limitation that Lucene really has (and by extension, Solr
>>> also has that limitation) is that a single index cannot have more than
>>> about two billion documents in it - the inherent limitation of a Java
>>> "int" type. Solr can use indexes larger than this through sharding.
>>>
>>> See the very end of this page:
>>> https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/codecs/lucene46/package-summary.html#Limitations
>>>
>>> Thanks,
>>> Shawn
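[Editor's sketch] Erick's prefix trick from this thread as a schema fragment; field and value names are hypothetical. The originating field survives because it is baked into the token itself, which addresses Mike's concern above:

```xml
<!-- one catch-all field instead of ~3000 individually declared fields -->
<field name="attr" type="string" indexed="true" stored="true" multiValued="true"/>
```

At index time, the value "red" from source field fielda is indexed as attr:fielda_red; at query time, fq=attr:fielda_red matches only docs where that value came from fielda, while a copy without the prefix (or a wildcard) can cover "any field" searches.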
Re: Max Limit to Schema Fields - Solr 4.X
Thanks Shawn. This is good to know.

Sent from my iPhone

> On Feb 5, 2014, at 12:53 AM, Shawn Heisey wrote:
>
>> On 2/4/2014 8:00 PM, Mike L. wrote:
>> I'm just wondering here if there is any defined limit to how many fields can
>> be created within a schema? I'm sure the configuration maintenance of a
>> schema like this would be a nightmare, but would like to know if it's at all
>> possible in the first place before it's attempted.
>
> There are no hard limits on the number of fields, whether they are
> dynamically defined or not. Several thousand fields should be no
> problem. If you have enough system resources and you don't run into an
> unlikely bug, there's no reason it won't work. As you've already been
> told, there are potential performance concerns. Depending on the exact
> nature of your queries, you might need to increase maxBooleanClauses.
>
> The only hard limitation that Lucene really has (and by extension, Solr
> also has that limitation) is that a single index cannot have more than
> about two billion documents in it - the inherent limitation of a Java
> "int" type. Solr can use indexes larger than this through sharding.
>
> See the very end of this page:
>
> https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/codecs/lucene46/package-summary.html#Limitations
>
> Thanks,
> Shawn
Re: Max Limit to Schema Fields - Solr 4.X
Hey Jack - Two types of queries:

A) Return all docs that have a match for a particular value from a particular field (fq=fieldname:value). Because of this I feel I'm tied to defining all the fields. No particular field matters more than another - it depends on the search context, so it's hard to predict common searches.

B) Return all docs that have a particular value in one or more fields (a small subset of the 3000).

I've been a bit spoiled with Solr, being used to response times under 50ms, but in this case search does not have to be fast. Also, the total index size would be less than 1GB, with fewer than 1M total docs.

-Mike

Sent from my iPhone

> On Feb 4, 2014, at 10:38 PM, "Jack Krupansky" wrote:
>
> What will your queries be like? Will it be okay if they are relatively slow?
> I mean, how many of those 100 fields will you need to use in a typical (95th
> percentile) query?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Mike L.
> Sent: Tuesday, February 4, 2014 10:00 PM
> To: solr-user@lucene.apache.org
> Subject: Max Limit to Schema Fields - Solr 4.X
>
> solr user group -
>
> I'm afraid I may have a scenario where I might need to define a few
> thousand fields in Solr. The context here is that this type of data is extremely
> granular and unfortunately cannot be grouped into logical groupings or
> aggregate fields, because there is a need to know which granular field
> contains the data, and those fields need to be searchable.
>
> With that said, I expect each doc to contain no more than 100 fields with
> loaded data at a given time. It's just not clear, of the few thousand fields
> created, which ones will have the data pertaining to a given doc.
>
> I'm just wondering here if there is any defined limit to how many fields can
> be created within a schema? I'm sure the configuration maintenance of a
> schema like this would be a nightmare, but would like to know if it's at all
> possible in the first place before it's attempted.
>
> Thanks in advance -
> Mike
Max Limit to Schema Fields - Solr 4.X
solr user group -

I'm afraid I may have a scenario where I might need to define a few thousand fields in Solr. The context here is that this type of data is extremely granular and unfortunately cannot be grouped into logical groupings or aggregate fields, because there is a need to know which granular field contains the data, and those fields need to be searchable. With that said, I expect each doc to contain no more than 100 fields with loaded data at a given time. It's just not clear, of the few thousand fields created, which ones will have the data pertaining to a given doc. I'm just wondering here if there is any defined limit to how many fields can be created within a schema? I'm sure the configuration maintenance of a schema like this would be a nightmare, but would like to know if it's at all possible in the first place before it's attempted.

Thanks in advance -
Mike
Re: ContributorsGroup
Ah sorry! It's: mikelabib. Thanks!

From: Stefan Matheis
To: solr-user@lucene.apache.org
Sent: Thursday, September 26, 2013 12:05 PM
Subject: Re: ContributorsGroup

Mike

To add you as Contributor I'd need to know your username? :)

Stefan

On Thursday, September 26, 2013 at 6:50 PM, Mike L. wrote:
>
> Solr Admins,
>
> I've been using Solr for the last couple of years and would like to
> contribute to this awesome project. Can I be added to the ContributorsGroup,
> with access also to update the Wiki?
>
> Thanks in advance.
>
> Mike L.
ContributorsGroup
Solr Admins,

I've been using Solr for the last couple of years and would like to contribute to this awesome project. Can I be added to the ContributorsGroup, with access also to update the Wiki?

Thanks in advance.

Mike L.
Re: Solr 4.4 Import from CSV to Multi-value field - Adds quote on last value
Nevermind, I figured it out. Excel was applying a hidden quote to the data. Thanks anyway.

From: Mike L.
To: "solr-user@lucene.apache.org"
Sent: Wednesday, September 25, 2013 11:32 AM
Subject: Solr 4.4 Import from CSV to Multi-value field - Adds quote on last value

Solr Family,

I'm a Solr 3.6 user who just pulled down 4.4 yesterday and noticed something a bit odd when importing into a multi-valued field. I wouldn't be surprised if there's a user error on my end, but hopefully there isn't a bug. Here's the situation. I created some test data to import, and one field needs to be split into a multi-valued field. This data resides within a .csv file and is structured like the following (below are replacement field names; also note there are no quotes " within the data):

field1|field2|field3|field4_valueA,field4_valueB,field4_valueC

http://[myserver]/solr/[my corename]/update?commit=true&separator=|&escape=\&stream.file=[location of file]&fieldnames=field1,field2,field3,field4&optimize=true&stream.contentType=application/csv&f.field4.split=true&f.field4.separator=%2C

After importing the data, I see results similar to the below for the multi-valued field, field4:

field4_valueA
field4_valueB
field4_valueC"   (Why is there a trailing quote here?)

I also noticed that if only 1 value is being inserted into this multi-valued field, there is no issue. It always happens on the last value.

Thanks in advance, Cheers!
Mike
Solr 4.4 Import from CSV to Multi-value field - Adds quote on last value
Solr Family,

I'm a Solr 3.6 user who just pulled down 4.4 yesterday and noticed something a bit odd when importing into a multi-valued field. I wouldn't be surprised if there's a user error on my end, but hopefully there isn't a bug. Here's the situation. I created some test data to import, and one field needs to be split into a multi-valued field. This data resides within a .csv file and is structured like the following (below are replacement field names; also note there are no quotes " within the data):

field1|field2|field3|field4_valueA,field4_valueB,field4_valueC

http://[myserver]/solr/[my corename]/update?commit=true&separator=|&escape=\&stream.file=[location of file]&fieldnames=field1,field2,field3,field4&optimize=true&stream.contentType=application/csv&f.field4.split=true&f.field4.separator=%2C

After importing the data, I see results similar to the below for the multi-valued field, field4:

field4_valueA
field4_valueB
field4_valueC"   (Why is there a trailing quote here?)

I also noticed that if only 1 value is being inserted into this multi-valued field, there is no issue. It always happens on the last value.

Thanks in advance, Cheers!
Mike
Returning Hierarchical / Data Relationships In Solr 3.6 (or Solr 4 via Solr Join)
Solr User Group,

I would like to return a hierarchical data relationship when somebody queries for a parent doc in Solr. This sort of relationship doesn't currently exist in our core, as the use-case has been to search for a specific document only. However, here's an example of what's being asked (not the same kind of relationship, but a similar concept; there will always be only 1 parent to many children): the user searches for a parent value and also gets child docs as part of the response (not child names as multi-valued fields). For example, say:

select?qt=parentvaluesearch&q=[parentValue]

1 [parentValue] parent John Doe M
2 child Chris Doe M
3 child Stacy Doe F

At first I was thinking I could just add a field within each child doc to represent the parentValue; however, this family relationship is a bit more complex, as children can be associated to many different parents (parent docs), so I don't want to tie the relationship off the child. On the flip side, it seems I could have a multi-valued field with all the child names within the parent doc and then re-query the core for the child docs and append them to the response... the caveat there is that this parent may have a few hundred children, and I'm not sure a multi-valued field would make sense for storing the children references. This approach would also dramatically increase the response time from on average 20ms to ~4sec, assuming a parent has 200 children. Has anybody solved a similar issue, or have thoughts on the best way to tackle this with version 3.6? Also, could Solr joins, introduced in 4.X, address this issue? (Not too familiar with them, but they seem related.)

Thanks in advance!
Mike
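[Editor's sketch] On the Solr 4 question at the end: the {!join} query parser can express this if each child doc carries a (possibly multiValued) parent_id, which also accommodates children associated to many parents. A hedged sketch of the query parameter; the field names are hypothetical:

```xml
<!-- match the parent by name, then return the docs whose parent_id
     equals that parent's id, i.e. the children -->
<str name="q">{!join from=id to=parent_id}name:"John Doe"</str>
```

Note this returns only the joined-to side (the children) in one request; getting the parent and its children in a single response would still take a second query in 4.x, or block join support in later versions.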
Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
Hey Shawn / Solr User Group,

This makes perfect sense to me. Thanks for the thorough answer.

"The CSV update handler works at a lower level than the DataImport handler, and doesn't have "clean" or "full-import" options (clean defaults to true in DIH). The DIH is like a full application embedded inside Solr, one that uses an update handler -- it is not itself an update handler. When clean=true, or using full-import without a clean option, DIH itself sends a "delete all documents" update request."

And similarly, my assumption is that in the event of a non-syntactical failure/interruption (such as a server crash) during the CSV update, a rollback (stream.body=<rollback/>) would also need to be manually requested (or automated, but outside of Solr), whereas the DIH automates this request on my behalf as well...? Is there any way to detect this failure or interruption? A real example: I was in the process of indexing data via the CSV update and somebody bounced the server before it completed. No actual errors were produced, but it appeared that the CSV update process stopped at the point of the reboot. My assumption is that if I had issued a rollback, I'd get the previously indexed data back, given I didn't request a delete beforehand (haven't yet tested this). But I'm wondering how I could detect this automatically? This, I guess, is where DIH starts gaining some merit. Also, the response that the DIH produces when the indexing process is complete appears to be a lot more mature, in that it explicitly indicates the import completed and that the information can be re-queried. It would be nice if the CSV update provided a similar response; my assumption is it would first need to know how many lines exist in the file in order to know whether the job actually completed...

Also - outside of Solr initiating a delete due to encountering the same UniqueKey, is there anything else that could cause a delete to be initiated by Solr?

Lastly, is there any concern with running multiple CSV update requests on different data files containing different data?

Thanks in advance. This was very helpful.
Mike

From: Shawn Heisey
To: solr-user@lucene.apache.org
Sent: Monday, July 1, 2013 2:30 PM
Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

On 7/1/2013 12:56 PM, Mike L. wrote:
> Hey Ahmet / Solr User Group,
>
> I tried using the built-in UpdateCSV and it runs A LOT faster than a
> FileDataSource DIH, as illustrated below. However, I am a bit confused about
> the numDocs/maxDoc values when doing an import this way. Here's my GET command
> against a tab-delimited file: (I removed server info and additional fields..
> everything else is the same)
>
> http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields
>
> My response from solr:
>
> <int name="status">0</int><int name="QTime">591</int>
>
> I am experimenting with 2 csv files (1 with 10 records, the other with 1000)
> to see if I can get this to run correctly before running my entire collection
> of data. I initially loaded the first 1000 records to an empty core and that
> seemed to work; however, when running the above with a csv file that has
> 10 records, I would like to see only 10 active records in my core. What I get
> instead, when looking at my stats page:
>
> numDocs 1000
> maxDoc 1010
>
> If I run the same url above while appending 'optimize=true', I get:
>
> numDocs 1000,
> maxDoc 1000.

A discrepancy between numDocs and maxDoc indicates that there are deleted documents in your index. You might already know this, so here's an answer to what I think might be your actual question: If you want to delete the 1000 existing documents before adding the 10 documents, then you have to actually do that deletion.

The CSV update handler works at a lower level than the DataImport handler, and doesn't have "clean" or "full-import" options (clean defaults to true in DIH). The DIH is like a full application embedded inside Solr, one that uses an update handler -- it is not itself an update handler. When clean=true, or using full-import without a clean option, DIH itself sends a "delete all documents" update request.

If you didn't already know the bit about the deleted documents, then read this: It can be normal for indexing "new" documents to cause deleted documents. This happens when you have the same value in your UniqueKey field as documents that are already in your index. Solr knows by the config you gave it that they are the same document, so it deletes the old one before adding the new one. Solr has no way to know whether the docu
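[Editor's sketch] On the manual-rollback question above: with the CSV handler, a rollback is an explicit update message you send yourself. A sketch; it only discards changes that have not yet been committed:

```xml
<!-- POST this to /update (or pass it as stream.body=%3Crollback/%3E)
     to discard all adds/deletes since the last commit -->
<rollback/>
```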
Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
Hey Ahmet / Solr User Group, I tried using the built in UpdateCSV and it runs A LOT faster than a FileDataSource DIH as illustrated below. However, I am a bit confused about the numDocs/maxDoc values when doing an import this way. Here's my Get command against a Tab delimted file: (I removed server info and additional fields.. everything else is the same) http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields My response from solr 0591 I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to see If I can get this to run correctly before running my entire collection of data. I initially loaded the first 1000 records to an empty core and that seemed to work, however, but when running the above with a csv file that has 10 records, I would like to see only 10 active records in my core. What I get instead, when looking at my stats page: numDocs 1000 maxDoc 1010 If I run the same url above while appending an 'optimize=true', I get: numDocs 1000, maxDoc 1000. Perhaps the commit=true is not doing what its supposed to or am I missing something? I also trying passing a commit afterward like this: http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E ( didn't seem to do anything either) From: Ahmet Arslan To: "solr-user@lucene.apache.org" ; Mike L. Sent: Saturday, June 29, 2013 7:20 AM Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5 Hi Mike, You could try http://wiki.apache.org/solr/UpdateCSV And make sure you commit at the very end. From: Mike L. To: "solr-user@lucene.apache.org" Sent: Saturday, June 29, 2013 3:15 AM Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5 I've been working on improving index time with a JdbcDataSource DIH based config and found it not to be as performant as I'd hoped for, for various reasons, not specifically due to solr. 
With that said, I decided to switch gears a bit and test out a FileDataSource setup... I assumed that by eliminating network latency, I should see drastic improvements in import time.. but I'm a bit surprised that this process seems to run much slower, at least the way I've initially coded it. (below) The below is a barebones file import that I wrote which consumes a tab delimited file. Nothing fancy here. The regex just separates out the fields... Is there a faster approach to doing this? If so, what is it? Also, what is the "recommended" approach in terms of indexing/importing data? I know that may come across as a vague question as there are various options available, but which one would be considered the "standard" approach within a production enterprise environment? (below has been cleansed) Thanks in advance, Mike
FileDataSource vs JdbcDataSource (speed) Solr 3.5
I've been working on improving index time with a JdbcDataSource DIH based config and found it not to be as performant as I'd hoped, for various reasons, not specifically due to Solr. With that said, I decided to switch gears a bit and test out a FileDataSource setup... I assumed that by eliminating network latency, I should see drastic improvements in import time.. but I'm a bit surprised that this process seems to run much slower, at least the way I've initially coded it. (below) The below is a barebones file import that I wrote which consumes a tab delimited file. Nothing fancy here. The regex just separates out the fields... Is there a faster approach to doing this? If so, what is it? Also, what is the "recommended" approach in terms of indexing/importing data? I know that may come across as a vague question as there are various options available, but which one would be considered the "standard" approach within a production enterprise environment? (below has been cleansed) Thanks in advance, Mike
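On the "is there a faster approach" question: for a plain tab-delimited file, splitting on the delimiter (or using the csv module with delimiter='\t') is generally cheaper per row than applying a regex. A minimal sketch, with made-up field names since the original config was cleansed:

```python
import csv
import io

# Hypothetical field names standing in for the cleansed ones.
FIELDS = ["id", "name", "category"]

def parse_tsv(stream):
    """Yield one dict per tab-delimited row, no regex involved."""
    reader = csv.reader(stream, delimiter="\t")
    for row in reader:
        yield dict(zip(FIELDS, row))

sample = io.StringIO("1\twidget\ttools\n2\tgadget\telectronics\n")
docs = list(parse_tsv(sample))
print(docs[0])  # {'id': '1', 'name': 'widget', 'category': 'tools'}
```

In a DIH context the equivalent is the LineEntityProcessor plus a transformer, but pre-splitting on the delimiter avoids paying regex cost on every line.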
Re: Parallel Import Process on same core. Solr 3.5
Thanks for the response. Here's the scrubbed version of my DIH: http://apaste.info/6uGH It contains everything I'm more or less doing... pretty straightforward.. One thing to note, and I don't know if this is a bug or not, but the batchSize="-1" streaming feature doesn't seem to work, at least with Informix JDBC drivers. I set the batchSize to "500", but have tested it with various numbers including 5000, 1. I'm aware that behind the scenes this should just be setting the fetchSize, but it's a bit puzzling why I don't see a difference regardless of what value I actually use. I was told by one of our DBAs that our value is set as a global DB param and can't be modified (which I haven't looked into further.) As far as HEAP patterns, I watch the process via WILY and notice GC occurs every 15 minutes or so, but it becomes infrequent and not as significant as the previous one. It's almost as if some memory is never released until it eventually catches up to the max heap size. I did assume that perhaps there could have been some locking issues, which is why I made the following modifications: readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED" What do you recommend for the mergeFactor, ramBufferSize and autoCommit options? My general understanding is that the higher the mergeFactor, the less frequent the merges, which should improve index time but slow down query response time. I also read somewhere that an increase in ramBufferSize should help prevent frequent merges... but I'm confused why I didn't really see an improvement... perhaps my combination of these values wasn't right in relation to my total fetch size. Also - my impression is that the lower the autoCommit maxDocs/maxTime numbers (i.e. the defaults), the better for memory management, but at a cost to index time, as you pay for the overhead of committing.
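For reference, the knobs discussed above live in solrconfig.xml in roughly this shape on Solr 3.x; the values here are illustrative only (the "sweet spot" depends on heap and fetch size, as the thread notes):

```xml
<indexDefaults>
  <!-- Higher mergeFactor = fewer merges during bulk indexing,
       at the cost of more segments to search afterward. -->
  <mergeFactor>15</mergeFactor>
  <!-- Larger RAM buffer flushes new segments to disk less often. -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexDefaults>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Frequent autoCommit bounds memory held by uncommitted documents,
       but each commit costs indexing time. -->
  <autoCommit>
    <maxDocs>25000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds; illustrative value -->
  </autoCommit>
</updateHandler>
```

The trade-off stated in the email is the right mental model: mergeFactor and ramBufferSizeMB trade indexing speed against search-time segment count and heap, while autoCommit trades heap held by uncommitted docs against commit overhead.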
That is a number I've been experimenting with as well, and I have seen some variations in heap trends, but unfortunately have not completed the job quite yet with any config... I did get very close.. I'd hate to throw additional memory at the problem if there is something else I can tweak.. Thanks! Mike From: Shawn Heisey To: solr-user@lucene.apache.org Sent: Wednesday, June 26, 2013 12:13 PM Subject: Re: Parallel Import Process on same core. Solr 3.5 On 6/26/2013 10:58 AM, Mike L. wrote: > > Hello, > > I'm trying to execute a parallel DIH process and running into heap >related issues, hoping somebody has experienced this and can recommend some >options.. > > Using Solr 3.5 on CentOS. > Currently have JVM heap 4GB min, 8GB max > > When executing the entities in a sequential process (entities executing >in sequence by default), my heap never exceeds 3GB. When executing the >parallel process, everything runs fine for roughly an hour, then I reach the >8GB max heap size and the process stalls/fails. > > More specifically, here's how I'm executing the parallel import process: >I target a logical range (i.e. WHERE some field BETWEEN 'SOME VALUE' AND 'SOME >VALUE') within my entity queries. And within solrconfig.xml, I've created >corresponding data import handlers, one for each of these entities. > > My total rows fetch/count is 9M records. > > And when I initiate the import, I call each one, similar to the below > (obviously I've stripped out my server & naming conventions.) > > http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true > > http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2] > > > I assume that when doing this, only the first import request needs to contain > the clean=true param.
> > I've divided each import query to target roughly the same amount of data, and > in solrconfig, I've tried various things in hopes of reducing heap size. Thanks for including some solrconfig snippets, but I think what we really need is your DIH configuration(s). Use a pastebin site and choose the proper document type. http://apaste.info/ is available, and the proper type there would be (X)HTML. If you need to sanitize these to remove host/user/pass, please replace the values with something else rather than deleting them entirely. With full-import, clean defaults to true, so including it doesn't change anything. What I would actually do is have clean=true on the first import you run, then, after waiting a few seconds to be sure it is running, start the others with clean=false so that they don't do ANOTHER clean. I suspect that you might be running into JDBC driver behavior where the entire result set is being buffered into RAM. Thanks, Shawn
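The staggering Shawn describes (clean=true only on the first import, clean=false on the rest) can be scripted; a sketch with placeholder host, core, and entity names:

```python
# Build parallel full-import URLs so that only the FIRST request cleans
# the index; host/core/handler names below are placeholders.
BASE = "http://server:8983/solr/corename/dataimport"

def import_urls(entities):
    urls = []
    for i, entity in enumerate(entities):
        clean = "true" if i == 0 else "false"
        urls.append(f"{BASE}?command=full-import&entity={entity}&clean={clean}")
    return urls

for url in import_urls(["range1", "range2", "range3"]):
    print(url)
```

In practice you would fire the first URL, wait a few seconds to confirm the import has started (so its delete-all has been issued), and only then fire the remaining clean=false URLs.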
Parallel Import Process on same core. Solr 3.5
Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min, 8GB max. When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e. WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server & naming conventions.) http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes of reducing heap size. Here's my current config: false 15 100 2147483647 1 1000 1 single false 100 15 2147483647 1 false 6 25000 10 What gets tricky is finding the sweet spot with these parameters, but wondering if anybody has any recommendations for an optimal config. Also, regarding autoCommit, I've even turned that feature off, but my heap size reaches its max sooner. I am wondering, though, what would be the difference between autoCommit and passing in the commit=true param on each import query. Thanks in advance! Mike
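The "logical range" partitioning described above (WHERE some field BETWEEN ... AND ...) can be sketched as follows, assuming (hypothetically) a numeric key that spans the collection:

```python
# Split a numeric key space into N contiguous BETWEEN ranges for
# parallel DIH entities; range boundaries are inclusive.
def partition(lo, hi, parts):
    size, extra = divmod(hi - lo + 1, parts)
    ranges, start = [], lo
    for i in range(parts):
        # Spread any remainder across the first `extra` ranges.
        end = start + size - 1 + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end + 1
    return ranges

# e.g. 9M rows split across 4 import handlers:
for lo_, hi_ in partition(1, 9_000_000, 4):
    print(f"WHERE id BETWEEN {lo_} AND {hi_}")
```

Each resulting range becomes the WHERE clause of one entity's query, with one import handler per entity as described in the post. Whether this helps the heap problem is a separate question: as Shawn's reply notes, a JDBC driver that buffers the whole result set will hold each partition's rows in RAM simultaneously when the imports run in parallel.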