Re: Extract info from parent node during data import
On Fri, Sep 11, 2009 at 6:48 AM, venn hardy venn.ha...@hotmail.com wrote:

Hi Fergus,

When I debugged in the development console http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport I had no problems. Each category/item seems to be indexed only once, and no parent fields are available (except the category name).

I am not entirely sure how the forEach statement works, but my interpretation of forEach="/document/category/item | /document/category" is something like this:

1. Whenever DIH encounters a /document/category it will extract the /document/category/name field as a common field.
2. Whenever DIH encounters a /document/category/item it will extract all of the item fields.
3. When all fields have been encountered, save the document in Solr and go to the next category/item.

/document/category/item | /document/category means there are two paths which trigger a new doc (it is possible to have more). Whenever it encounters the closing tag of that xpath, it emits all the fields it collected since the opening of the same tag. After that it clears all the fields it collected since the opening of the tag. If there are fields it collected before the opening of the same tag, it retains them.

Nice and clear, but that is not what I see. With my test case with forEach="/record | /record/mediaBlock" I see that each /record/mediaBlock document indexed contains all fields from the parent /record document as well. A search over mediaBlocks returns lots of extra fields from the parent which did not have the commonField attribute. I will try and produce a testcase.

Date: Thu, 10 Sep 2009 14:19:31 +0100
To: solr-user@lucene.apache.org
From: fer...@twig.me.uk
Subject: RE: Extract info from parent node during data import

Hi Paul,

The forEach="/document/category/item | /document/category/name" didn't work (no categoryname was stored or indexed). However forEach="/document/category/item | /document/category" seems to work well. I am not sure why category on its own works, but not category/name... But thanks for the tip. It wasn't as painful as I thought it would be.

Venn

Hmmm, I had bother with this. Although each occurrence of /document/category/item causes a new Solr document to be indexed, that document contained all the fields from the parent element as well. Did you see this?

From: noble.p...@corp.aol.com
Date: Thu, 10 Sep 2009 09:58:21 +0530
Subject: Re: Extract info from parent node during data import
To: solr-user@lucene.apache.org

Try this: add two xpaths in your forEach

    forEach="/document/category/item | /document/category/name"

and add a field as follows

    <field column="catgoryname" xpath="/document/category/name" commonField="true"/>

Please try it out and let me know.

On Thu, Sep 10, 2009 at 7:30 AM, venn hardy venn.ha...@hotmail.com wrote:

Hello,

I am using SOLR 1.4 (from a nightly build) and its URLDataSource in conjunction with the XPathEntityProcessor. I have successfully imported XML content, but I think I may have found a limitation when it comes to the commonField attribute in the DataImportHandler. Before writing my own parser to read in a whole XML document, I thought I'd post the question here (since I got some great advice last time).

The bulk of my content is contained within each item tag. However, each item has a parent called category and each category has a name which I would like to import. In my forEach loop I specify /document/category/item as the collection of items I am interested in. Is there any way to extract an element from underneath a parent node? To be more specific (see the example XML below), I would like to index the following:

- category: Category 1; id: 1; author: Author 1
- category: Category 1; id: 2; author: Author 2
- category: Category 2; id: 3; author: Author 3
- category: Category 2; id: 4; author: Author 4

Any ideas on how I can get to a parent node from within a child during data import? If it can't be done, what do you suggest would be the best way so I can keep using the DataImportHandler... would XSLT be a good idea to 'flatten out' the structure a bit?

Thanks

This is what my XML document looks like:

    <document>
      <category>
        <name>Category 1</name>
        <item>
          <id>1</id>
          <author>Author 1</author>
        </item>
        <item>
          <id>2</id>
          <author>Author 2</author>
        </item>
      </category>
      <category>
        <name>Category 2</name>
        <item>
          <id>3</id>
          <author>Author 3</author>
        </item>
        <item>
          <id>4</id>
          <author>Author 4</author>
        </item>
      </category>
    </document>

And this is what my dataConfig looks like:

    <dataConfig>
      <dataSource type="URLDataSource" />
      <document>
        <entity name="archive" pk="id"
                url="http://localhost:9080/data/20090817070752.xml"
                processor="XPathEntityProcessor"
                forEach="/document/category/item"
Re: Extract info from parent node during data import
On Sat, Sep 12, 2009 at 12:24 PM, Fergus McMenemie fer...@twig.me.uk wrote:

On Fri, Sep 11, 2009 at 6:48 AM, venn hardy venn.ha...@hotmail.com wrote:

Hi Fergus,

When I debugged in the development console http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport I had no problems. Each category/item seems to be indexed only once, and no parent fields are available (except the category name).

I am not entirely sure how the forEach statement works, but my interpretation of forEach="/document/category/item | /document/category" is something like this:

1. Whenever DIH encounters a /document/category it will extract the /document/category/name field as a common field.
2. Whenever DIH encounters a /document/category/item it will extract all of the item fields.
3. When all fields have been encountered, save the document in Solr and go to the next category/item.

/document/category/item | /document/category means there are two paths which trigger a new doc (it is possible to have more). Whenever it encounters the closing tag of that xpath, it emits all the fields it collected since the opening of the same tag. After that it clears all the fields it collected since the opening of the tag. If there are fields it collected before the opening of the same tag, it retains them.

Nice and clear, but that is not what I see. With my test case with forEach="/record | /record/mediaBlock" I see that each /record/mediaBlock document indexed contains all fields from the parent /record document as well. A search over mediaBlocks returns lots of extra fields from the parent which did not have the commonField attribute. I will try and produce a testcase.

Yes, it does. /record/mediaBlock will have all the fields collected from /record as well. It is by design.
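The emit-and-clear rule described in this thread can be modelled in a few lines. The following is a minimal sketch under stated assumptions, not DataImportHandler code: the class name ForEachSketch, the rows() method and the xpath-to-column map are all hypothetical, and only plain absolute element paths are handled. It follows the description above literally: a row is emitted at the closing tag of every forEach path, fields collected inside that tag are then cleared, and fields collected before the tag opened are retained.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Toy model of the forEach emit/clear rule from the thread; NOT DataImportHandler code. */
final class ForEachSketch {

    /**
     * @param forEach       absolute element paths that trigger a row (hypothetical parameter)
     * @param columnByXPath absolute element path -> column name (hypothetical parameter)
     */
    static List<Map<String, String>> rows(String xml, Set<String> forEach,
                                          Map<String, String> columnByXPath) {
        List<Map<String, String>> emitted = new ArrayList<>();
        Map<String, String> collected = new LinkedHashMap<>();
        // One entry per currently open forEach element: the columns collected inside it.
        Deque<Set<String>> scopes = new ArrayDeque<>();
        StringBuilder text = new StringBuilder();
        String path = "";
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    path = path + "/" + reader.getLocalName();
                    text.setLength(0);
                    if (forEach.contains(path)) {
                        scopes.push(new HashSet<>());
                    }
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String column = columnByXPath.get(path);
                    if (column != null) {
                        collected.put(column, text.toString().trim());
                        if (!scopes.isEmpty()) {
                            scopes.peek().add(column);
                        }
                    }
                    if (forEach.contains(path)) {
                        // Emit everything collected so far, then forget only what was
                        // collected since this tag opened; outer fields are retained.
                        emitted.add(new LinkedHashMap<>(collected));
                        collected.keySet().removeAll(scopes.pop());
                    }
                    path = path.substring(0, path.lastIndexOf('/'));
                    text.setLength(0);
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
        return emitted;
    }
}
```

Under this model, with forEach set to /document/category/item and /document/category, each item row carries the category name that was collected before the item opened, which matches the "by design" behaviour confirmed above. The model also emits one extra row per category when the category itself closes; whether the real importer turns such a row into a usable Solr document depends on the schema and is not something this sketch addresses.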
Re: Single Core or Multiple Core?
+1

Can you add a JIRA issue for that so we can vote for it?

Chris Hostetter wrote:

: For the record: even if you're only going to have one SolrCore, using the
: multicore support (ie: having a solr.xml file) might prove handy from a
: maintenance standpoint ... the ability to configure new on deck cores with
...
: Yeah, it is a shame that single-core deployments (no solr.xml) do not have
: a way to enable CoreAdminHandler. This is something we should definitely
: look at in Solr 1.5.

I think the most straightforward starting point is to switch how we structure the examples so that all of the examples use a solr.xml with multicore support.

Then we can move forward on deprecating the specification of Solr Home using JNDI/systemvars and switch to having the location of the solr.xml be the one master config option, with everything else coming after that.

-Hoss
Re: Facet Response Structure
As to point 1 - this is not a problem with the response structure I've outlined. This is exactly the problem I'm trying to solve. NULL is not a value in the field, it is a placeholder to indicate how many documents the field does not exist for. In my example response structure above, 'missing' is placed outside of the 'facets' list, clearing up the confusion. 'missing' could indeed be a facet value without any collisions.

To point 2 - I understand it would cause compatibility issues, that is why I was suggesting it be incorporated into the next SOLR release. I'd also be willing to work

Regarding the stats component, it does not do what you think it does. It reports a count of all values, not distinct values. The stats component also strictly works on numeric fields, which would make it impossible to use in a lot of cases where the FacetComponent does work.

Shalin Shekhar Mangar wrote:

On Sat, Sep 12, 2009 at 1:20 AM, smock harish.agar...@gmail.com wrote:

I'd like to propose a change to the facet response structure. Currently, it looks like:

    {'facet_fields':{'field1':[('value1',count1),('value2',count2),(null,missingCount)]}}

My immediate problem with this structure is that null is not of the same type as the 'value's. Also, the meaning of the (null,missingCount) tuple is not the same as the meaning of the ('value',count) tuples; it is a special case to represent the documents for which the field has no value. I'd like to propose changing the response to:

    {'facet_fields':{'field1':{'facets':[('value1',count1),('value2',count2)],'missing':missingCount}}}

Well, there are two problems:

1. 'missing' can be a value in the field
2. Facet support has been there for a long time. This would break compatibility with existing clients.

In addition to cleaning up the 'null' issue mentioned above, I think this will allow for greater flexibility moving forward with the facet component. For instance, it would be great if the FacetComponent could add an optional count of the 'hits', or number of distinct facet values contained in the query result. If the facet request has a limit on it, this number is not available via a count of the returned facet values. The response structure I've outlined above could accommodate this piece of metadata very easily:

    {'facet_fields':{'field1':{'facets':[('value1',count1),('value2',count2)],'missing':missingCount,'hits':hitsCount}}}

Have you looked at StatsComponent? It gives counts for total distinct values and the count of documents missing a value, among other things: http://wiki.apache.org/solr/StatsComponent

--
Regards, Shalin Shekhar Mangar.

--
View this message in context: http://www.nabble.com/Facet-Response-Structure-tp25407363p25414267.html
Sent from the Solr - User mailing list archive at Nabble.com.
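Until the response format changes (if it ever does), a client can apply the proposed separation on its own side. The sketch below is only an illustration under assumptions: FacetFieldSketch and fromFlat() are hypothetical names, and the input is modelled as a plain list of (value, count) entries in which a null value stands for the missing count, as in the current structure quoted above. It does not use any Solr or SolrJ API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side holder mirroring the proposed per-field structure. */
final class FacetFieldSketch {
    /** Real facet values and their counts, in response order. */
    final Map<String, Integer> facets = new LinkedHashMap<>();
    /** Number of documents with no value for the field (0 if not reported). */
    int missing;

    /** Splits the flat (value, count) list: the null-valued entry becomes 'missing'. */
    static FacetFieldSketch fromFlat(List<Map.Entry<String, Integer>> flat) {
        FacetFieldSketch result = new FacetFieldSketch();
        for (Map.Entry<String, Integer> entry : flat) {
            if (entry.getKey() == null) {
                result.missing = entry.getValue();
            } else {
                result.facets.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

Because the missing count lives in its own member rather than in the value list, a field that really contains the string 'missing' as a value does not collide with it, which is the point made in reply to problem 1 above.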
Re: standard requestHandler components
Hi Jay,

I got it from reading your response. I did browse around in solrconfig.xml but could not find any components configured for 'standard', and didn't realize that there are 'defaults' hardwired.

Thanks for your quick, detailed response and also your additional tip on the spellcheck config. You saved me lots of time on trial and error.

Regards,
Michael

Jay Hill wrote:

RequestHandlers are configured in solrconfig.xml. If no components are explicitly declared in the request handler config then the defaults are used. They are:

- QueryComponent
- FacetComponent
- MoreLikeThisComponent
- HighlightComponent
- StatsComponent
- DebugComponent

If you wanted to have a custom list of components (either omitting defaults or adding custom ones) you can specify the components for a handler directly:

    <arr name="components">
      <str>query</str>
      <str>facet</str>
      <str>mlt</str>
      <str>highlight</str>
      <str>debug</str>
      <str>someothercomponent</str>
    </arr>

You can add components before or after the main ones like this:

    <arr name="first-components">
      <str>mycomponent</str>
    </arr>
    <arr name="last-components">
      <str>myothercomponent</str>
    </arr>

and that's how the spell check component can be added:

    <arr name="last-components">
      <str>spellcheck</str>
    </arr>

Note that a component (except the defaults) must also be configured in solrconfig.xml with the name used in the str element.

Have a look at the solrconfig.xml in the example directory (.../example/solr/conf/) for examples on how to set up the spellcheck component, and on how the request handlers are configured.

-Jay
http://www.lucidimagination.com

On Fri, Sep 11, 2009 at 3:04 PM, michael8 mich...@saracatech.com wrote:

Hi,

I have a newbie question about the 'standard' requestHandler in solrconfig.xml. What I'd like to know is where the config information for this requestHandler is kept. When I go to http://localhost:8983/solr/admin, I see the following info, but am curious where the supposedly 'chained' components (e.g. QueryComponent, FacetComponent, MoreLikeThisComponent) are configured for this requestHandler. I see timing and process debug output from these components with debugQuery=true, so somewhere these components must have been configured for this 'standard' requestHandler.

    name: standard
    class: org.apache.solr.handler.component.SearchHandler
    version: $Revision: 686274 $
    description: Search using components: org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.DebugComponent,
    stats:
      handlerStart : 1252703405335
      requests : 3
      errors : 0
      timeouts : 0
      totalTime : 201
      avgTimePerRequest : 67.0
      avgRequestsPerSecond : 0.015179728

What I'd like to do, once I understand this, is to properly integrate the spellcheck component into the standard requestHandler as suggested in a Solr spellcheck example.

Thanks for any info in advance.

Michael

--
View this message in context: http://www.nabble.com/%22standard%22-requestHandler-components-tp25409075p25409075.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/%22standard%22-requestHandler-components-tp25409075p25414682.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr SVN build problem
Should be fixed in trunk. Try updating and see if it works for you.

See: https://issues.apache.org/jira/browse/SOLR-1424

On Sep 9, 2009, at 8:12 PM, Allahbaksh Asadullah wrote:

Hi,

I am building Solr from source. During the build I am getting the following error.

    generate-maven-artifacts:
        [mkdir] Created dir: c:\Downloads\solr_trunk\build\maven
        [mkdir] Created dir: c:\Downloads\solr_trunk\dist\maven
         [copy] Copying 1 file to c:\Downloads\solr_trunk\build\maven\c:\Downloads\solr_trunk\src\maven

    BUILD FAILED
    c:\Downloads\solr_trunk\build.xml:741: The following error occurred while executing this line:
    c:\Downloads\solr_trunk\common-build.xml:261: Failed to copy c:\Downloads\solr_trunk\src\maven\solr-parent-pom.xml.template to c:\Downloads\solr_trunk\build\maven\c:\Downloads\solr_trunk\src\maven\solr-parent-pom.xml.template due to java.io.FileNotFoundException c:\Downloads\solr_trunk\build\maven\c:\Downloads\solr_trunk\src\maven\solr-parent-pom.xml.template (The filename, directory name, or volume label syntax is incorrect)

Regards,
Allahbaksh
Re: Single Core or Multiple Core?
What do you mean by "single-core deployments do not have a way to enable CoreAdminHandler"? I'm just trying to understand the feature that you are talking about.

On Sat, Sep 12, 2009 at 6:44 AM, Uri Boness ubon...@gmail.com wrote:

+1

Can you add a JIRA issue for that so we can vote for it?

Chris Hostetter wrote:

: For the record: even if you're only going to have one SolrCore, using the
: multicore support (ie: having a solr.xml file) might prove handy from a
: maintenance standpoint ... the ability to configure new on deck cores with
...
: Yeah, it is a shame that single-core deployments (no solr.xml) do not have
: a way to enable CoreAdminHandler. This is something we should definitely
: look at in Solr 1.5.

I think the most straightforward starting point is to switch how we structure the examples so that all of the examples use a solr.xml with multicore support.

Then we can move forward on deprecating the specification of Solr Home using JNDI/systemvars and switch to having the location of the solr.xml be the one master config option, with everything else coming after that.

-Hoss
Re: Highlighting in SolrJ?
Will do, Shalin.

-Jay
http://www.lucidimagination.com

On Fri, Sep 11, 2009 at 9:23 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

Jay, it would be great if you can add this example to the Solrj wiki: http://wiki.apache.org/solr/Solrj

On Fri, Sep 11, 2009 at 5:15 AM, Jay Hill jayallenh...@gmail.com wrote:

Set up the query like this to highlight a field named content:

    import java.util.Iterator;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    SolrQuery query = new SolrQuery();
    query.setQuery("foo");
    query.setHighlight(true).setHighlightSnippets(1); // set other params as needed
    query.setParam("hl.fl", "content");
    // getSolrServer() is your own accessor for the SolrServer instance
    QueryResponse queryResponse = getSolrServer().query(query);

Then to get back the highlight results you need something like this:

    Iterator<SolrDocument> iter = queryResponse.getResults().iterator();
    while (iter.hasNext()) {
        SolrDocument resultDoc = iter.next();
        String content = (String) resultDoc.getFieldValue("content");
        String id = (String) resultDoc.getFieldValue("id"); // id is the uniqueKey field
        // Highlighting is keyed by uniqueKey, then by field name
        if (queryResponse.getHighlighting().get(id) != null) {
            List<String> highlightSnippets = queryResponse.getHighlighting().get(id).get("content");
        }
    }

Hope that gets you what you need.

-Jay
http://www.lucidimagination.com

On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com wrote:

Can somebody point me to some sample code for using highlighting in SolrJ? I understand the highlighted versions of the field come in a separate NamedList? How does that work?

--
http://www.linkedin.com/in/paultomblin

--
Regards, Shalin Shekhar Mangar.
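On the "how does that work?" part of the question: judging from the snippet in this thread, the highlighting section is a nested lookup, uniqueKey first, then field name, then a list of snippets. The following sketch illustrates only that lookup shape with plain java.util maps; it is an assumption-laden illustration rather than SolrJ code, and the class HighlightLookupSketch and its two methods are hypothetical helpers.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helpers over a docId -> field -> snippets map; no SolrJ types involved. */
final class HighlightLookupSketch {

    /** Returns the snippets for (docId, field), or an empty list when there are none. */
    static List<String> snippets(Map<String, Map<String, List<String>>> highlighting,
                                 String docId, String field) {
        Map<String, List<String>> perDoc = highlighting.get(docId);
        if (perDoc == null) {
            return Collections.<String>emptyList();
        }
        List<String> found = perDoc.get(field);
        return found == null ? Collections.<String>emptyList() : found;
    }

    /** Picks the first snippet for display, falling back to the stored field value. */
    static String display(Map<String, Map<String, List<String>>> highlighting,
                          String docId, String field, String storedValue) {
        List<String> found = snippets(highlighting, docId, field);
        return found.isEmpty() ? storedValue : found.get(0);
    }
}
```

Keeping the null checks in one place avoids repeating the `get(id) != null` test at every call site, and the fallback covers documents that matched the query but produced no snippet for the requested field.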
Re: Single Core or Multiple Core?
On Sat, Sep 12, 2009 at 9:45 PM, Jonathan Ariel ionat...@gmail.com wrote:

What do you mean by "single-core deployments do not have a way to enable CoreAdminHandler"? I'm just trying to understand the feature that you are talking about.

I'm talking about the core related commands described here: http://wiki.apache.org/solr/CoreAdmin

--
Regards, Shalin Shekhar Mangar.
Re: Facet Response Structure
On Sat, Sep 12, 2009 at 6:29 PM, smock harish.agar...@gmail.com wrote:

As to point 1 - this is not a problem with the response structure I've outlined. This is exactly the problem I'm trying to solve. NULL is not a value in the field, it is a placeholder to indicate how many documents the field does not exist for. In my example response structure above, 'missing' is placed outside of the 'facets' list, clearing up the confusion. 'missing' could indeed be a facet value without any collisions.

You are right, I missed that.

To point 2 - I understand it would cause compatibility issues, that is why I was suggesting it be incorporated into the next SOLR release. I'd also be willing to work

I'm not convinced that it is something that needs to be changed. I'm also not sure about the right way to deprecate a widely used response format. Go ahead and raise an issue if you want and we can collect thoughts from others.

Regarding the stats component, it does not do what you think it does. It reports a count of all values, not distinct values. The stats component also strictly works on numeric fields, which would make it impossible to use in a lot of cases where the FacetComponent does work.

Yes, my bad. Though it does report the count of missing values.

--
Regards, Shalin Shekhar Mangar.