Re: Extract info from parent node during data import

2009-09-12 Thread Fergus McMenemie
On Fri, Sep 11, 2009 at 6:48 AM, venn hardy venn.ha...@hotmail.com wrote:

 Hi Fergus,

 When I debugged in the development console 
 http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport

 I had no problems. Each category/item seems to be only indexed once, and no 
 parent fields are available (except the category name).

 I am not entirely sure how the forEach statement works, but my 
 interpretation of forEach=/document/category/item | /document/category is 
 something like this:

 1. Whenever DIH encounters a document/category it will extract the 
 /document/category/name field as a common field
 2. Whenever DIH encounters a document/category/item it will extract all of 
 the item fields.
 3. When all fields have been encountered, save the document in solr and go 
 to the next category/item

/document/category/item | /document/category

means there are two paths which trigger a new doc (it is possible to
have more). Whenever it encounters the closing tag of that xpath, it
emits all the fields it collected since the opening of the same tag.
After that it clears all the fields it collected since the opening of
the tag.

If there are fields it collected before the opening of the same tag, it retains them.


Nice and clear, but that is not what I see.

With my test case with forEach=/record | /record/mediaBlock
I see that for each /record/mediaBlock document indexed it contains all fields
from the parent /record document as well. A search over mediaBlocks returns
lots of extra fields from the parent which did not have the commonField
attribute. I will try and produce a testcase.




 Date: Thu, 10 Sep 2009 14:19:31 +0100
 To: solr-user@lucene.apache.org
 From: fer...@twig.me.uk
 Subject: RE: Extract info from parent node during data import

 Hi Paul,
 The forEach=/document/category/item | /document/category/name didn't 
 work (no categoryname was stored or indexed).
 However forEach=/document/category/item | /document/category seems to 
 work well. I am not sure why category on its own works, but not 
 category/name...
 But thanks for tip. It wasn't as painful as I thought it would be.
 Venn

 Hmmm, I had bother with this. Although each occurrence of
 /document/category/item causes a new solr document to be indexed, that
 document contained all the fields from the parent element as well.

 Did you see this?

 
  From: noble.p...@corp.aol.com
  Date: Thu, 10 Sep 2009 09:58:21 +0530
  Subject: Re: Extract info from parent node during data import
  To: solr-user@lucene.apache.org
 
  try this
 
  add two xpaths in your forEach
 
  forEach=/document/category/item | /document/category/name
 
  and add a field as follows
 
  <field column="catgoryname" xpath="/document/category/name"
  commonField="true"/>
 
  Please try it out and let me know.
 
  On Thu, Sep 10, 2009 at 7:30 AM, venn hardy venn.ha...@hotmail.com 
  wrote:
  
   Hello,
  
  
  
   I am using SOLR 1.4 (from nightly build) and its URLDataSource in 
   conjunction with the XPathEntityProcessor. I have successfully 
   imported XML content, but I think I may have found a limitation when 
   it comes to the commonField attribute in the DataImportHandler.
  
  
  
   Before writing my own parser to read in a whole XML document, I 
   thought I'd post the question here (since I got some great advice last 
   time).
  
  
  
   The bulk of my content is contained within each item tag. However, 
   each item has a parent called category and each category has a name 
   which I would like to import. In my forEach loop I specify the 
   /document/category/item as the collection of items I am interested in. 
   Is there any way to extract an element from underneath a parent node? 
   To be more specific (see the example XML below), I would like to index 
   the following:
  
   - category: Category 1; id: 1; author: Author 1
  
   - category: Category 1; id: 2; author: Author 2
  
   - category: Category 2; id: 3; author: Author 3
  
   - category: Category 2; id: 4; author: Author 4
  
  
  
   Any ideas on how I can get to a parent node from within a child during 
   data import? If it can't be done, what do you suggest would be the best 
   way so I can keep using the DataImportHandler... would XSLT be a good 
   idea to 'flatten out' the structure a bit?
  
  
  
   Thanks
  
  
  
   This is what my XML document looks like:
  
   <document>
     <category>
       <name>Category 1</name>
       <item>
         <id>1</id>
         <author>Author 1</author>
       </item>
       <item>
         <id>2</id>
         <author>Author 2</author>
       </item>
     </category>
     <category>
       <name>Category 2</name>
       <item>
         <id>3</id>
         <author>Author 3</author>
       </item>
       <item>
         <id>4</id>
         <author>Author 4</author>
       </item>
     </category>
   </document>
  
  
  
   And this is what my dataConfig looks like:
   <dataConfig>
   <dataSource type="URLDataSource" />
   <document>
   <entity name="archive" pk="id" 
   url="http://localhost:9080/data/20090817070752.xml" 
   processor="XPathEntityProcessor" forEach="/document/category/item" 
   

Re: Extract info from parent node during data import

2009-09-12 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Sat, Sep 12, 2009 at 12:24 PM, Fergus McMenemie fer...@twig.me.uk wrote:
On Fri, Sep 11, 2009 at 6:48 AM, venn hardy venn.ha...@hotmail.com wrote:

 Hi Fergus,

 When I debugged in the development console 
 http://localhost:9080/solr/admin/dataimport.jsp?handler=/dataimport

 I had no problems. Each category/item seems to be only indexed once, and no 
 parent fields are available (except the category name).

 I am not entirely sure how the forEach statement works, but my 
 interpretation of forEach=/document/category/item | /document/category is 
 something like this:

 1. Whenever DIH encounters a document/category it will extract the 
 /document/category/name field as a common field
 2. Whenever DIH encounters a document/category/item it will extract all of 
 the item fields.
 3. When all fields have been encountered, save the document in solr and go 
 to the next category/item

/document/category/item | /document/category

means there are two paths which trigger a new doc (it is possible to
have more). Whenever it encounters the closing tag of that xpath, it
emits all the fields it collected since the opening of the same tag.
After that it clears all the fields it collected since the opening of
the tag.

If there are fields it collected before the opening of the same tag, it retains them.


 Nice and clear, but that is not what I see.

 With my test case with forEach=/record | /record/mediaBlock
 I see that for each /record/mediaBlock document indexed it contains all
 fields from the parent /record document as well. A search over mediaBlocks
 returns lots of extra fields from the parent which did not have the
 commonField attribute. I will try and produce a testcase

Yes, it does. /record/mediaBlock will have all the fields collected
from /record as well. It is by design.
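
The behaviour described in this reply can be tried outside Solr. What follows is a minimal Python simulation, not DIH code. It assumes every leaf element is a field (real DIH collects only the fields declared with an xpath), and the names simulate_for_each and SAMPLE_XML are made up for illustration. In line with the reply, a child document carries whatever its ancestors have collected so far, and closing a forEach element drops only the fields gathered since that element opened.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from this thread, with the markup restored.
SAMPLE_XML = """<document>
  <category>
    <name>Category 1</name>
    <item><id>1</id><author>Author 1</author></item>
    <item><id>2</id><author>Author 2</author></item>
  </category>
  <category>
    <name>Category 2</name>
    <item><id>3</id><author>Author 3</author></item>
    <item><id>4</id><author>Author 4</author></item>
  </category>
</document>"""


def simulate_for_each(xml_text, for_each):
    """Return one dict per closing tag whose path is listed in for_each."""
    triggers = {p.strip() for p in for_each.split("|")}
    path = []        # names of the elements that are currently open
    collected = []   # (depth of owning element, field name, value)
    docs = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        current = "/" + "/".join(path)
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:
            # Assumption: any leaf element counts as a field of its parent.
            collected.append((len(path) - 1, elem.tag, text))
        if current in triggers:
            # Emit everything held, including fields kept from ancestors.
            docs.append({name: value for _, name, value in collected})
            # Clear only what was collected since this element opened.
            collected = [f for f in collected if f[0] < len(path)]
        path.pop()
    return docs


for doc in simulate_for_each(
        SAMPLE_XML, "/document/category/item | /document/category"):
    print(doc)
```

With forEach=/document/category/item | /document/category the sketch yields one document per item, each carrying the category name, plus one name-only document each time a category closes.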




 Date: Thu, 10 Sep 2009 14:19:31 +0100
 To: solr-user@lucene.apache.org
 From: fer...@twig.me.uk
 Subject: RE: Extract info from parent node during data import

 Hi Paul,
 The forEach=/document/category/item | /document/category/name didn't 
 work (no categoryname was stored or indexed).
 However forEach=/document/category/item | /document/category seems to 
 work well. I am not sure why category on its own works, but not 
 category/name...
 But thanks for tip. It wasn't as painful as I thought it would be.
 Venn

 Hmmm, I had bother with this. Although each occurrence of
 /document/category/item causes a new solr document to be indexed, that
 document contained all the fields from the parent element as well.

 Did you see this?

 
  From: noble.p...@corp.aol.com
  Date: Thu, 10 Sep 2009 09:58:21 +0530
  Subject: Re: Extract info from parent node during data import
  To: solr-user@lucene.apache.org
 
  try this
 
  add two xpaths in your forEach
 
  forEach=/document/category/item | /document/category/name
 
  and add a field as follows
 
  <field column="catgoryname" xpath="/document/category/name"
  commonField="true"/>
 
  Please try it out and let me know.
 
  On Thu, Sep 10, 2009 at 7:30 AM, venn hardy venn.ha...@hotmail.com 
  wrote:
  
   Hello,
  
  
  
   I am using SOLR 1.4 (from nightly build) and its URLDataSource in 
   conjunction with the XPathEntityProcessor. I have successfully 
   imported XML content, but I think I may have found a limitation when 
   it comes to the commonField attribute in the DataImportHandler.
  
  
  
   Before writing my own parser to read in a whole XML document, I 
   thought I'd post the question here (since I got some great advice 
   last time).
  
  
  
   The bulk of my content is contained within each item tag. However, 
   each item has a parent called category and each category has a name 
   which I would like to import. In my forEach loop I specify the 
   /document/category/item as the collection of items I am interested 
   in. Is there any way to extract an element from underneath a parent 
   node? To be more specific (see the example XML below), I would like to 
   index the following:
  
   - category: Category 1; id: 1; author: Author 1
  
   - category: Category 1; id: 2; author: Author 2
  
   - category: Category 2; id: 3; author: Author 3
  
   - category: Category 2; id: 4; author: Author 4
  
  
  
   Any ideas on how I can get to a parent node from within a child 
   during data import? If it can't be done, what do you suggest would be 
   the best way so I can keep using the DataImportHandler... would XSLT 
   be a good idea to 'flatten out' the structure a bit?
  
  
  
   Thanks
  
  
  
   This is what my XML document looks like:
  
   <document>
     <category>
       <name>Category 1</name>
       <item>
         <id>1</id>
         <author>Author 1</author>
       </item>
       <item>
         <id>2</id>
         <author>Author 2</author>
       </item>
     </category>
     <category>
       <name>Category 2</name>
       <item>
         <id>3</id>
         <author>Author 3</author>
       </item>
       <item>
         <id>4</id>
         <author>Author 4</author>
       </item>
     </category>
   </document>
  
  
  
   And this is what my dataConfig looks like:
   <dataConfig>
   <dataSource type="URLDataSource" />
   

Re: Single Core or Multiple Core?

2009-09-12 Thread Uri Boness

+1
Can you add a JIRA issue for that so we can vote for it?

Chris Hostetter wrote:

:  For the record: even if you're only going to have one SolrCore, using the
:  multicore support (ie: having a solr.xml file) might prove handy from a
:  maintenance standpoint ... the ability to configure new on deck cores with
...
: Yeah, it is a shame that single-core deployments (no solr.xml) do not have
: a way to enable CoreAdminHandler. This is something we should definitely
: look at in Solr 1.5.

I think the most straightforward starting point is to switch how we 
structure the examples so that all of the examples use a solr.xml with 
multicore support.


Then we can move forward on deprecating the specification of Solr Home 
using JNDI/systemvars and switch to having the location of the solr.xml be 
the one master config option with everything else coming after that.




-Hoss


  


Re: Facet Response Structure

2009-09-12 Thread smock

As to point 1 - this is not a problem with the response structure I've
outlined.  This is exactly the problem I'm trying to solve.  NULL is not a
value in the field, it is a placeholder to indicate how many documents the
field does not exist for.  In my example response structure above, 'missing'
is placed outside of the 'facets' list, clearing up the confusion. 
'missing' could indeed be a facet value without any collisions.

To point 2 - I understand it would cause compatibility issues, that is why I
was suggesting it be incorporated into the next SOLR release.  I'd also be
willing to work 

Regarding the stats component, it does not do what you think it does.  It
reports a count of all values, not distinct values.  The stats component
also strictly works on numeric fields, which would make it impossible to use
in a lot of cases where the FacetComponent does work.
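
To make the proposal quoted below concrete, here is a small Python sketch (the responses in this thread are written as Python literals). It is an illustration of the reshaping only, not Solr code: restructure_facets is a made-up helper, and it assumes the current format marks the missing count with a None value, as in the (null,missingCount) tuple. The 'hits' count is left out because it cannot be derived on the client when the facet request has a limit.

```python
def restructure_facets(facet_fields):
    """Reshape {'field': [(value, count), ..., (None, missing)]} into the
    proposed {'field': {'facets': [...], 'missing': missing}} form."""
    proposed = {}
    for field, pairs in facet_fields.items():
        # Real facet values keep their (value, count) tuples.
        entry = {"facets": [(v, c) for v, c in pairs if v is not None]}
        # The None placeholder moves out of the list into its own key.
        missing_counts = [c for v, c in pairs if v is None]
        if missing_counts:
            entry["missing"] = missing_counts[0]
        proposed[field] = entry
    return proposed


# A field whose values include the literal string 'missing'.
current = {"field1": [("value1", 10), ("missing", 3), (None, 7)]}
print(restructure_facets(current))
```

Because the count of documents without a value lives under its own key, a field value that happens to be the string 'missing' stays inside the 'facets' list and does not collide with it.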


Shalin Shekhar Mangar wrote:
 
 On Sat, Sep 12, 2009 at 1:20 AM, smock harish.agar...@gmail.com wrote:
 

 I'd like to propose a change to the facet response structure.  Currently,
 it looks like:

 {'facet_fields':{'field1':[('value1',count1),('value2',count2),(null,missingCount)]}}

 My immediate problem with this structure is that null is not of the same
 type as the 'value's.  Also, the meaning of the (null,missingCount) tuple
 is not the same as the meaning of the ('value',count) tuples; it is a
 special case to represent the documents for which the field has no value.
 I'd like to propose changing the response to:

 {'facet_fields':{'field1':{'facets':[('value1',count1),('value2',count2)],'missing':missingCount}}}


 Well, there are two problems:
 1. 'missing' can be a value in the field
 2. Facet support has been there for a long time. This would break
 compatibility with existing clients.
 
 

 In addition to cleaning up the 'null' issue mentioned above, I think this
 will allow for greater flexibility moving forward with the facet component.
 For instance, it would be great if the FacetComponent could add an optional
 count of the 'hits', or number of distinct facet values contained in the
 query result.  If the facet request has a limit on it, this number is not
 available via a count of the returned facet values.  The response structure
 I've outlined above could accommodate this piece of metadata very easily:

 {'facet_fields':{'field1':{'facets':[('value1',count1),('value2',count2)],'missing':missingCount,'hits':hitsCount}}}


 Have you looked at StatsComponent? It gives counts for total distinct
 values and count of documents missing a value among other things:
 
 http://wiki.apache.org/solr/StatsComponent
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/Facet-Response-Structure-tp25407363p25414267.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: standard requestHandler components

2009-09-12 Thread michael8

Hi Jay,

I got it from reading your response.  I did browse around in solrconfig.xml
but could not find any components configured for 'standard', and didn't
realize that there are 'defaults' hardwired.  Thanks for your quick,
detailed response and also your additional tip on spellcheck config.  You
saved me lots of time on trial and error.

Regards,
Michael


Jay Hill wrote:
 
 RequestHandlers are configured in solrconfig.xml. If no components are
 explicitly declared in the request handler config, the defaults are used.
 They are:
 - QueryComponent
 - FacetComponent
 - MoreLikeThisComponent
 - HighlightComponent
 - StatsComponent
 - DebugComponent
 
 If you wanted to have a custom list of components (either omitting
 defaults or adding custom) you can specify the components for a handler
 directly:
 <arr name="components">
   <str>query</str>
   <str>facet</str>
   <str>mlt</str>
   <str>highlight</str>
   <str>debug</str>
   <str>someothercomponent</str>
 </arr>
 
 You can add components before or after the main ones like this:
 <arr name="first-components">
   <str>mycomponent</str>
 </arr>
 
 <arr name="last-components">
   <str>myothercomponent</str>
 </arr>
 
 and that's how the spell check component can be added:
 <arr name="last-components">
   <str>spellcheck</str>
 </arr>
 
 Note that a component (except the defaults) must be configured in
 solrconfig.xml with the name used in the <str> element as well.
 
 Have a look at the solrconfig.xml in the example directory
 (.../example/solr/conf/) for examples on how to set up the spellcheck
 component, and on how the request handlers are configured.
 
 -Jay
 http://www.lucidimagination.com
 
 
 On Fri, Sep 11, 2009 at 3:04 PM, michael8 mich...@saracatech.com wrote:
 

 Hi,

 I have a newbie question about the 'standard' requestHandler in
 solrconfig.xml.  What I'd like to know is where the config information for
 this requestHandler is kept.  When I go to http://localhost:8983/solr/admin,
 I see the following info, but am curious where the supposedly 'chained'
 components (e.g. QueryComponent, FacetComponent, MoreLikeThisComponent) are
 configured for this requestHandler.  I see timing and process debug output
 from these components with debugQuery=true, so somewhere these components
 must have been configured for this 'standard' requestHandler.

 name:standard
 class:  org.apache.solr.handler.component.SearchHandler
 version:$Revision: 686274 $
 description:Search using components:

 org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.DebugComponent,
 stats:  handlerStart : 1252703405335
 requests : 3
 errors : 0
 timeouts : 0
 totalTime : 201
 avgTimePerRequest : 67.0
 avgRequestsPerSecond : 0.015179728


 What I'd like to do with this understanding is to properly integrate the
 spellcheck component into the standard requestHandler as suggested in a
 solr spellcheck example.

 Thanks for any info in advance.
 Michael


 
 




Re: Solr SVN build problem

2009-09-12 Thread Ryan McKinley

Should be fixed in trunk.  Try updating and see if it works for you

See:
https://issues.apache.org/jira/browse/SOLR-1424



On Sep 9, 2009, at 8:12 PM, Allahbaksh Asadullah wrote:


Hi,
I am building Solr from source. During the build I am getting the
following error.

generate-maven-artifacts:
    [mkdir] Created dir: c:\Downloads\solr_trunk\build\maven
    [mkdir] Created dir: c:\Downloads\solr_trunk\dist\maven
     [copy] Copying 1 file to c:\Downloads\solr_trunk\build\maven\c:\Downloads\solr_trunk\src\maven

BUILD FAILED
c:\Downloads\solr_trunk\build.xml:741: The following error occurred while executing this line:
c:\Downloads\solr_trunk\common-build.xml:261: Failed to copy c:\Downloads\solr_trunk\src\maven\solr-parent-pom.xml.template to c:\Downloads\solr_trunk\build\maven\c:\Downloads\solr_trunk\src\maven\solr-parent-pom.xml.template due to java.io.FileNotFoundException c:\Downloads\solr_trunk\build\maven\c:\Downloads\solr_trunk\src\maven\solr-parent-pom.xml.template (The filename, directory name, or volume label syntax is incorrect)

Regards,
Allahbaksh




Re: Single Core or Multiple Core?

2009-09-12 Thread Jonathan Ariel
What do you mean by "single-core deployments do not have a way to enable
CoreAdminHandler"? I'm just trying to understand the feature that you are
talking about.

On Sat, Sep 12, 2009 at 6:44 AM, Uri Boness ubon...@gmail.com wrote:

 +1
 Can you add a JIRA issue for that so we can vote for it?


 Chris Hostetter wrote:

 :  For the record: even if you're only going to have one SolrCore, using
 :  the multicore support (ie: having a solr.xml file) might prove handy from
 :  a maintenance standpoint ... the ability to configure new on deck cores
 :  with
...
 : Yeah, it is a shame that single-core deployments (no solr.xml) do not
 : have a way to enable CoreAdminHandler. This is something we should
 : definitely look at in Solr 1.5.

 I think the most straightforward starting point is to switch how we
 structure the examples so that all of the examples use a solr.xml with
 multicore support.

 Then we can move forward on deprecating the specification of Solr Home
 using JNDI/systemvars and switch to having the location of the solr.xml be
 the one master config option with everything else coming after that.



 -Hoss







Re: Highlighting in SolrJ?

2009-09-12 Thread Jay Hill
Will do Shalin.

-Jay
http://www.lucidimagination.com


On Fri, Sep 11, 2009 at 9:23 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Jay, it would be great if you can add this example to the Solrj wiki:

 http://wiki.apache.org/solr/Solrj

 On Fri, Sep 11, 2009 at 5:15 AM, Jay Hill jayallenh...@gmail.com wrote:

  Set up the query like this to highlight a field named content:
 
 SolrQuery query = new SolrQuery();
 query.setQuery("foo");
 
 // set other params as needed
 query.setHighlight(true).setHighlightSnippets(1);
 query.setParam("hl.fl", "content");
 
 QueryResponse queryResponse = getSolrServer().query(query);
 
  Then to get back the highlight results you need something like this:
 
 Iterator<SolrDocument> iter = queryResponse.getResults().iterator();
 
 while (iter.hasNext()) {
   SolrDocument resultDoc = iter.next();
 
   String content = (String) resultDoc.getFieldValue("content");
   // id is the uniqueKey field
   String id = (String) resultDoc.getFieldValue("id");
 
   // highlighting is keyed by uniqueKey, then by field name
   if (queryResponse.getHighlighting().get(id) != null) {
     List<String> highlightSnippets =
         queryResponse.getHighlighting().get(id).get("content");
   }
 }
 
  Hope that gets you what you need.
 
  -Jay
  http://www.lucidimagination.com
 
  On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com
 wrote:
 
   Can somebody point me to some sample code for using highlighting in
   SolrJ?  I understand the highlighted versions of the field comes in a
   separate NamedList?  How does that work?
  
   --
   http://www.linkedin.com/in/paultomblin
  
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Single Core or Multiple Core?

2009-09-12 Thread Shalin Shekhar Mangar
On Sat, Sep 12, 2009 at 9:45 PM, Jonathan Ariel ionat...@gmail.com wrote:

 What do you mean by "single-core deployments do not have a way to enable
 CoreAdminHandler"? I'm just trying to understand the feature that you are
 talking about.


I'm talking about the core related commands described here:

http://wiki.apache.org/solr/CoreAdmin

-- 
Regards,
Shalin Shekhar Mangar.


Re: Facet Response Structure

2009-09-12 Thread Shalin Shekhar Mangar
On Sat, Sep 12, 2009 at 6:29 PM, smock harish.agar...@gmail.com wrote:


 As to point 1 - this is not a problem with the response structure I've
 outlined.  This is exactly the problem I'm trying to solve.  NULL is not a
 value in the field, it is a placeholder to indicate how many documents the
 field does not exist for.  In my example response structure above,
 'missing' is placed outside of the 'facets' list, clearing up the confusion.
 'missing' could indeed be a facet value without any collisions.


You are right, I missed that.


 To point 2 - I understand it would cause compatibility issues, that is why
 I was suggesting it be incorporated into the next SOLR release.  I'd also be
 willing to work


I'm not convinced that it is something that needs to be changed. I'm also
not sure about the right way to deprecate a widely used response format. Go
ahead and raise an issue if you want and we can collect thoughts from
others.


 Regarding the stats component, it does not do what you think it does.  It
 reports a count of all values, not distinct values.  The stats component
 also strictly works on numeric fields, which would make it impossible to
 use in a lot of cases where the FacetComponent does work.


Yes, my bad. Though it does report the count of missing values.

-- 
Regards,
Shalin Shekhar Mangar.