Re: Retrieving a field from all result documents & couple of more queries
Hi, 1) Solr has various types of caches. We can specify how many documents a cache can hold at a time, e.g. if windowSize=50, 50 results will be cached in the queryResultCache. If the user makes a new request to the server for results beyond those 50 documents, a new request will be sent to the server & the server will place the next 50 results in the cache. http://wiki.apache.org/solr/SolrCaching Yes, Solr looks into the cache to retrieve the fields to be returned. 2) Yes, we can have different tokenizers or filters for index & search. We need not create a different fieldtype. We need to configure the index & query analyzer sections of the same fieldtype (datatype) differently, e.g.:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="Synonyms.txt" ignoreCase="true" expand="false"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Regards, Abhay On Tue, Sep 15, 2009 at 6:41 PM, Shashikant Kore shashik...@gmail.com wrote: Hi, I am familiar with Lucene and am trying out Solr. I have an index which was created outside Solr. The index is fairly simple, with two fields - document_id & content. The query result needs to return all the document IDs. The result need not be ordered by the score. For this, in Lucene, I use a custom hit collector with the search to get results quickly. The index has a few million documents and queries returning hundreds of thousands of documents are not uncommon. So, speed is crucial here. Since retrieving the document_id for each document is slow, I am using a FieldCache to store the values of document_id. For all the results collected (in a bitset) with the hit collector, the document_id field is retrieved from the field cache. 1. How can I effectively disable scoring? I have read that ConstantScoreQuery is quite fast, but from the code, I see that it is used only for wildcard queries. How can I use ConstantScoreQuery for all the queries (boolean, term, phrase, ...)? Also, is ConstantScoreQuery as fast as a custom hit collector? 2. How can Solr take advantage of the field cache while returning the field document_id? The documentation says the field cache can be explicitly auto-warmed with Solr. If the field cache is available and initialized at the beginning, will Solr look into the cache to retrieve the fields to be returned? 3. If there is an additional field for stemmed_content on which search needs to use a different analyzer, I suppose that could be specified by the fieldType attribute in the schema. Thank you, --shashi
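For readers looking for the knobs Abhay mentions, here is a minimal sketch of the relevant solrconfig.xml settings; the cache sizes and the warming query below are illustrative assumptions, not taken from either poster's configuration:

<!-- results are cached in pages of queryResultWindowSize documents -->
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
<queryResultWindowSize>50</queryResultWindowSize>

<!-- one common way to populate the Lucene FieldCache for a field before user
     queries arrive is a static warming query that sorts on that field -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">document_id asc</str></lst>
  </arr>
</listener>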
Re: How to create a new index file automatically
It can import documents in certain other formats using the http://wiki.apache.org/solr/ExtractingRequestHandler 1) According to my inference, Solr uses Apache Tika to convert other rich document formats to text, so that the class ExtractingRequestHandler can use the extracted text to create the index files. 2) If point 1 is correct, then I think this could suit my requirements, since I need to index rich document files, especially the .xls format. But I can't find the class ExtractingRequestHandler, which has to be configured in the solrconfig.xml file so that I can import XLS documents through the servlet http://localhost:8983/solr/update/extract -- View this message in context: http://www.nabble.com/How-to-create-a-new-index-file-automatically-tp25455045p25466714.html Sent from the Solr - User mailing list archive at Nabble.com.
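As a hedged pointer on the configuration question above: the handler ships with the "Solr Cell" extraction contrib (the solr-cell and Tika jars need to be on the classpath), and its solrconfig.xml registration looks roughly like the sketch below. The fmap/uprefix values are only illustrative defaults, not required settings:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map the extracted body into an existing text field -->
    <str name="fmap.content">text</str>
    <!-- prefix extracted metadata fields that are not defined in the schema -->
    <str name="uprefix">attr_</str>
  </lst>
</requestHandler>

A spreadsheet could then be posted with something like curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "file=@report.xls" (the id and file name here are made up for the example).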
Re: Solr exception with missing required field (meta_guid_s)
On Wed, Sep 16, 2009 at 1:13 AM, kedardes kedar.w...@gmail.com wrote: Hi, I have a data-config file where I map the fields of a very simple table using dynamic field definitions:

<document name="names">
  <entity name="names" query="select * from test">
    <field column="id" name="id_i" />
    <field column="name" name="name_s" />
    <field column="city" name="city_s" />
  </entity>
</document>

but when I run the dataimport I get this error: WARNING: Error creating document : SolrInputDocument[{id_i=id_i(1.0)={2}, name_s=name_s(1.0)={John Smith}, city_s=city_s(1.0)={Newark}}] org.apache.solr.common.SolrException: Document [null] missing required field: meta_guid_s From the schema.xml I see that the meta_guid_s field is defined as a global unique ID, but does this have to be set explicitly or mapped to a particular field? You have created that schema, so you are the better person to answer that question. As far as a required field or uniqueKey is concerned, their values have to be set or copied from another field. -- Regards, Shalin Shekhar Mangar.
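To make Shalin's last sentence concrete, here is a hedged sketch of the two usual ways to get a value into a required field like meta_guid_s; the column and field names simply mirror the example above and may not match the real schema:

<!-- option 1, in data-config.xml: map an existing column onto the required field -->
<field column="id" name="meta_guid_s" />

<!-- option 2, in schema.xml: copy the value from another field -->
<copyField source="id_i" dest="meta_guid_s" />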
Re: Questions on copyField
Would appreciate any help on this. Thanks Rahul On Mon, Sep 14, 2009 at 5:12 PM, Rahul R rahul.s...@gmail.com wrote: Hello, I have a few questions regarding the copyField directive in schema.xml 1. Does the destination field store a reference or the actual data? If I have something like this: <copyField source="name" dest="text"/> then will the values in the 'name' field get copied into the 'text' field, or will the 'text' field only store a reference to the 'name' field? To put it more simply, if I later delete the 'name' field from the index, will I lose the corresponding data in the 'text' field? 2. Is there any inbuilt API which I can use to do the copyField action programmatically? 3. Can I do a copyField from the schema as well as programmatically for the same destination field? Suppose I want the 'text' field to contain values for name, age and location. In my index only 'name' and 'age' are defined as fields. So I can add directives like <copyField source="name" dest="text"/> <copyField source="age" dest="text"/> The location, however, I want to add to the 'text' field programmatically. I don't want to store the location as a separate field in the index. Can I do this? Thank you. Regards Rahul
Re: Solr results filtered on MoreLikeThis
Hi All, Should I create a plugin for this, or is there some functionality in Solr that can help me? I basically already have part of what I want. The search response gives me a result list with (in my situation) 20 results and the attached morelikethis NamedList. Filtering based on the morelikethis 'duplicates' may leave 12 results, meaning my result list is not complete any more, since I requested 20 results. So now I need to do a new search, which I need to filter yet again, and so on and so forth until I get a result of 20. This is not a very robust implementation. Can I do something like this on the Solr side (via a plugin)? For instance, filter the Lucene hits based on the morelikethis or something like that, so that I can return exactly 20 results, while also adding the morelikethis to the response. Grouping based on the morelikethis would even be a nice-to-have, using the collapsing field functionality once it is fully implemented in Solr. I hope someone can give me some pointers in the right direction. Kind Regards, Marcel -- View this message in context: http://www.nabble.com/Solr-results-filtered-on-MoreLikeThis-tp25434881p25467907.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Questions on copyField
On Mon, Sep 14, 2009 at 5:12 PM, Rahul R rahul.s...@gmail.com wrote: Hello, I have a few questions regarding the copyField directive in schema.xml 1. Does the destination field store a reference or the actual data? It makes a copy. Storing or indexing of the field depends on the field configuration. If I have something like this: <copyField source="name" dest="text"/> then will the values in the 'name' field get copied into the 'text' field, or will the 'text' field only store a reference to the 'name' field? To put it more simply, if I later delete the 'name' field from the index, will I lose the corresponding data in the 'text' field? The values will get copied. If you delete all values of the 'name' field from the index, the data in the text field remains as-is. 2. Is there any inbuilt API which I can use to do the copyField action programmatically? No. But you can always copy explicitly before sending, or you can use a custom UpdateRequestProcessor to copy values from one field to another during indexing. 3. Can I do a copyField from the schema as well as programmatically for the same destination field? Suppose I want the 'text' field to contain values for name, age and location. In my index only 'name' and 'age' are defined as fields. So I can add directives like <copyField source="name" dest="text"/> <copyField source="age" dest="text"/> The location, however, I want to add to the 'text' field programmatically. I don't want to store the location as a separate field in the index. Can I do this? You can send the location's value directly as the value of the text field. Also note that you don't really need to index/store the source field. You can make the location field's type "ignored" in the schema. -- Regards, Shalin Shekhar Mangar.
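A hedged illustration of what Shalin suggests (field names follow Rahul's example; the "ignored" type shown is the one shipped in the example schema):

<!-- schema.xml: copy name and age into the catch-all field -->
<copyField source="name" dest="text" />
<copyField source="age" dest="text" />

<!-- an "ignored" type: the location field itself is neither indexed nor stored,
     but its incoming value can still be copied into the text field -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true" />
<field name="location" type="ignored" />
<copyField source="location" dest="text" />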
Need help to finalize my autocomplete
Hello, I'm using the following code for my autocomplete feature. The field type:

<fieldType name="autoComplete" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="2" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all" />
  </analyzer>
</fieldType>

The field:

<dynamicField name="*_ac" type="autoComplete" indexed="true" stored="true" />

The query:

?q=*:*&fq=query_ac:harry*&wt=json&rows=15&start=0&fl=*&indent=on&fq=model:SearchQuery

It gives me a list of results I can parse and use with the jQuery autocomplete plugin, and all that works very well. Example of results:
harry
harry potter
the last fighting harry
harry potter 5
comic relief harry potter
What I would like to do now is to have only results starting with the query, so it should be:
harry
harry potter
harry potter 5
Can anybody tell me if it is possible and, if so, how to do it? Thank you! Vincent And -- View this message in context: http://www.nabble.com/Need-help-to-finalize-my-autocomplete-tp25468885p25468885.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need help to finalize my autocomplete
Instead of tokenizer class=solr.WhitespaceTokenizerFactory / use tokenizer class=solr.KeywordTokenizerFactory/ Cheers Avlesh 2009/9/16 Vincent Pérès vincent.pe...@gmail.com Hello, I'm using the following code for my autocomplete feature : The field type : fieldType name=autoComplete class=solr.TextField omitNorms=true analyzer tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.LowerCaseFilterFactory / filter class=solr.EdgeNGramFilterFactory maxGramSize=20 minGramSize=2 / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.LowerCaseFilterFactory / filter class=solr.PatternReplaceFilterFactory pattern=^(.{20})(.*)? replacement=$1 replace=all / /analyzer /fieldType The field : dynamicField name=*_ac type=autoComplete indexed=true stored=true / The query : ?q=*:*fq=query_ac:harry*wt=jsonrows=15start=0fl=*indent=onfq=model:SearchQuery It gives me a list of results I can parse and use with jQuery autocomplete plugin and all that works very well. Example of results : harry harry potter the last fighting harry harry potter 5 comic relief harry potter What I would like to do now is only to have results starting with the query, so it should be : harry harry potter harry potter 5 Can anybody tell me if it is possible and so how to do it ? Thank you ! Vincent And -- View this message in context: http://www.nabble.com/Need-help-to-finalize-my-autocomplete-tp25468885p25468885.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need help to finalize my autocomplete
Hello, I tried to replace the class as you suggested, but I still get the same result (and not results that start only with the given query).

<fieldType name="autoComplete" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="2" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all" />
  </analyzer>
</fieldType>

-- View this message in context: http://www.nabble.com/Need-help-to-finalize-my-autocomplete-tp25468885p25469239.html Sent from the Solr - User mailing list archive at Nabble.com.
Mapping SolrDoc to SolrInputDoc
Hi there, currently I'm working on a small app which creates an embedded Solr server, reads all documents from one core and puts these docs into another one. The purpose of this app is to apply (small) schema.xml changes to indexed data (offline), resulting in a new index with documents updated to the schema.xml changes. What I want to know is if there is an easy way to map a SolrDocument to a SolrInputDocument. Any help would be much appreciated -- Lici
Re: Need help to finalize my autocomplete
2009/9/16 Vincent Pérès vincent.pe...@gmail.com Hello, I tried to replace the class as you suggested, but I still get the same result (and not results that start only with the given query). Make sure you re-index your documents after changing the schema. -- Regards, Shalin Shekhar Mangar.
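For anyone following along, re-indexing after a schema change just means clearing the old documents and posting them again; a minimal sketch of the XML update messages involved (sent to the standard /update handler):

<!-- clear the existing index -->
<delete><query>*:*</query></delete>
<!-- ... then re-post all documents and commit -->
<commit/>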
Re: Need help to finalize my autocomplete
After re-indexing it works very well ! Thanks a lot ! Vincent -- View this message in context: http://www.nabble.com/Need-help-to-finalize-my-autocomplete-tp25468885p25469931.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Mapping SolrDoc to SolrInputDoc
Hi Licinio, You can use ClientUtils.toSolrInputDocument(...), that converts a SolrDocument to a SolrInputDocument. Martijn 2009/9/16 Licinio Fernández Maurelo licinio.fernan...@gmail.com: Hi there, currently i'm working on a small app which creates an Embedded Solr Server, reads all documents from one core and puts these docs into another one. The purpose of this app is to apply (small) changes on schema.xml to indexed data (offline) resulting a new index with documents updated to schema.xml changes. What i want to know is if there is an easy way to map SolrDoc to SolrInputDoc. Any help would be much appreciated -- Lici -- Met vriendelijke groet, Martijn van Groningen
Re: Mapping SolrDoc to SolrInputDoc
I'll try, thanks Martijn 2009/9/16 Martijn v Groningen martijn.is.h...@gmail.com Hi Licinio, You can use ClientUtils.toSolrInputDocument(...), that converts a SolrDocument to a SolrInputDocument. Martijn 2009/9/16 Licinio Fernández Maurelo licinio.fernan...@gmail.com: Hi there, currently i'm working on a small app which creates an Embedded Solr Server, reads all documents from one core and puts these docs into another one. The purpose of this app is to apply (small) changes on schema.xml to indexed data (offline) resulting a new index with documents updated to schema.xml changes. What i want to know is if there is an easy way to map SolrDoc to SolrInputDoc. Any help would be much appreciated -- Lici -- Met vriendelijke groet, Martijn van Groningen -- Lici
Re: Solr results filtered on MoreLikeThis
Have you had a look at the facet query? Not sure but it might just do what you are looking for. http://wiki.apache.org/solr/SolrFacetingOverview http://wiki.apache.org/solr/SimpleFacetParameters Hi All, Should I create plugin for this or is there some functionality in solr that can help me. I basically already have part of what I want. The search response gives me a result list with (in my situation) 20 results and the attached morelikethis NamedList. Filtering based on the morelikethis 'duplicates' may result in 12 results, meaning my result list is not complete any more, since I requested 20 results. So now I need to do a new search, which I need to filter yet again. And so on and so forth untill I get a result of 20. This is not a very robust implementation. Can I do something like this on the solr side (via plugin)? For instance filter on the lucene hits based on the morelikethis or something like that. So that I can return exactly 20 results. Also adding the morelikethis to the response. Grouping based on the morelikethis whould even be a nice to have using the collapsing field functionality once it is fully implemented in solr. I hope someone can give me some pointers in the right direction. Kind Regards, Marcel -- View this message in context: http://www.nabble.com/Solr-results-filtered-on-MoreLikeThis-tp25434881p25467907.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr results filtered on MoreLikeThis
Hi Chantal, Chantal Ackermann wrote: Have you had a look at the facet query? Not sure but it might just do what you are looking for. http://wiki.apache.org/solr/SolrFacetingOverview http://wiki.apache.org/solr/SimpleFacetParameters I still don't really understand facetting? But It might help me using following trick. When I index a document I check for morelikethis. Then each morelikethis and the indexed element itself will get the references to each other via a relatedIds array field. Then (maybe using facetting) I will filter the result based on the id on its own relatedIds. I don't yet know how to do that, but perhaps you understand how this could be done? Example: document1 - id = 1 - relatedIds = [2,3,4,5] - content = 'some cool java job' document2 - id = 2 - relatedIds = [1,3,4,5] - content = 'another cool java job' document3 - id = 3 - relatedIds = [1,2,4,5] - content = 'yet another cool java job' etc... document6 - id = 6 - relatedIds = [] - content = 'this java article is for you'; document7 - id=7 - relatedIds = [8] - content = 'nice java book' document8 - id=8 - relatedIds = [7] - content = 'java book looks nice' Now when I search, I would like to have following results: - document1 (4 related documents) - document6 - document7 (1 related document) Could you give me an example on how I could get that result, maybe using facets? Kind Regards, Marcel -- View this message in context: http://www.nabble.com/Solr-results-filtered-on-MoreLikeThis-tp25434881p25470762.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Retrieving a field from all result documents & couple of more queries
Thanks, Abhay. Can someone please throw light on how to disable scoring? --shashi On Wed, Sep 16, 2009 at 11:55 AM, abhay kumar abhay...@gmail.com wrote: Hi, 1)Solr has various type of caches . We can specify how many documents cache can have at a time. e.g. if windowsize=50 50 results will be cached in queryResult Cache. if user makes a new request to server for results after 50 documents a new request will be sent to the server server will retrieve next 50 results in the cache. http://wiki.apache.org/solr/SolrCaching Yes, solr looks into the cache to retrieve the fields to be returned. 2) Yes, we can have different tokenizers or filters for index search. We need not create a different fieldtype. We need to configure the same fieldtype (datatype) for index search analyzers sections differently. e.g. fieldType name=textSpell class=solr.TextField positionIncrementGap=100 stored=false multiValued=true *analyzer type=index* tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ !--filter class=solr.SynonymFilterFactory synonyms=Synonyms.txt ignoreCase=true expand=false/-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt/ filter class=solr.StandardFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer * analyzer type=query* tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StandardFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Regards, Abhay On Tue, Sep 15, 2009 at 6:41 PM, Shashikant Kore shashik...@gmail.comwrote: Hi, I am familiar with Lucene and trying out Solr. I have index which was created outside solr. The index is fairly simple with two field - document_id content. The query result needs to return all the document IDs. The result need not be ordered by the score. For this, in Lucene, I use custom hit collector with search to get results quickly. The index has a few million documents and queries returning hundreds of thousands of documents are not uncommon. So, the speed is crucial here. Since retrieving the document_id for each document is slow, I am using FileldCache to store the values of document_id. For all the results collected (in a bitset) with hit collector, document_id field is retrieved from the fieldcache. 1. How can I effectively disable scoring? I have read that ConstantScoreQuery is quite fast, but from the code, I see that it is used only for wildcard queries. How can I use ConstantScoreQuery for all the queries (boolean, term, phrase, ..)? Also, is ConstantScoreQuery as fast as a custom hit collector? 2. How can Solr take advantage of the fieldcache while returning the field document_id? The documentation says, fieldcache can be explicitly auto warmed with Solr. If fieldcache is available and initialized at the beginning, will solr look into the cache to retrieve the fields to be returned? 3. If there is an additional field for stemmed_content on which search needs to use different analyzer, I suppose, that could be specified by fieldType attribute in the schema. Thank you, --shashi
Re: Extract info from parent node during data import (redirect:)
Fergus, Implementing wildcard (//tagname) is definitely possible. I would love to see it working. But if you wish to take a dig at it I shall do whatever I can to help. What is the use case that makes flow though so useful? We do not know to which forEach xpath a given field is associated with. Currently you can clean up the fields using a transformer. There is an implicit field '$forEach' which tells you about the xpath tag for each record that is emitted. The recently added comments in XPathRecordReader are a great help and I was planning to add more. Might this be an issue? I would love to have it. Give a patch and I shall commit it. XPathRecordReader is a blackbox and AFAIK I am the only one who knows it. I would love to have more eyes on that. I would like to open a JIRA for improving XPathRecordReader. Please go ahead. You can paste the contents of this mail in the list . There may be others with similar ideas Noble. -Original Message- Noble /document/category/item | /document/category means there are two paths which triggers a new doc (it is possible to have more). Whenever it encounters the closing tag of that xpath , it emits all the fields it collected since the opening of the same tag. after that it clears all the fields it collected since the opening of the tag. If there are fields it collected before opening of the same tag, it retains it Nice and clear, but that is not what I see. With my test case with forEach=/record | /record/mediaBlock I see that for each /record/mediaBlock document indexed it contains all fields from the parent /record document as well. A search over mediaBlock s returns lots of extra fields from the parent which did not have the commonField attribute. I will try and produce a testcase yes it does . . /record/mediaBlock will have all the fields collected from /record as well. *It is by design** Oh! I had always considered it a bug or at least a limitation. After all if we have the commonField attribute why do we need an automatic flow through of all collected fields from parent nodes. This feature is as far as I can see undocumented and at the same time unintuitive. It also, in my case, causes tons more information to be indexed than is needed. I have spent a while thinking through possible use cases. My use case involves having documents we want to search as a whole and behave as normal. At the same time these documents contain inner sections we wish to treat as sub-documents; in my case I a have pictures with associated captions which I wish to search separately. Having indexed the documents with forEach=/record | /record/mediaBlock my picture search works nicely but I have a nasty side effect when performing searches over the rest of the document. Because fields from the parent node are also present in the children, when I search for any text the same document gets returned many times, once due to the text in the parent node and again for each picture placed in the document. I have a work around for this issue but have always considered it a bug. What is the use case that makes flow though so useful? I had just started playing with the code to see how easy this would be to change. The recently added comments in XPathRecordReader are a great help and I was planning to add more. Might this be an issue? I have noted, while lurking on the solr mail lists, that requests for this type of functionality keep coming up; to be able to restrict searches to a sub section of a document. 
I have really needed this sort of thing many times with the type of stuff I work with. My other planned activity was to see how easy xpaths such as //tagname would be to implement, since my latest data-config.xml looks like:

<field column="para32" name="text" xpath="/record/address/para" flatten="true" />
<field column="para40" name="text" xpath="/record/authoredBy/para" flatten="true" />
<field column="para43" name="text" xpath="/record/dataGroup/address/para" flatten="true" />
<field column="para47" name="text" xpath="/record/dataGroup/keyPersonnel/doubleList/first/para" flatten="true" />
<field column="para49" name="text" xpath="/record/dataGroup/keyPersonnel/doubleList/second/para" flatten="true" />
<field column="para50" name="text" xpath="/record/dataGroup/keyPersonnel/para" flatten="true" />
<field column="para51" name="text" xpath="/record/dataGroup/para" flatten="true" />
<field column="para57" name="text" xpath="/record/doubleList/first/para" flatten="true" />
<field column="para59" name="text" xpath="/record/doubleList/second/para" flatten="true" />
<field column="para63" name="text" xpath="/record/keyPersonnel/doubleList/first/para" flatten="true" />
<field column="para65" name="text" xpath="/record/keyPersonnel/doubleList/second/para" flatten="true" />
<field column="para68" name="text" xpath="/record/list/listItem/para" flatten="true" />
<field column="para75" name="text" xpath="/record/mediaBlock/doubleList/first/para" flatten="true" />
<field column="para77" name="text" xpath="/record/mediaBlock/doubleList/second/para" flatten="true" />
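For readers trying to follow the forEach/commonField discussion, here is a rough sketch of the kind of XPathEntityProcessor entity being talked about. The file name and the title/caption xpaths are placeholders rather than Fergus's actual fields, and the flow-through behaviour of the values not marked commonField is exactly what the thread is debating:

<dataSource type="FileDataSource" />
<entity name="rec" processor="XPathEntityProcessor" url="/path/to/records.xml"
        forEach="/record | /record/mediaBlock">
  <!-- commonField="true": the collected value is shared with the other records
       emitted from the same source document -->
  <field column="title" name="title" xpath="/record/title" commonField="true" />
  <!-- collected only while inside a mediaBlock -->
  <field column="caption" name="caption" xpath="/record/mediaBlock/caption" />
</entity>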
Re: multicore shards and relevancy score
On Tue, Sep 15, 2009 at 8:11 PM, Paul Rosen p...@performantsoftware.com wrote: The second issue was detailed in an email last week, "shards and facet count". The facet information is lost when doing a search over two shards, so if I use multicore, I can no longer have facets. If both cores' schemas are the same and a uniqueKey is specified, then you can do a distributed search between two cores. Facets work fine with distributed search. There may be something wrong with your setup. -- Regards, Shalin Shekhar Mangar.
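For reference, a distributed request of the sort Shalin describes looks roughly like the line below; the host names, core names and the facet field are invented for the example, and the request can be sent to either core as long as both share the same schema and uniqueKey:

http://localhost:8983/solr/core0/select?q=test&shards=localhost:8983/solr/core0,localhost:8984/solr/core1&facet=true&facet.field=genre_s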
DeltaImport problem
I hope this is the correct place to post this issue and, if so, that someone can help. I am using the DIH with Solr 1.3. My data-config.xml file looks like this:

<dataSource driver="net.sourceforge.jtds.jdbc.Driver" url="jdbc:jtds:sqlserver:{taken out for posting}" user="{taken out for posting}" password="{taken out for posting}" />
<entity name="article" pk="CmsArticleId"
  query="Select a.CmsArticleId, a.CreatorId, LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId
    from Category c (nolock)
    inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId
    inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId
    inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId
    where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null"
  deltaQuery="Select a.CmsArticleId, a.CreatorId, LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId
    from Category c (nolock)
    inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId
    inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId
    inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId
    where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null
    and (LastUpdateDate > '${dataimporter.last_index_time}' OR a.CreationDate > '${dataimporter.last_index_time}')" />

I have tried casting the dataimporter.last_index_time and the other date fields, to no avail. My full import works perfectly, but I cannot get command=delta-import to pick up the updated records. The LastUpdateDate is being updated. When I run this in the debug interface with delta-import, it just never calls the delta import. Please, does anyone know what I am doing wrong??? Many thanks Kirsty -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471596.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DeltaImport problem
I vaguely remember there was an issue with delta-import in 1.3. could you try it out with Solr1.4 On Wed, Sep 16, 2009 at 6:14 PM, KirstyS kirst...@gmail.com wrote: I hope this is the correct place to post this issue and if so, that someone can help. I am using the DIH with Solr 1.3 My data-config.xml file looks like this: dataSource driver=net.sourceforge.jtds.jdbc.Driver url=jdbc:jtds:sqlserver:{taken out for posting} user={taken out for posting} password={taken out for posting} / entity name=article pk=CmsArticleId query=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null deltaQuery=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null and (LastUpdateDate '${dataimporter.last_index_time}' OR a.CreationDate '${dataimporter.last_index_time}') Have tried casting the dataimporter.last_index_time and the other date fields. To no avail. My Full Import works perfectly but I cannot get the command=delta-import to pick up the updated records. The LastUpdateDate is being updated. When I run this in the debug interface with delta-import it just never calls the delta import. Please, if anyone knows what I am doing wrong??? Many thanks Kirsty -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471596.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com
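For what it's worth, in the 1.4 nightlies the delta setup is usually written with a separate deltaImportQuery: deltaQuery returns only the primary keys of rows changed since the last run, and deltaImportQuery fetches each of those rows by key. A heavily simplified sketch, with a single-table query standing in for Kirsty's real joins:

<entity name="article" pk="CmsArticleId"
        query="select * from CmsArticle"
        deltaQuery="select CmsArticleId from CmsArticle
                    where LastUpdateDate > '${dataimporter.last_index_time}'
                       or CreationDate > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from CmsArticle
                          where CmsArticleId = '${dataimporter.delta.CmsArticleId}'" />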
Re: DeltaImport problem
I thought 1.4 was not released yet? Noble Paul നോബിള് नोब्ळ्-2 wrote: I vaguely remember there was an issue with delta-import in 1.3. could you try it out with Solr1.4 On Wed, Sep 16, 2009 at 6:14 PM, KirstyS kirst...@gmail.com wrote: I hope this is the correct place to post this issue and if so, that someone can help. I am using the DIH with Solr 1.3 My data-config.xml file looks like this: dataSource driver=net.sourceforge.jtds.jdbc.Driver url=jdbc:jtds:sqlserver:{taken out for posting} user={taken out for posting} password={taken out for posting} / entity name=article pk=CmsArticleId query=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null deltaQuery=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null and (LastUpdateDate '${dataimporter.last_index_time}' OR a.CreationDate '${dataimporter.last_index_time}') Have tried casting the dataimporter.last_index_time and the other date fields. To no avail. My Full Import works perfectly but I cannot get the command=delta-import to pick up the updated records. The LastUpdateDate is being updated. When I run this in the debug interface with delta-import it just never calls the delta import. Please, if anyone knows what I am doing wrong??? Many thanks Kirsty -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471596.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471927.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Retrieving a field from all result documents & couple of more queries
You might be talking about modifying the similarity object to modify scoring formula in Lucene! $searcher-setSimilarity($similarity); $writer-setSimilarity($similarity); This can very well be done in Solr as SolrIndexWriter inherits from Lucene IndexWriter class. You might want to download the Solr Source code and take a look at the SolrIndexWriter to begin with! It's in the package - org.apache.solr.update Thanks Rajan On Wed, Sep 16, 2009 at 5:42 PM, Shashikant Kore shashik...@gmail.comwrote: Thanks, Abhay. Can someone please throw light on how to disable scoring? --shashi On Wed, Sep 16, 2009 at 11:55 AM, abhay kumar abhay...@gmail.com wrote: Hi, 1)Solr has various type of caches . We can specify how many documents cache can have at a time. e.g. if windowsize=50 50 results will be cached in queryResult Cache. if user makes a new request to server for results after 50 documents a new request will be sent to the server server will retrieve next 50 results in the cache. http://wiki.apache.org/solr/SolrCaching Yes, solr looks into the cache to retrieve the fields to be returned. 2) Yes, we can have different tokenizers or filters for index search. We need not create a different fieldtype. We need to configure the same fieldtype (datatype) for index search analyzers sections differently. e.g. fieldType name=textSpell class=solr.TextField positionIncrementGap=100 stored=false multiValued=true *analyzer type=index* tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ !--filter class=solr.SynonymFilterFactory synonyms=Synonyms.txt ignoreCase=true expand=false/-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt/ filter class=solr.StandardFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer * analyzer type=query* tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StandardFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Regards, Abhay On Tue, Sep 15, 2009 at 6:41 PM, Shashikant Kore shashik...@gmail.com wrote: Hi, I am familiar with Lucene and trying out Solr. I have index which was created outside solr. The index is fairly simple with two field - document_id content. The query result needs to return all the document IDs. The result need not be ordered by the score. For this, in Lucene, I use custom hit collector with search to get results quickly. The index has a few million documents and queries returning hundreds of thousands of documents are not uncommon. So, the speed is crucial here. Since retrieving the document_id for each document is slow, I am using FileldCache to store the values of document_id. For all the results collected (in a bitset) with hit collector, document_id field is retrieved from the fieldcache. 1. How can I effectively disable scoring? I have read that ConstantScoreQuery is quite fast, but from the code, I see that it is used only for wildcard queries. How can I use ConstantScoreQuery for all the queries (boolean, term, phrase, ..)? Also, is ConstantScoreQuery as fast as a custom hit collector? 2. How can Solr take advantage of the fieldcache while returning the field document_id? The documentation says, fieldcache can be explicitly auto warmed with Solr. If fieldcache is available and initialized at the beginning, will solr look into the cache to retrieve the fields to be returned? 3. 
If there is an additional field for stemmed_content on which search needs to use different analyzer, I suppose, that could be specified by fieldType attribute in the schema. Thank you, --shashi
Re: DeltaImport problem
yeah, not yet released but going to be released pretty soon On Wed, Sep 16, 2009 at 6:32 PM, KirstyS kirst...@gmail.com wrote: I thought 1.4 was not released yet? Noble Paul നോബിള് नोब्ळ्-2 wrote: I vaguely remember there was an issue with delta-import in 1.3. could you try it out with Solr1.4 On Wed, Sep 16, 2009 at 6:14 PM, KirstyS kirst...@gmail.com wrote: I hope this is the correct place to post this issue and if so, that someone can help. I am using the DIH with Solr 1.3 My data-config.xml file looks like this: dataSource driver=net.sourceforge.jtds.jdbc.Driver url=jdbc:jtds:sqlserver:{taken out for posting} user={taken out for posting} password={taken out for posting} / entity name=article pk=CmsArticleId query=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null deltaQuery=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null and (LastUpdateDate '${dataimporter.last_index_time}' OR a.CreationDate '${dataimporter.last_index_time}') Have tried casting the dataimporter.last_index_time and the other date fields. To no avail. My Full Import works perfectly but I cannot get the command=delta-import to pick up the updated records. The LastUpdateDate is being updated. When I run this in the debug interface with delta-import it just never calls the delta import. Please, if anyone knows what I am doing wrong??? Many thanks Kirsty -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471596.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471927.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: DeltaImport problem
mmm..can't seem to find the link..could you help? Noble Paul നോബിള് नोब्ळ्-2 wrote: yeah, not yet released but going to be released pretty soon On Wed, Sep 16, 2009 at 6:32 PM, KirstyS kirst...@gmail.com wrote: I thought 1.4 was not released yet? Noble Paul നോബിള് नोब्ळ्-2 wrote: I vaguely remember there was an issue with delta-import in 1.3. could you try it out with Solr1.4 On Wed, Sep 16, 2009 at 6:14 PM, KirstyS kirst...@gmail.com wrote: I hope this is the correct place to post this issue and if so, that someone can help. I am using the DIH with Solr 1.3 My data-config.xml file looks like this: dataSource driver=net.sourceforge.jtds.jdbc.Driver url=jdbc:jtds:sqlserver:{taken out for posting} user={taken out for posting} password={taken out for posting} / entity name=article pk=CmsArticleId query=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null deltaQuery=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null and (LastUpdateDate '${dataimporter.last_index_time}' OR a.CreationDate '${dataimporter.last_index_time}') Have tried casting the dataimporter.last_index_time and the other date fields. To no avail. My Full Import works perfectly but I cannot get the command=delta-import to pick up the updated records. The LastUpdateDate is being updated. When I run this in the debug interface with delta-import it just never calls the delta import. Please, if anyone knows what I am doing wrong??? Many thanks Kirsty -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471596.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471927.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25472102.html Sent from the Solr - User mailing list archive at Nabble.com.
When to use Solr over Lucene
Hi All, I am aware that Solr internally uses Lucene for search and indexing. But it would be helpful if anybody could explain the Solr features that are not provided by Lucene. Thanks, Balaji. -- View this message in context: http://www.nabble.com/When-to-use-Solr-over-Lucene-tp25472354p25472354.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Retrieving a field from all result documents & couple of more queries
No, I don't wish to put a custom Similarity. Rather, I want an equivalent of HitCollector where I can bypass the scoring altogether. And I prefer to do it by changing the configuration. --shashi On Wed, Sep 16, 2009 at 6:36 PM, rajan chandi chandi.ra...@gmail.com wrote: You might be talking about modifying the similarity object to modify scoring formula in Lucene! $searcher-setSimilarity($similarity); $writer-setSimilarity($similarity); This can very well be done in Solr as SolrIndexWriter inherits from Lucene IndexWriter class. You might want to download the Solr Source code and take a look at the SolrIndexWriter to begin with! It's in the package - org.apache.solr.update Thanks Rajan On Wed, Sep 16, 2009 at 5:42 PM, Shashikant Kore shashik...@gmail.comwrote: Thanks, Abhay. Can someone please throw light on how to disable scoring? --shashi On Wed, Sep 16, 2009 at 11:55 AM, abhay kumar abhay...@gmail.com wrote: Hi, 1)Solr has various type of caches . We can specify how many documents cache can have at a time. e.g. if windowsize=50 50 results will be cached in queryResult Cache. if user makes a new request to server for results after 50 documents a new request will be sent to the server server will retrieve next 50 results in the cache. http://wiki.apache.org/solr/SolrCaching Yes, solr looks into the cache to retrieve the fields to be returned. 2) Yes, we can have different tokenizers or filters for index search. We need not create a different fieldtype. We need to configure the same fieldtype (datatype) for index search analyzers sections differently. e.g. fieldType name=textSpell class=solr.TextField positionIncrementGap=100 stored=false multiValued=true *analyzer type=index* tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ !--filter class=solr.SynonymFilterFactory synonyms=Synonyms.txt ignoreCase=true expand=false/-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt/ filter class=solr.StandardFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer * analyzer type=query* tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StandardFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Regards, Abhay On Tue, Sep 15, 2009 at 6:41 PM, Shashikant Kore shashik...@gmail.com wrote: Hi, I am familiar with Lucene and trying out Solr. I have index which was created outside solr. The index is fairly simple with two field - document_id content. The query result needs to return all the document IDs. The result need not be ordered by the score. For this, in Lucene, I use custom hit collector with search to get results quickly. The index has a few million documents and queries returning hundreds of thousands of documents are not uncommon. So, the speed is crucial here. Since retrieving the document_id for each document is slow, I am using FileldCache to store the values of document_id. For all the results collected (in a bitset) with hit collector, document_id field is retrieved from the fieldcache. 1. How can I effectively disable scoring? I have read that ConstantScoreQuery is quite fast, but from the code, I see that it is used only for wildcard queries. How can I use ConstantScoreQuery for all the queries (boolean, term, phrase, ..)? Also, is ConstantScoreQuery as fast as a custom hit collector? 2. How can Solr take advantage of the fieldcache while returning the field document_id? 
The documentation says, fieldcache can be explicitly auto warmed with Solr. If fieldcache is available and initialized at the beginning, will solr look into the cache to retrieve the fields to be returned? 3. If there is an additional field for stemmed_content on which search needs to use different analyzer, I suppose, that could be specified by fieldType attribute in the schema. Thank you, --shashi
Re: DeltaImport problem
http://people.apache.org/builds/lucene/solr/nightly/ On Wed, Sep 16, 2009 at 6:42 PM, KirstyS kirst...@gmail.com wrote: mmm..can't seem to find the link..could you help? Noble Paul നോബിള് नोब्ळ्-2 wrote: yeah, not yet released but going to be released pretty soon On Wed, Sep 16, 2009 at 6:32 PM, KirstyS kirst...@gmail.com wrote: I thought 1.4 was not released yet? Noble Paul നോബിള് नोब्ळ्-2 wrote: I vaguely remember there was an issue with delta-import in 1.3. could you try it out with Solr1.4 On Wed, Sep 16, 2009 at 6:14 PM, KirstyS kirst...@gmail.com wrote: I hope this is the correct place to post this issue and if so, that someone can help. I am using the DIH with Solr 1.3 My data-config.xml file looks like this: dataSource driver=net.sourceforge.jtds.jdbc.Driver url=jdbc:jtds:sqlserver:{taken out for posting} user={taken out for posting} password={taken out for posting} / entity name=article pk=CmsArticleId query=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null deltaQuery=Select a.CmsArticleId, a.CreatorId , LastUpdatedBy, LastUpdateDate, Title, Synopsis, Author, Source, IsPublished, ArticleTypeId, a.StrapHead, ShortHeading, HomePageBlurb, ByLine, ArticleStatusId, LiveEdit, a.OriginalCategoryId, aac.Rank, c.CategoryId, AncestralName, CategoryName, CategoryDisplayName, ParentCategoryId, c.SiteId from Category c (nolock) inner join CmsArticleCollection ac (nolock) on c.DefaultCmsArticleCollectionId = ac.CmsArticleCollectionId inner join CmsArticleArticleCollection aac (nolock) on ac.CmsArticleCollectionId = aac.CmsArticleCollectionId inner join CmsArticle a (nolock) on aac.CmsArticleId = a.CmsArticleId where (a.LiveEdit is null or a.LiveEdit = 0) and aac.SourceCmsArticleArticleCollectionId is null and (LastUpdateDate '${dataimporter.last_index_time}' OR a.CreationDate '${dataimporter.last_index_time}') Have tried casting the dataimporter.last_index_time and the other date fields. To no avail. My Full Import works perfectly but I cannot get the command=delta-import to pick up the updated records. The LastUpdateDate is being updated. When I run this in the debug interface with delta-import it just never calls the delta import. Please, if anyone knows what I am doing wrong??? Many thanks Kirsty -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471596.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25471927.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- View this message in context: http://www.nabble.com/DeltaImport-problem-tp25471596p25472102.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: When to use Solr over Lucene
On Sep 16, 2009, at 9:26 AM, balaji.a wrote: Hi All, I am aware that Solr internally uses Lucene for search and indexing. But it would be helpful if anybody could explain the Solr features that are not provided by Lucene.

Solr is a server, Lucene is an API
Faceting
Distributed search
Replication
Easy configuration
You don't want to program much (or do Java)
Index warming
http://lucene.apache.org/solr/features.html

Generally speaking, Solr is what you end up building when you build a Lucene search application, give or take a few features here and there. I've seen a lot of Lucene apps and I'm always amazed how many look pretty much like Solr in terms of infrastructure. I'd use Lucene when you want to have control over every last bit of how things work, or you need something that isn't in Solr (like Span Queries, but even that is doable in Solr with a little work). Thanks, Balaji. -- View this message in context: http://www.nabble.com/When-to-use-Solr-over-Lucene-tp25472354p25472354.html Sent from the Solr - User mailing list archive at Nabble.com. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: When to use Solr over Lucene
Comparing Solr to Lucene is not exactly an apples-to-apples comparison. Solr is a superset of Lucene. It uses the Lucene engine to index and to process requests for data retrieval. Start here first: http://lucene.apache.org/solr/features.html#Solr+Uses+the+Lucene+Search+Library+and+Extends+it! It would be unfair to compare the Apache webserver to a CGI scripting interface; the Apache webserver is just the container through which the web browser interacts with the CGI scripts. This is very similar to how Solr is related to Lucene. On Wed, Sep 16, 2009 at 9:26 AM, balaji.a reachbalaj...@gmail.com wrote: Hi All, I am aware that Solr internally uses Lucene for search and indexing. But it would be helpful if anybody could explain the Solr features that are not provided by Lucene. Thanks, Balaji. -- View this message in context: http://www.nabble.com/When-to-use-Solr-over-Lucene-tp25472354p25472354.html Sent from the Solr - User mailing list archive at Nabble.com. -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
Re: When to use Solr over Lucene
balaji.a wrote: Hi All, I am aware that Solr internally uses Lucene for search and indexing. But it would be helpful if anybody could explain the Solr features that are not provided by Lucene. Thanks, Balaji. Any advanced Lucene application generally goes down the same path: Build a system to manage IndexReaders and IndexWriters and concurrency and index view refreshing. This is very hard for beginners to get right - though many have tried. Figure out how you want to manage (or not) some kind of schema. Write something that *basically* does the job. Write something so that non-Java programmers can set up the schema so you don't have to. Add niceties on top, like support for efficient autocomplete and spellchecking and faceting and plugins. Figure out a scheme to replicate and distribute indexes so that you can scale. Add support for other APIs: REST, Perl, whatever else your crazy superiors are pulling from your crazy coworkers. Add support for parsing rich documents, like PDFs, MS Word, and dozens of other formats. Do it in a short time with a small team. Spend a lot of time fixing bugs and whacking at performance issues. Get most of it wrong, because you will the first time you do this. If you're lucky: get a lot of it right too and feel great about your large complicated system as you hurry to fix all of its many imperfections - and then spend lots of time keeping up with the latest changes, features, and improvements added to Lucene. Or sit on the old features frozen in time. You won't have done it all either - it's too much work to do it all well in a reasonable amount of time for a dev team that is not actually supposed to be building a search server. You will cut stuff, you will skimp on stuff, and you will make tradeoffs left and right. I've gone down that path - I started *just* before Solr got rolling in '06. Lots of people have gone down that path or are on that path. Solr does all of that for you, and it does it well. Many of those that work on Lucene work on Solr. New Lucene features automatically go into Solr. Solr will be maintained and developed by a team of people that are not you, while your homegrown system (which does only 60% of what Solr does and does it worse) will likely cobweb over 95% of the code. I love developing with Lucene, and I bet you will too - but most people should be using Solr. Certain target applications can still benefit from using Lucene. Some Lucene features don't move to Solr for a while. If you want near real-time, that's only in Lucene right now. If you want everything done per segment, that's just Lucene right now (Solr still does some things not per segment). There are other little pros as well. It's a tradeoff that, for the general guy looking for search, heavily favors using Solr. -- - Mark http://www.lucidimagination.com
Re: Retrieving a field from all result docuemnts couple of more queries
You will need to get SolrIndexSearcher.java and modify the following: public static final int GET_SCORES = 0x01; --Rajan
On Wed, Sep 16, 2009 at 6:58 PM, Shashikant Kore shashik...@gmail.com wrote: No, I don't wish to put in a custom Similarity. Rather, I want an equivalent of HitCollector where I can bypass the scoring altogether. And I would prefer to do it by changing the configuration. --shashi
On Wed, Sep 16, 2009 at 6:36 PM, rajan chandi chandi.ra...@gmail.com wrote: You might be talking about modifying the Similarity object to change the scoring formula in Lucene! $searcher->setSimilarity($similarity); $writer->setSimilarity($similarity); This can very well be done in Solr, as SolrIndexWriter inherits from the Lucene IndexWriter class. You might want to download the Solr source code and take a look at SolrIndexWriter to begin with. It's in the package org.apache.solr.update. Thanks Rajan
On Wed, Sep 16, 2009 at 5:42 PM, Shashikant Kore shashik...@gmail.com wrote: Thanks, Abhay. Can someone please throw light on how to disable scoring? --shashi
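On the Lucene side of the original question, one common way to keep per-document scoring work to a minimum (independent of any Solr configuration change) is to wrap the user query in a ConstantScoreQuery via a QueryWrapperFilter and collect matching doc ids with a collector that simply ignores the score. The sketch below is written against the Lucene 2.4-era HitCollector API that this thread assumes; it is an illustration, not the SolrIndexSearcher modification suggested above.
---
import java.util.BitSet;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;

public class ConstantScoreSearch {
    // Collect matching doc ids without caring about scores.
    public static BitSet collectIds(IndexSearcher searcher, Query userQuery) throws Exception {
        // Wrapping an arbitrary query in a filter plus ConstantScoreQuery gives every
        // hit the same score, so no tf/idf work is done per matching document.
        Query constant = new ConstantScoreQuery(new QueryWrapperFilter(userQuery));

        final BitSet bits = new BitSet(searcher.maxDoc());
        searcher.search(constant, new HitCollector() {
            public void collect(int doc, float score) {
                bits.set(doc); // the (constant) score is simply ignored
            }
        });
        return bits;
    }
}
---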
Re: When to use Solr over Lucene
Also, Solr simplifies the process of implementing the client-side interface. You can use the same indexes with clients written in virtually any programming language of your choosing; if you were to work directly with Lucene, that would not be the case.
On Wed, Sep 16, 2009 at 9:49 AM, Israel Ekpo israele...@gmail.com wrote: Comparing Solr to Lucene is not exactly an apples-to-apples comparison. Solr is a superset of Lucene. ...
-- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
Re: Disabling tf (term frequency) during indexing and/or scoring
Hi Aaron, You can override the default Lucene Similarity and disable the tf and lengthNorm factors in the scoring formula (see http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html and http://lucene.apache.org/java/2_4_1/api/index.html ). You need to:
1) Compile the following class and put it into Solr WEB-INF/classes
---
package my.pkg;

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {
  public float lengthNorm(String fieldName, int numTerms) {
    return numTerms > 0 ? 1.0f : 0.0f;
  }

  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f;
  }
}
---
2) Add <similarity class="my.pkg.NoLengthNormAndTfSimilarity"/> into your schema.xml http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca
HTH, Alex
On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee ucbmc...@gmail.com wrote: Hello, Let me preface this by admitting that I'm still fairly new to Lucene and Solr, so I apologize if any of this sounds naive, and I'm open to thinking about my problem differently. I'm currently responsible for a rather large dataset of business records that I'm trying to build a Lucene/Solr infrastructure around, to replace an in-house solution that we've been using for a few years. These records are sourced from multiple providers and there's often a fair bit of overlap in the business coverage. I have a set of fuzzy correlation libraries that I use to identify these documents and I ultimately create a super-record that includes metadata from each of the providers. Given the nature of things, these providers often have slight variations in wording or spelling in the overlapping fields (it's amazing how many ways people find to refer to the same business or address). I'd like to capture these variations, as they facilitate searching, but TF considerations are currently borking field scoring here. For example, taking business names into consideration, I have a Solr schema similar to:
<field name="name_provider1" type="string" indexed="false" stored="false" multiValued="true"/>
...
<field name="name_providerN" type="string" indexed="false" stored="false" multiValued="true"/>
<field name="nameNorm" type="text" indexed="true" stored="false" multiValued="true" omitNorms="true"/>
<copyField source="name_provider1" dest="nameNorm"/>
...
<copyField source="name_providerN" dest="nameNorm"/>
For any given business record, there may be 1..N business names present in the nameNorm field (some with naming variations, some identical). With TF enabled, however, I'm getting different match scores on this field simply based on how many providers contributed to the record, which is not meaningful to me. For example, a record containing <nameNorm>foo bar<positionIncrementGap>foo bar</nameNorm> is necessarily scoring higher than a record just containing <nameNorm>foo bar</nameNorm>. Although I wouldn't mind TF data being considered within each discrete field value, I need to find a way to prevent score inflation based simply on the number of contributing providers. Looking at the mailing list archive and searching around, it sounds like the omitTf boolean in Lucene used to function somewhat in this manner, but has since taken on a broader interpretation (and name) that now also disables positional and payload data. Unfortunately, phrase support for fields like this is absolutely essential. So what's the best way to address a need like this?
I guess I don't mind whether this is handled at index time or search time, but I'm not sure what I may need to override or if there's some existing provision I should take advantage of. Thank you for any help you may have. Best regards, Aaron
Re: Disabling tf (term frequency) during indexing and/or scoring
Just FYI - you can put Solr plugins in solr-home/lib as JAR files rather than messing with solr.war. Erik
On Sep 16, 2009, at 10:15 AM, Alexey Serba wrote: Hi Aaron, You can override the default Lucene Similarity and disable the tf and lengthNorm factors in the scoring formula ...
Any way to encrypt/decrypt stored fields?
For security reasons (say I'm indexing very sensitive data, medical records for example) is there a way to encrypt data that is stored in Solr? Some businesses I've encountered have such needs and this is a barrier to them adopting Solr to replace other legacy systems. Would it require a custom-written filter to encrypt during indexing and decrypt at query time, or is there something I'm unaware of already available to do this? -Jay
Re: CSV Update - Need help mapping csv field to schema's ID
Thanks guys... Yonik and Grant commented on this thread in the dev group. Dan
Chris Hostetter wrote:
: I would like to add an additional name:value pair for every line, mapping the
: sku field to my schema's id field:
:
: .map={sku.field}:{id}
the map param is for replacing a *value* with a different value ... it's useful for things like numeric codes in CSV files that you want to replace with strings in your index.
: I would prefer NOT to change the schema by adding a
: <copyField source="sku" dest="id"/>.
that's the only solution i can think of unless you want to write an UpdateProcessor.
-Hoss
Re: Any way to encrypt/decrypt stored fields?
That's certainly something that is doable with a filter. I am not aware of any available. Bill On Wed, Sep 16, 2009 at 10:39 AM, Jay Hill jayallenh...@gmail.com wrote: For security reasons (say I'm indexing very sensitive data, medical records for example) is there a way to encrypt data that is stored in Solr? Some businesses I've encountered have such needs and this is a barrier to them adopting Solr to replace other legacy systems. Would it require a custom-written filter to encrypt during indexing and decrypt at query time, or is there something I'm unaware of already available to do this? -Jay
Re: Any way to encrypt/decrypt stored fields?
This could be achieved purely client-side if all you're talking about is a stored field (not indexed/searchable). The client-side could encrypt and encode the encrypted bits as text that Solr/Lucene can store. Then decrypt client-side. Erik On Sep 16, 2009, at 10:39 AM, Jay Hill wrote: For security reasons (say I'm indexing very sensitive data, medical records for example) is there a way to encrypt data that is stored in Solr? Some businesses I've encountered have such needs and this is a barrier to them adopting Solr to replace other legacy systems. Would it require a custom-written filter to encrypt during indexing and decrypt at query time, or is there something I'm unaware of already available to do this? -Jay
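As a rough illustration of the client-side approach described above: encrypt the sensitive value, Base64-encode it so it can live in a stored-only (not indexed) Solr field, and reverse the process after retrieval. The key handling, the cipher choice (the JDK default AES mode used here is the simplest, not the strongest) and the field contents are all placeholder assumptions for the sketch, not anything prescribed by Solr.
---
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class FieldCrypto {
    private final SecretKey key;

    public FieldCrypto(SecretKey key) { this.key = key; }

    // Encrypt and Base64-encode a value before sending it to Solr as a stored-only field.
    public String encrypt(String plainText) throws Exception {
        Cipher c = Cipher.getInstance("AES"); // JDK default mode; pick a stronger mode/IV scheme in practice
        c.init(Cipher.ENCRYPT_MODE, key);
        return Base64.getEncoder().encodeToString(c.doFinal(plainText.getBytes("UTF-8")));
    }

    // Decode and decrypt the stored value after it comes back in a search result.
    public String decrypt(String storedValue) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return new String(c.doFinal(Base64.getDecoder().decode(storedValue)), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        FieldCrypto crypto = new FieldCrypto(key);
        String stored = crypto.encrypt("patient record: ...");
        System.out.println("value stored in Solr: " + stored);
        System.out.println("value after retrieval: " + crypto.decrypt(stored));
    }
}
---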
Re: do NOT want to stem plurals for a particular field, or words
You can enable/disable stemming per field type in the schema.xml by removing the stemming filters from the type definition. Basically, copy your preferred type, rename it to something like 'text_nostem', remove the stemming filter from the type, and use 'text_nostem' as your field's type. Plus, you can search in both fields, text_stemmed and text_exact, using the DisMax handler and boost text_exact matching. Thus if you search for 'articles' you'll get all results with 'articles' and 'article', but exact matches will be on top.
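A possible SolrJ sketch of the DisMax side of that advice follows; the field names text_stemmed and text_exact and the boost of 2.0 are placeholders taken from the suggestion above, and depending on the Solr version the handler may instead be selected with qt=dismax rather than defType=dismax.
---
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DismaxBoostExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("articles");
        query.set("defType", "dismax");
        // Search both the stemmed and the exact field, weighting exact matches higher,
        // so 'articles' still matches 'article' but exact hits rank first.
        query.set("qf", "text_stemmed text_exact^2.0");

        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}
---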
Re: faceted query not working as i expected
Thank you Ahmet. I forgot to encapsulate the searched string in quotations.
On Sep 15, 2009, at 5:19 PM, AHMET ARSLAN wrote: --- On Tue, 9/15/09, Jonathan Vanasco jvana...@2xlp.com wrote: From: Jonathan Vanasco jvana...@2xlp.com Subject: faceted query not working as i expected To: solr-user@lucene.apache.org Date: Tuesday, September 15, 2009, 10:54 PM
I'm trying to request documents that have facet.venue_type as Private Collection. Instead I'm also getting items where another field is marked Permanent Collection. My schema has:
<fields>
<field name="venue_type" type="text" indexed="true" stored="true" required="false" />
<field name="facet.venue_type" type="string" indexed="true" stored="true" required="false" />
</fields>
<copyField source="venue_type" dest="facet.venue_type" />
My query is q=*:* qt=standard facet=true facet.missing=true facet.field=facet.venue_type fq=venue_type:Private+Collection
Can anyone offer a suggestion as to what I'm doing wrong?
The filter query fq=venue_type:Private+Collection has a part that runs on the default field. It is parsed to venue_type:Private defaultField:Collection. You can use fq=venue_type:"Private+Collection" or fq=venue_type:(Private AND Collection) instead. These will/may bring documents having something like Private Collection in the venue_type field, since it is a tokenized field. If you want to retrieve documents that have facet.venue_type as Private Collection you can use fq=facet.venue_type:"Private Collection", which operates on a string (non-tokenized) field. Hope this helps.
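For reference, the suggested fix expressed through SolrJ might look roughly like this; the field names come from the thread, the core URL is a local example, and the important detail is that the phrase is quoted and filtered against the non-tokenized string field.
---
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetFilterExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.set("facet.missing", "true");
        query.addFacetField("facet.venue_type");
        // Quote the phrase and filter on the non-tokenized string field.
        query.addFilterQuery("facet.venue_type:\"Private Collection\"");

        QueryResponse rsp = server.query(query);
        System.out.println("matches: " + rsp.getResults().getNumFound());
    }
}
---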
Highlighting in stemmed or n-grammed fields possible?
Hi, Does anybody know how to get the highlighted field when the q term matches in a stemmed or n-grammed filtered field? When matching in a normal field (not stemmed or n-grammed), highlighting works perfectly as expected. But in stemmed matching cases no highlighting fields are returned, and in n-gram matching cases the highlighting field is returned but in a bad order (example: if q="solr" matches "here is solr", the result is "<em>here</em> is solr"). All fields are stored (and indexed as well...). Thanks in advance.
Re: FileListEntityProcessor and LineEntityProcessor
Hi, I'm trying to import data from a list of files using the FileListEntityProcessor. Here is my import configuration:
<dataSource type="FileDataSource" name="fileDataSource"/>
<document name="dict-entries">
  <entity name="f" processor="FileListEntityProcessor" baseDir="d:\my\directory\" fileName=".*WRK" recursive="false" rootEntity="false">
    <entity name="jc" processor="LineEntityProcessor" url="${f.fileAbsolutePath}" dataSource="fileDataSource" transformer="myTransformer">
    </entity>
  </entity>
</document>
If I have only one file in d:\my\directory\ then everything works correctly. If I have multiple files then I get the following exception:
Sorry but I don't quite follow this. FileListEntityProcessor and LineEntityProcessor are somewhat similar in that they provide a list of filenames which the likes of XPathEntityProcessor then open and parse. Is the above your complete data-config.xml? Can you provide more detail on what you are trying to do? ... You seem to be listing all files d:\my\directory\.*WRK. Do these WRK files contain lists of files to be indexed?
Sep 16, 2009 9:48:46 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: f document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Problem reading from input Processing Document # 53812
at org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEntityProcessor.java:112)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:348)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:376)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:224)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:316)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:376)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:355)
Caused by: java.io.IOException: Stream closed
at java.io.BufferedReader.ensureOpen(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEntityProcessor.java:109)
...
8 more Sep 16, 2009 9:48:46 AM org.apache.solr.handler.dataimport.DataImporter doFullIm port SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: Problem reading f rom input Processing Document # 53812 at org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn tityProcessor.java:112) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent ityProcessorWrapper.java:237) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde r.java:348) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde r.java:376) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j ava:224) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java :167) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo rter.java:316) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j ava:376) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.ja va:355) Caused by: java.io.IOException: Stream closed at java.io.BufferedReader.ensureOpen(Unknown Source) at java.io.BufferedReader.readLine(Unknown Source) at java.io.BufferedReader.readLine(Unknown Source) at org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn tityProcessor.java:109) ... 8 more Note that my input files have 53812 lines, which is the same as the document number that I'm choking on. Does anyone know what I'm doing wrong? Thanks, Wojtek -- View this message in context: http://www.nabble.com/FileListEntityProcessor-and-LineEntityProcessor-tp25476443p25476443.html Sent from the Solr - User mailing list archive at Nabble.com. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: FileListEntityProcessor and LineEntityProcessor
Fergus McMenemie-2 wrote: Can you provide more detail on what you are trying to do? ... You seem to listing all files d:\my\directory\.*WRK. Do these WRK files contain lists of files to be indexed? That is my complete data config file. I have a directory containing a bunch of files that have one entity per line. Each line contains blocks of data. I parse out each block and process it appropriately using myTransformer. Is this use of FileListEntityProcessor with LineEntityProcessor not supported? -- View this message in context: http://www.nabble.com/FileListEntityProcessor-and-LineEntityProcessor-tp25476443p25477613.html Sent from the Solr - User mailing list archive at Nabble.com.
Effect of SynonymFilter on Solr document fields
Hi, I am a newbie to Solr and request you all to kindly excuse any rookie mistakes. I have the following questions: We use the Synonym Filter on one of the indexed fields. It is specified in the schema.xml as one of the filters (for the analyzer type index) for that field. I believe that this means any tokens which match an entry in the provided synonym file will have all the forms indexed provided expanded=true. I am able to verify that by using the Solr admin analysis tool. However when I use Luke to examine a document in the index which would have synonyms for that particular field, I see only the original value and do not see the additional forms that should be added due to the synonym match for the field in question. I am not sure if I am missing something here. How do I verify the same? Another related question The field in question here is not specified as multivalued. However, as I understand it a synonym match will mean multiple values for that field. I was not able to find any documentation that explains this in detail and would like to know how this particular case impacts the indexing of that field, scoring, etc. How does the behavior of a field having multiple values due to SynonymFilter compare and contrast with the multivalued=true|false flag. What would a synonym match expansion for a field with multivalued=false mean? Prasanna.
Re: FileListEntityProcessor and LineEntityProcessor
Note that if I change my import file to explicitly list all my files (instead of using the FileListEntityProcessor) as below then everything works as I expect.
<dataSource type="FileDataSource" name="fileDataSource" basePath="d:\my\directory\"/>
<document name="dict-entries">
  <entity name="jc" processor="LineEntityProcessor" url="file1.WRK" dataSource="fileDataSource" transformer="myTransformer"></entity>
  <entity name="jc" processor="LineEntityProcessor" url="file2.WRK" dataSource="fileDataSource" transformer="myTransformer"></entity>
  <entity name="jc" processor="LineEntityProcessor" url="file3.WRK" dataSource="fileDataSource" transformer="myTransformer"></entity>
  ...
</document>
-- View this message in context: http://www.nabble.com/FileListEntityProcessor-and-LineEntityProcessor-tp25476443p25480830.html Sent from the Solr - User mailing list archive at Nabble.com.
Latest trunk locks execution thread in SolrCore.getSearcher()
Hi, I am testing EmbeddedSolrServer vs StreamingUpdateSolrServer for my crawlers using more or less recent Solr code and everything was fine till today when I took the latest trunk code. When I start my crawler I see a number of INFO outputs 2009-09-16 21:08:29,399 INFO Adding component:org.apache.solr.handler.component.highlightcompon...@36ae83 (SearchHandler.java:132) - [main] 2009-09-16 21:08:29,400 INFO Adding component:org.apache.solr.handler.component.statscompon...@1fb24d3 (SearchHandler.java:132) - [main] 2009-09-16 21:08:29,401 INFO Adding component:org.apache.solr.handler.component.termvectorcompon...@14ba9a2 (SearchHandler.java:132) - [main] 2009-09-16 21:08:29,402 INFO Adding debug component:org.apache.solr.handler.component.debugcompon...@12ea1dd (SearchHandler.java:137) - [main] and then the log/program stops. The thread dump reveals the following: main prio=3 tid=0x0003 nid=0x2 in Object.wait() [0xfe67c000..0xfe67fd80] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xeaaf6b10 (a java.lang.Object) at java.lang.Object.wait(Object.java:485) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:991) - locked 0xeaaf6b10 (a java.lang.Object) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:904) at org.apache.solr.handler.ReplicationHandler.getIndexVersion(ReplicationHa ndler.java:472) at org.apache.solr.handler.ReplicationHandler.getStatistics(ReplicationHand ler.java:490) at org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean.getMBeanInfo(JmxMo nitoredMap.java:224) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getNewMBeanClassNa me(DefaultMBeanServerInterceptor.java:321) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(Defa ultMBeanServerInterceptor.java:307) at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java :482) at org.apache.solr.core.JmxMonitoredMap.put(JmxMonitoredMap.java:137) at org.apache.solr.core.JmxMonitoredMap.put(JmxMonitoredMap.java:47) at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:4 46) at org.apache.solr.core.SolrCore.init(SolrCore.java:578) at harvard.solr.search.service.EmbeddedSearchService.setSolrHome(EmbeddedSe archService.java:47) The same is happening for the StreamingUpdateSolrServer. Do you think it's a bug? Thank you for looking into it, -Olga
Re: Latest trunk locks execution thread in SolrCore.getSearcher()
On a quick look, it looks like this was caused (or at least triggered by) https://issues.apache.org/jira/browse/SOLR-1427 Registering the bean in the SolrCore constructor causes it to immediately turn around and ask for the stats which asks for a searcher, which blocks. -Yonik http://www.lucidimagination.com On Wed, Sep 16, 2009 at 9:34 PM, Dadasheva, Olga olga_dadash...@harvard.edu wrote: Hi, I am testing EmbeddedSolrServer vs StreamingUpdateSolrServer for my crawlers using more or less recent Solr code and everything was fine till today when I took the latest trunk code. When I start my crawler I see a number of INFO outputs 2009-09-16 21:08:29,399 INFO Adding component:org.apache.solr.handler.component.highlightcompon...@36ae83 (SearchHandler.java:132) - [main] 2009-09-16 21:08:29,400 INFO Adding component:org.apache.solr.handler.component.statscompon...@1fb24d3 (SearchHandler.java:132) - [main] 2009-09-16 21:08:29,401 INFO Adding component:org.apache.solr.handler.component.termvectorcompon...@14ba9a2 (SearchHandler.java:132) - [main] 2009-09-16 21:08:29,402 INFO Adding debug component:org.apache.solr.handler.component.debugcompon...@12ea1dd (SearchHandler.java:137) - [main] and then the log/program stops. The thread dump reveals the following: main prio=3 tid=0x0003 nid=0x2 in Object.wait() [0xfe67c000..0xfe67fd80] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xeaaf6b10 (a java.lang.Object) at java.lang.Object.wait(Object.java:485) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:991) - locked 0xeaaf6b10 (a java.lang.Object) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:904) at org.apache.solr.handler.ReplicationHandler.getIndexVersion(ReplicationHa ndler.java:472) at org.apache.solr.handler.ReplicationHandler.getStatistics(ReplicationHand ler.java:490) at org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean.getMBeanInfo(JmxMo nitoredMap.java:224) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getNewMBeanClassNa me(DefaultMBeanServerInterceptor.java:321) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(Defa ultMBeanServerInterceptor.java:307) at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java :482) at org.apache.solr.core.JmxMonitoredMap.put(JmxMonitoredMap.java:137) at org.apache.solr.core.JmxMonitoredMap.put(JmxMonitoredMap.java:47) at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:4 46) at org.apache.solr.core.SolrCore.init(SolrCore.java:578) at harvard.solr.search.service.EmbeddedSearchService.setSolrHome(EmbeddedSe archService.java:47) The same is happening for the StreamingUpdateSolrServer. Do you think it's a bug? Thank you for looking into it, -Olga
Re: FileListEntityProcessor and LineEntityProcessor
I have opened an issue SOLR-1440 On Thu, Sep 17, 2009 at 2:46 AM, wojtekpia wojte...@hotmail.com wrote: Note that if I change my import file to explicitly list all my files (instead of using the FileListEntityProcessor) as below then everything works as I expect. dataSource type=FileDataSource name=fileDataSource basePath=d:\my\directory\/ document name=dict-entries entity name=jc processor=LineEntityProcessor url=file1.WRK dataSource=fileDataSource transformer=myTransformer/entity entity name=jc processor=LineEntityProcessor url=file2.WRK dataSource=fileDataSource transformer=myTransformer/entity entity name=jc processor=LineEntityProcessor url=file3.WRK dataSource=fileDataSource transformer=myTransformer/entity ... /document -- View this message in context: http://www.nabble.com/FileListEntityProcessor-and-LineEntityProcessor-tp25476443p25480830.html Sent from the Solr - User mailing list archive at Nabble.com. -- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: [DIH] URLDataSource and fetching a link
2009/9/17 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com: it is possible to have a sub-entity which uses XPathEntityProcessor and which can use the link as the url
This may not be a good solution. But you can use the $hasMore and $nextUrl options of XPathEntityProcessor to recursively loop if there are more links.
On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll gsing...@apache.org wrote: Many RSS feeds contain a link to some full article. How can I have the DIH get the RSS feed and then have it go and fetch the content at the link? Thanks, Grant
-- - Noble Paul | Principal Engineer| AOL | http://aol.com
-- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Questions on copyField
Shalin, Can you please elaborate a little more on the third response: *You can send the location's value directly as the value of the text field.* I don't follow. I am adding 'name' and 'age' to the 'text' field through the schema. If I add the 'location' from the program, will either one copy (schema or program) not overwrite the other?
*Also note that you don't really need to index/store the source field. You can make the location field's type as ignored in the schema.* Understood. Thank you for your response. Regards Rahul
On Wed, Sep 16, 2009 at 1:56 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Mon, Sep 14, 2009 at 5:12 PM, Rahul R rahul.s...@gmail.com wrote: Hello, I have a few questions regarding the copyField directive in schema.xml
1. Does the destination field store a reference or the actual data?
It makes a copy. Storing or indexing of the field depends on the field configuration.
If I have something like this <copyField source="name" dest="text"/> then will the values in the 'name' field get copied into the 'text' field, or will the 'text' field only store a reference to the 'name' field? To put it more simply, if I later delete the 'name' field from the index, will I lose the corresponding data in the 'text' field?
The values will get copied. If you delete all values of the 'name' field from the index, the data in the text field remains as-is.
2. Is there any inbuilt API which I can use to do the copyField action programmatically?
No. But you can always copy explicitly before sending, or you can use a custom UpdateRequestProcessor to copy values from one field to another during indexing.
3. Can I do a copyField from the schema as well as programmatically for the same destination field? Suppose I want the 'text' field to contain values for name, age and location. In my index only 'name' and 'age' are defined as fields. So I can add directives like <copyField source="name" dest="text"/> <copyField source="age" dest="text"/> The location, however, I want to add to the 'text' field programmatically. I don't want to store the location as a separate field in the index. Can I do this?
You can send the location's value directly as the value of the text field. Also note that you don't really need to index/store the source field. You can make the location field's type as ignored in the schema.
-- Regards, Shalin Shekhar Mangar.
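The mention of a custom UpdateRequestProcessor above is easy to miss, so here is a rough sketch of one that copies a 'location' value into the 'text' field during indexing. The class name and field names are invented for the example, the packages match the Solr 1.4-era codebase discussed in these threads (later releases moved SolrQueryResponse to org.apache.solr.response), and the factory would still need to be registered in an updateRequestProcessorChain in solrconfig.xml.
---
import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Copies the value of "location" into "text" while the document is being indexed,
// so the client does not have to send the combined field itself.
public class CopyLocationProcessorFactory extends UpdateRequestProcessorFactory {
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object location = doc.getFieldValue("location");
        if (location != null) {
          doc.addField("text", location);  // append to the catch-all field
        }
        super.processAdd(cmd);             // hand off to the rest of the chain
      }
    };
  }
}
---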