Re: Problems to clustering on tomcat
Claudio, it sounds like the word "Cluster" there is adding confusion. ClusteringComponent has to do with search-results clustering. What you seem to be after is the creation of a Solr cluster. You'll find good pointers here: http://search-lucene.com/?q=master+slave&fc_project=Solr&fc_type=wiki

Perhaps this is the best place to start: http://wiki.apache.org/solr/SolrReplication

Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Claudio Devecchi cdevec...@gmail.com To: solr-user@lucene.apache.org Sent: Mon, August 9, 2010 7:07:54 PM Subject: Problems to clustering on tomcat

Hi everybody, I need to do some tests in my Solr installation. Previously I configured my application on a single node, and now I need to run some tests on a cluster configuration. I followed the steps on http://wiki.apache.org/solr/ClusteringComponent and when I start up the example system everything is OK, but when I try to run it on Tomcat I receive the error below; does somebody have an idea?

SEVERE: Could not start SOLR. Check solr/home property org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.clustering.ClusteringComponent'

-- Claudio Devecchi flickr.com/cdevecchi
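For the master/slave setup the SolrReplication wiki page describes, the heart of it is a ReplicationHandler entry in solrconfig.xml on each node. The snippet below is an illustrative sketch, not a drop-in config: the host name and poll interval are placeholders to adapt.

```xml
<!-- On the master (solrconfig.xml): replicate after each commit and
     also ship config files to the slaves. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave: point masterUrl at the master's replication handler
     and poll on an interval (hh:mm:ss). -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```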
Re: solr query result not read the latest xml file
hi everyone, I do these steps every time a new xml file is created (for example cat_978.xml has just been created):
1. delete the index (<delete><query>AUC_CAT:978</query></delete>)
2. commit the new cat_978.xml (java -jar post.jar cat_978.xml)
3. restart the java process (stop, then java -jar start.jar)
If I don't do those steps then the query result shown in the browser still uses the old value (cat_978.xml - no changes at all) instead of reading the new cat_978.xml. What I want to ask: is there a way so I don't need to restart the java process, since it consumes too much resources and time?

You don't need to delete the old document. Solr replaces it automatically, assuming they have the same uniqueKey. Probably HTTP caching is causing you problems when testing with a browser. You can disable it in the solrconfig.xml file: <httpCaching never304="true"/>
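The reply above boils down to two plain HTTP posts -- no delete, no restart. This is a hedged sketch: the URL assumes the stock example install, and the commands are echoed rather than executed.

```shell
# Assumed update URL for the stock example install -- adjust host/port/core.
SOLR_UPDATE="http://localhost:8983/solr/update"
POST="curl $SOLR_UPDATE -H 'Content-Type: text/xml' --data-binary"

# Re-posting a doc whose uniqueKey already exists replaces the old doc;
# the commit then makes it visible to searches. Remove 'echo' to run.
echo "$POST @cat_978.xml"
echo "$POST '<commit/>'"
```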
Re: how to support implicit trailing wildcards
you could satisfy this by making 2 fields: 1. exactmatch 2. wildcardmatch. Use copyField in your schema to copy 1 -> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR

this would score exact matches above (solely) wildcard matches. Geert-Jan

2010/8/10 yandong yao yydz...@gmail.com

Hi Bastian, Sorry for not making it clear: I also want an exact match to have a higher score than a wildcard match. That means if searching for 'mount', documents with 'mount' should have a higher score than documents with 'mountain', while 'mount*' seems to treat 'mount' and 'mountain' the same. Besides, I also want the query to be processed with an analyzer, while from http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F , "Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer." The rationale is that if I search 'mounted', I also want documents with 'mount' to match. So it seems built-in wildcard search cannot satisfy my requirements, if I understand correctly. Thanks very much!

2010/8/9 Bastian Spitzer bspit...@magix.net

Wildcard search is already built in, just use: ?q=umoun* ?q=mounta*

-Ursprüngliche Nachricht- Von: yandong yao [mailto:yydz...@gmail.com] Gesendet: Montag, 9. August 2010 15:57 An: solr-user@lucene.apache.org Betreff: how to support implicit trailing wildcards

Hi everyone, how to support an 'implicit trailing wildcard *' using Solr? E.g., using Google to search 'umoun', 'umount' will be matched; search 'mounta', 'mountain' will be matched. From my point of view, there are several ways, each with disadvantages:

1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed as 'u', 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size increases dramatically, b) it will match even where there is no relationship, e.g. 'mount' will match 'mountain' also.

2) Using two-pass searching: the first pass searches the term dictionary through TermsComponent using the given keyword, then uses the first matched term from the term dictionary to search again. E.g., when a user enters 'umoun', TermsComponent will match 'umount', then 'umount' is used to search. The disadvantages are: a) I need to parse the query string so that I can recognize meta keywords such as 'AND', 'OR', '+', '-' (this is more complex as I am using a PHP client), b) the returned hit count is not for the original search string, which will influence other components such as an auto-suggest component based on user search history and hit counts.

3) Write a custom SearchComponent, but I have no idea where/how to start.

Is there any other way in Solr to do this? Any feedback/suggestions are welcome! Thanks very much in advance!
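In schema.xml, the two-field setup Geert-Jan describes might look roughly like this (field and type names here are invented for illustration, not from the thread):

```xml
<!-- Both fields analyzed with the same "text" type; queries hit
     exactmatch with the bare term and wildcardmatch with term*. -->
<field name="exactmatch"    type="text" indexed="true" stored="false"/>
<field name="wildcardmatch" type="text" indexed="true" stored="false"/>
<copyField source="exactmatch" dest="wildcardmatch"/>
```

A query would then look like q=exactmatch:mount+wildcardmatch:mount*&q.op=OR, so documents matching the exact term score above wildcard-only matches.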
Re: solr query result not read the latest xml file
I already set in my solrconfig.xml as you told me: <httpCaching never304="false"/> and then I commit the xml and it's still not working, the query result still shows the old data :( do you have any suggestion? Eben -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068647.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
I already set in my solrconfig.xml as you told me: <httpCaching never304="false"/> and then I commit the xml and it's still not working, the query result still shows the old data :( do you have any suggestion?

Shouldn't it be never304="true"? You wrote never304="false". Additionally, can't you try with something else than a browser: curl, wget etc.
AW: solr query result not read the latest xml file
make sure you send a <commit/> after add/delete to make the changes visible.

-Ursprüngliche Nachricht- Von: e8en [mailto:e...@tokobagus.com] Gesendet: Dienstag, 10. August 2010 10:04 An: solr-user@lucene.apache.org Betreff: Re: solr query result not read the latest xml file

I already set in my solrconfig.xml as you told me: <httpCaching never304="false"/> and then I commit the xml and it's still not working, the query result still shows the old data :( do you have any suggestion? Eben -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068647.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Delta import where last_modified
Hi all. I have set up my data-config with a MySQL database. The problem I am having is that MySQL doesn't execute the deltaQuery: the "where last_modified" clause is not executed and throws an error, unknown column 'last_modified' in where clause. Shouldn't this be treated as part of the deltaQuery instead of as a column in the table? Am I missing any configuration? I highly appreciate any feedback about this. Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Delta-import-where-last-modified-tp1068743p1068743.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
yes I tried with both values, never304="true" and never304="false", and none of them makes it work. what are curl and wget? I use the Mozilla Firefox browser. I'm really a newbie in the programming world, especially Solr -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068751.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: AW: solr query result not read the latest xml file
hi Bastian, how do I send a <commit/>? is it by typing: java -jar post.jar cat_978.xml? if yes then I've already done that. any solution please? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068782.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
yes I try with both value, never304="true" and never304="false" and none of them make it works

It must be <httpCaching never304="true"/>, so let's forget about never304="false". But when you change something in solrconfig.xml you need to restart jetty/tomcat. java -jar post.jar *.xml does a <commit/> by default at the end.

what is curl and wget?

They are command-line tools.

I use mozilla firefox browser I'm really newbie in programming world especially solr

Maybe you can configure firefox to disable caches.
AW: AW: solr query result not read the latest xml file
you can check the admin panel to see if there are pending deletes/commits in the statistics section. older versions of post.jar don't auto-commit the changes, so if your xml doesn't contain a <commit/> you could just create a commit.xml containing only the following: <commit/> and send it via post.jar. you can also curl it or whatever you like:

curl http://hostname:port/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'

-Ursprüngliche Nachricht- Von: e8en [mailto:e...@tokobagus.com] Gesendet: Dienstag, 10. August 2010 10:22 An: solr-user@lucene.apache.org Betreff: Re: AW: solr query result not read the latest xml file

hi Bastian, how do I send a <commit/>? is it by typing: java -jar post.jar cat_978.xml? if yes then I've already done that. any solution please? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068782.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
finally I found out the cause of my problem. yes, you don't need to delete the index and restart tomcat just to get the query result updated, you just need to commit the xml files.

I made a custom url as per a requirement from my client:
default url -- http://localhost/solr/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on
my custom url -- http://localhost/search/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on

I made the custom url by copying solr.war and renaming it to search.war, so in the webapps folder there are two war files. this is the cause of my problem: when I use the default url there is no problem at all, but when I use my custom url I have to delete, commit, and restart tomcat to make the query result correct.

the question has now changed :) how do I make search.war behave exactly the same as solr.war? maybe when I start tomcat I should add some parameter so it will include/point to search.war, not solr.war anymore? when I removed solr.war so there is only one war file in the webapps folder, which is search.war, I can't do a commit; it said 'FATAL: Solr returned an error: Not Found'. it is because the app is looking for solr.war, not search.war -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1070189.html Sent from the Solr - User mailing list archive at Nabble.com.
delete Problem..
Hallo Users... I have a problem deleting some indexed items. I tried it with:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>EMAIL_HEADER_FROM:test.de</query></delete>"

but nothing happens. EMAIL_HEADER_FROM is a String and in the past it always worked, but now I can't delete it. I can delete a single mail when I try to delete only one, like this:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>4b829265.7010...@test.de.20100803133543</query></delete>"
Re: Process entire result set
Thanks Jonathan! We decided to create offline results and store them in a non-SQL store (HBase), so we can answer the requests by selecting one of the offline generated results. These offline results are generated every day. Thanks! Eloi

On Thu, Aug 5, 2010 at 8:59 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

Eloi Rocha wrote: Hi everybody, I would like to know if it makes sense to use Solr in the following scenario: - search for large amounts of data (like 1000, 1, 10 registers) - each register contains four or five fields (strings and integers) - every time the entire result set is requested (I can paginate the results). It would be much better to get all results at once [...]

Depends on what kind of searching you're doing. Are you doing searching that needs an indexer like Solr? Then Solr is a good tool for your job. Are you not, and you can do what you want just as easily in an RDBMS or non-SQL store like MongoDB? Then I wouldn't use Solr. Assuming you really do need Solr, I think this should work, but I would not store the actual stored fields in Solr; I'd store those fields in an external store (key-value store, RDBMS, whatever). You store only what you need to index in Solr, you do your search, you get IDs back. You ask for the entire result set back, why not. If you give Solr enough RAM, and set your cache settings appropriately (really big document and related caches), then I _think_ it should perform okay. One way to find out. What you'd get back is just IDs, then you'd look up each ID in your external store to get the actual fields you want to operate on. _May_ not be necessary, maybe you could do it with Solr stored fields, but making Solr do only exactly what you really need from it (an index) will maximize its ability to do what you need in available RAM. If you don't need Solr/Lucene indexing/faceting behavior, and you can do just fine with an RDBMS or non-SQL store, use that.
Jonathan -- Eloi Rocha Neto Melon Tech - http://melontech.com.br +55 83 8868-7025
Re: Indexing fieldvalues with dashes and spaces
Hi, Try solr.KeywordTokenizerFactory. However, in your case it looks as if you have certain requirements for searching that require tokenization. So you should leave the WhitespaceTokenizer as is and create a separate field specially for the faceting, with indexed="true", stored="false" and type="string". I often create a dynamic field for such, e.g. <dynamicField name="*_facet" .../> and then do a copyField. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 9. aug. 2010, at 09.54, PeterKerk wrote:

Hi Erick, Ok, it's more clear now. I indeed have the whitespace tokenizer:

<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
  </analyzer>
</fieldType>

What happens is that I have a field value 'Beach Sea', which is a theme for a location. Because of the whitespace tokenizer it gets split up into 2 facet values, "Beach",2 and "Sea",2 (see below). Of course those individual facet names are NOT correct facet names, because it should be "Beach Sea". But if I REMOVE the whitespace tokenizer, it throws an error that a fieldType should always have a tokenizer. Which tokenizer would I need in order to get the correct facet name?

(I've been checking this page btw: http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html)

facet_counts:{
  facet_queries:{},
  facet_fields:{
    themes:["Gemeentehuis",2, "Beach",2, "Sea",2],
    province:["gelderland",1, "utrecht",1, "zuidholland",1],
    services:["exclusiev",2, "fotoreportag",2, "hur",2, "liv",1, "muziek",1]},
  facet_dates:{}}

-- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-fieldvalues-with-dashes-and-spaces-tp1023699p1052554.html Sent from the Solr - User mailing list archive at Nabble.com.
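Jan's suggestion above could be sketched like this in schema.xml: keep the tokenized field for search and copy it into an untokenized string field used only for faceting. The `_facet` naming is just a convention; the field names are taken from the thread but the exact wiring is illustrative.

```xml
<!-- Untokenized copy for faceting; the original "themes" field keeps its
     WhitespaceTokenizer for searching. -->
<dynamicField name="*_facet" type="string" indexed="true" stored="false"/>
<copyField source="themes" dest="themes_facet"/>
```

Then facet with facet.field=themes_facet, so "Beach Sea" stays one facet value instead of splitting into "Beach" and "Sea".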
Re: DIH: Rows fetch OK, Total Documents Failed??
Do you have any required fields or a uniqueKey in your schema.xml? Do you provide values for all these fields? AFAIU you don't need the commonField attribute for the id and title fields. I don't think that's your problem but anyway...

On Sat, Jul 31, 2010 at 11:29 AM, scr...@asia.com wrote:

Hi, I'm a bit lost with this. I'm trying to import a new XML via DIH; all rows are fetched but no documents are indexed? I don't find any log or error? Any ideas? Here is the STATUS:

<str name="command">status</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">7554</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2010-07-31 10:14:33</str>
  <str name="Total Documents Processed">0</str>
  <str name="Total Documents Failed">7554</str>
  <str name="Time taken ">0:0:4.720</str>
</lst>

My xml file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <title>Moniteur VG1930wm 19 LCD Viewsonic</title>
    <url>http://x.com/abc?a(12073231)p(2822679)prod(89042332277)ttid(5)url(http%3A%2F%2Fwww.ffdsssd.com%2Fproductinformation%2F%7E66297%7E%2Fproduct.htm%26sender%3D2003)</url>
    <content>Moniteur VG1930wm 19 LCD Viewsonic VG1930WM</content>
    <price>247.57</price>
    <category>Ecrans</category>
  </product>
  etc...
</products>

and my dataconfig:

<dataConfig>
  <dataSource type="URLDataSource"/>
  <document>
    <entity name="products" url="file:///home/john/Desktop/src.xml" processor="XPathEntityProcessor" forEach="/products/product" transformer="DateFormatTransformer">
      <field column="id" xpath="/products/product/url" commonField="true"/>
      <field column="title" xpath="/products/product/title" commonField="true"/>
      <field column="category" xpath="/products/product/category"/>
      <field column="content" xpath="/products/product/content"/>
      <field column="price" xpath="/products/product/price"/>
    </entity>
  </document>
</dataConfig>
Re: how to support implicit trailing wildcards
Hi, You don't need to duplicate the content into two fields to achieve this. Try this: q=mount OR mount*

The exact match will always get a higher score than the wildcard match because wildcard matches use constant score. Making this work for multi-term queries is a bit trickier, but something along these lines: q=(mount OR mount*) AND (everest OR everest*)

-- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:

you could satisfy this by making 2 fields: 1. exactmatch 2. wildcardmatch. Use copyField in your schema to copy 1 -> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR

this would score exact matches above (solely) wildcard matches. Geert-Jan

2010/8/10 yandong yao yydz...@gmail.com

Hi Bastian, Sorry for not making it clear: I also want an exact match to have a higher score than a wildcard match. That means if searching for 'mount', documents with 'mount' should have a higher score than documents with 'mountain', while 'mount*' seems to treat 'mount' and 'mountain' the same. Besides, I also want the query to be processed with an analyzer, while from http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F , "Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer." The rationale is that if I search 'mounted', I also want documents with 'mount' to match. So it seems built-in wildcard search cannot satisfy my requirements, if I understand correctly. Thanks very much!

2010/8/9 Bastian Spitzer bspit...@magix.net

Wildcard search is already built in, just use: ?q=umoun* ?q=mounta*

-Ursprüngliche Nachricht- Von: yandong yao [mailto:yydz...@gmail.com] Gesendet: Montag, 9. August 2010 15:57 An: solr-user@lucene.apache.org Betreff: how to support implicit trailing wildcards

Hi everyone, how to support an 'implicit trailing wildcard *' using Solr? E.g., using Google to search 'umoun', 'umount' will be matched; search 'mounta', 'mountain' will be matched. From my point of view, there are several ways, each with disadvantages:

1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed as 'u', 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size increases dramatically, b) it will match even where there is no relationship, e.g. 'mount' will match 'mountain' also.

2) Using two-pass searching: the first pass searches the term dictionary through TermsComponent using the given keyword, then uses the first matched term from the term dictionary to search again. E.g., when a user enters 'umoun', TermsComponent will match 'umount', then 'umount' is used to search. The disadvantages are: a) I need to parse the query string so that I can recognize meta keywords such as 'AND', 'OR', '+', '-' (this is more complex as I am using a PHP client), b) the returned hit count is not for the original search string, which will influence other components such as an auto-suggest component based on user search history and hit counts.

3) Write a custom SearchComponent, but I have no idea where/how to start.

Is there any other way in Solr to do this? Any feedback/suggestions are welcome! Thanks very much in advance!
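Jan's multi-term variant can be generated mechanically. Here is a small hedged shell sketch (not from the thread) that expands each whitespace-separated user term into a "(term OR term*)" clause so exact matches outrank wildcard-only matches:

```shell
# Expand "mount everest" into "(mount OR mount*) AND (everest OR everest*)".
expand_query() {
  out=""
  for t in $1; do            # rely on default word splitting
    out="${out:+$out AND }($t OR $t*)"
  done
  printf '%s\n' "$out"
}

expand_query "mount everest"
# prints: (mount OR mount*) AND (everest OR everest*)
```

The expanded string would then go into the q parameter; remember to URL-encode it before sending.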
Re: Facet Fields - ID vs. Display Value
If your concern is performance, faceting integers versus faceting strings, I believe Lucene makes the difference negligible. Given that choice I'd go with the string. Now, if you need to keep an association between id and string, you may want to facet a combined field "id:string" (or with some other delimiter), then parse it on display. That way you can still use the id if you need to hit a database or some other external source. If you don't ever need to reference the ID, I wouldn't even put it in the index. -- View this message in context: http://lucene.472066.n3.nabble.com/Facet-Fields-ID-vs-Display-Value-tp1062754p1072067.html Sent from the Solr - User mailing list archive at Nabble.com.
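The "parse it on display" step for the combined-field idea above is a one-liner; here is a hedged sketch (the value and delimiter are invented for illustration) using plain POSIX parameter expansion:

```shell
# A combined facet value of the form "id:label"; split it on the first colon.
val="42:Beach Sea"
id=${val%%:*}      # everything before the first colon
label=${val#*:}    # everything after the first colon
echo "$id / $label"
# prints: 42 / Beach Sea
```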
Re: solr query result not read the latest xml file
Hi, Beware that post.jar is just an example tool to play with the default example index located at the /solr/ namespace. It is very limited and you should look elsewhere for a more production-ready and robust tool. However, it has the ability to specify a custom url. Please try: java -jar post.jar -help

SimplePostTool: version 1.2
This is a simple command line tool for POSTing raw XML to a Solr port. XML data can be read from files specified as commandline args; as raw commandline arg strings; or via STDIN.
Examples:
java -Ddata=files -jar post.jar *.xml
java -Ddata=args -jar post.jar '<delete><id>42</id></delete>'
java -Ddata=stdin -jar post.jar < hd.xml
Other options controlled by System Properties include the Solr URL to POST to, and whether a commit should be executed. These are the defaults for all System Properties:
-Ddata=files
-Durl=http://localhost:8983/solr/update
-Dcommit=yes

Thus for your index, try: java -Durl=http://localhost:80/search/update -jar post.jar myfile.xml

-- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 10. aug. 2010, at 12.10, e8en wrote:

finally I found out the cause of my problem. yes, you don't need to delete the index and restart tomcat just to get the query result updated, you just need to commit the xml files. I made a custom url as per a requirement from my client: default url -- http://localhost/solr/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on my custom url -- http://localhost/search/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on I made the custom url by copying solr.war and renaming it to search.war, so in the webapps folder there are two war files. this is the cause of my problem: when I use the default url there is no problem at all, but when I use my custom url I have to delete, commit, and restart tomcat to make the query result correct. the question has now changed :) how do I make search.war behave exactly the same as solr.war? maybe when I start tomcat I should add some parameter so it will include/point to search.war, not solr.war anymore? when I removed solr.war so there is only one war file in the webapps folder, which is search.war, I can't do a commit; it said 'FATAL: Solr returned an error: Not Found'. it is because the app is looking for solr.war, not search.war -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1070189.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: hl.usePhraseHighlighter
Thanks so much for your help! It works. I really appreciate it.

-Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Monday, August 09, 2010 6:05 PM To: solr-user@lucene.apache.org Subject: RE: hl.usePhraseHighlighter

I used text type and found the following in schema.xml. I don't know which ones I should remove. ***

You should remove <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> from both index and query time.
Re: delete Problem..
Hi, Since EMAIL_HEADER_FROM is a String type, you need to specify the whole field value every time. Wildcards could also work, but you'll get a problem with leading wildcards. The solution would be to change the fieldType into a text type using e.g. StandardTokenizerFactory - if this does not break other functionality you need on that field. Then it would support searching on part of the field. You should make this a phrase search to avoid ambiguities. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 10. aug. 2010, at 12.29, Jörg Agatz wrote:

Hallo Users... I have a problem deleting some indexed items. I tried it with:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>EMAIL_HEADER_FROM:test.de</query></delete>"

but nothing happens. EMAIL_HEADER_FROM is a String and in the past it always worked, but now I can't delete it. I can delete a single mail when I try to delete only one, like this:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>4b829265.7010...@test.de.20100803133543</query></delete>"
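Jan's "phrase search" advice for a delete-by-query could look like the following hedged sketch; the update URL is an assumption for the stock install, and the commands are echoed rather than executed:

```shell
# For a string field, delete by the exact stored value, quoted as a phrase;
# follow with a commit so the deletion becomes visible.
DELETE_XML='<delete><query>EMAIL_HEADER_FROM:"test.de"</query></delete>'
echo "curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '$DELETE_XML'"
echo "curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'"
```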
Re: delete Problem..
I'd try 2 things. First do a query q=EMAIL_HEADER_FROM:test.de and make sure some documents are found. If nothing is found, there is nothing to delete. Second, how are you testing to see if the document is deleted? The physical data isn't removed from the index until you Optimize I believe. Is it possible your delete is working, but your method of verifying isn't telling you it's marked for deletion? -- View this message in context: http://lucene.472066.n3.nabble.com/delete-Problem-tp1070347p1072581.html Sent from the Solr - User mailing list archive at Nabble.com.
Improve Query Time For Large Index
Hi, I have 5 million small documents/tweets (= ~3GB) and the slave index replicates itself from the master every 10-15 minutes, so the index is optimized before querying. We are using Solr 1.4.1 (patched with SOLR-1624) via SolrJ. Now the search speed is slow: 2s for common terms which hit more than 2 million docs, and acceptable for others: 0.5s. For those numbers I don't use highlighting or facets. I am using the following schema [1] and from the luke handler I know that numTerms =~ 20 million. The query for common terms stays slow if I retry again and again (no cache improvements). How can I improve the query time for the common terms without using Distributed Search [2]? Regards, Peter.

[1]
<field name="id" type="tlong" indexed="true" stored="true" required="true"/>
<field name="date" type="tdate" indexed="true" stored="true"/>
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

[2] http://wiki.apache.org/solr/DistributedSearch
Re: Implementing lookups while importing data
We are currently doing this via a JOIN on the numeric field, between the main data table and the lookup table, but this dramatically slows down indexing.

I believe a SQL JOIN is the fastest and easiest way in your case (in comparison with a nested entity, even using CachedSqlEntityProcessor). You probably don't have proper indexes in your database - check the SQL query plan.
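As an illustration of doing the lookup in SQL rather than with a nested entity, a DIH entity along these lines issues one joined query per import. The table and column names here are invented; the point is the single JOIN in the entity's query attribute:

```xml
<!-- One entity, one joined SELECT -- no per-row nested entity queries. -->
<entity name="item"
        query="SELECT i.id, i.title, l.label AS category_name
               FROM items i JOIN lookup l ON l.code = i.category_code">
  <field column="category_name" name="category_name"/>
</entity>
```

If this is still slow, an index on the join column (lookup.code in this sketch) is the usual fix, which is what checking the query plan would reveal.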
PDF file
I have a lot of pdf files. I am trying to import the pdf files into Solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. The Solr 1.4 Enterprise Search Server book uses the following command to import mccm.pdf:

curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@mccm.pdf

Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help!
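Importing a whole directory can be done with a shell loop around the same extract handler. This is a hedged sketch: the URL and the literal.id parameter are assumptions to adapt to your install, and 'echo' prints each command instead of running it (remove it to actually POST):

```shell
# Print one curl command per PDF in a directory, using the file name as id.
post_pdfs() {
  dir=$1
  url='http://localhost:8983/solr/update/extract'
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    echo curl "$url?literal.id=$(basename "$f")&commit=false" -F "file=@$f"
  done
}
```

Usage would be post_pdfs /path/to/pdfs, followed by a single commit at the end instead of committing per file.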
RE: Improve Query Time For Large Index
Hi Peter, A few more details about your setup would help list members to answer your questions. How large is your index? How much memory is on the machine and how much is allocated to the JVM? Besides the Solr caches, Solr and Lucene depend on the operating system's disk caching for caching of postings lists, so you need to leave some memory for the OS. On the other hand, if you are optimizing and refreshing every 10-15 minutes, that will invalidate all the caches, since an optimized index is essentially a set of new files. Can you give us some examples of the slow queries? Are you using stop words? If your slow queries are phrase queries, then you might try either adding the most frequent terms in your index to the stopwords list, or try CommonGrams and add them to the common-words list. (Details on CommonGrams here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2) Tom Burton-West

-Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 9:54 AM To: solr-user@lucene.apache.org Subject: Improve Query Time For Large Index

Hi, I have 5 million small documents/tweets (= ~3GB) and the slave index replicates itself from the master every 10-15 minutes, so the index is optimized before querying. We are using Solr 1.4.1 (patched with SOLR-1624) via SolrJ. Now the search speed is slow: 2s for common terms which hit more than 2 million docs, and acceptable for others: 0.5s. For those numbers I don't use highlighting or facets. I am using the following schema [1] and from the luke handler I know that numTerms =~ 20 million. The query for common terms stays slow if I retry again and again (no cache improvements). How can I improve the query time for the common terms without using Distributed Search [2]? Regards, Peter.

[1]
<field name="id" type="tlong" indexed="true" stored="true" required="true"/>
<field name="date" type="tdate" indexed="true" stored="true"/>
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

[2] http://wiki.apache.org/solr/DistributedSearch
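For reference, a CommonGrams setup along the lines Tom describes might look roughly like this in schema.xml. This is a sketch with an invented type name and words file; the query-time side uses the companion CommonGramsQueryFilterFactory so that phrases containing very frequent terms are matched as single tokens:

```xml
<fieldType name="textCommonGrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "commonwords.txt" would list your most frequent terms -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```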
Re: DIH and multivariable fields problems
Have others successfully imported dynamic multivalued fields in a child entity using the DataImportHandler via the child entity returning multiple records through a RDBMS? Yes, it's working ok with static fields. I didn't even know that it's possible to use variables in field names ( dynamic names ) in DIH configuration. This use case is quite unusual. This is increasingly more looking like a bug. To recap, I am trying to use the DIH to import multivalued dynamic fields and using a variable to name that field. I'm not an expert in DIH source code but it seems there's special processing of dynamic fields that prevents handling field type (and multivalued attribute). Specifically there's conditional jump (continue) over field type detection code in case of dynamic field name ( see DataImporter:initEntity ). I guess the reason of such behavior is that you can't determine field type based on dynamic field name (${variable}_s) at that time (configuration parsing). I'm wondering if it's possible to determine field types at runtime (when actual field title_s name is resolved). I encountered similar problem with implicit sql_column - solr_field mapping using SqlEntityProcessor, i.e. when you select some columns and do not explicitly list all these columns as fields entries in your configuration. In this case field type detection doesn't work either. I think that moving type detection process into runtime would solve that problem also. Am i missing something obvious that prevents us from doing field type detection at runtime? Alex On Tue, Aug 10, 2010 at 4:20 AM, harrysmith harrysmith...@gmail.com wrote: This is increasingly more looking like a bug. To recap, I am trying to use the DIH to import multivalued dynamic fields and using a variable to name that field. Upon further testing, the multivalued import works fine with a static/constant name, but only keeps the first record when naming the field dynamically. See below for relevant snips. 
From schema.xml:

<dynamicField name="*_s" type="string" indexed="true" stored="true" multiValued="true"/>

From data-config.xml:

<entity name="terms" query="select distinct CORE_DESC_TERM from metadata where item_id=${item.DIVID_PK}">
  <entity name="metadata" query="select * from metadata where item_id=${item.DIVID_PK} AND core_desc_term='${terms.CORE_DESC_TERM}'">
    <field name="metadata_record_s" column="TEXT_VALUE" />
  </entity>
</entity>

Produces the following; note that the 3 records that should be returned are correctly returned when the field name is a constant:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">9892962</str>
    <arr name="metadata_record_s">
      <str>record 1</str>
      <str>record 2</str>
      <str>record 3</str>
      <str>Polygraph Newsletter Title</str>
    </arr>
    <arr name="title">
      <str>Polygraph Newsletter Title</str>
    </arr>
  </doc>
</result>

=== Now, changing the field name to a variable; note only the first record is retained for the 'Relation_s' field, when there should be 3 records.

<field name="metadata_record_s" column="TEXT_VALUE" />

becomes

<field name="${terms.CORE_DESC_TERM}_s" column="TEXT_VALUE" />

and produces the following:

<result name="response" numFound="1" start="0">
  <doc>
    <arr name="Relation_s">
      <str>record 1</str>
    </arr>
    <arr name="Title_s">
      <str>Polygraph Newsletter Title</str>
    </arr>
    <str name="id">9892962</str>
    <arr name="title">
      <str>Polygraph Newsletter Title</str>
    </arr>
  </doc>
</result>

Only the first record is retained. There was also another post (which received no replies) in the archive that reported the same issue. The DIH debug logs do show 3 records correctly being returned, so somehow these are not getting added. -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1065244.html Sent from the Solr - User mailing list archive at Nabble.com.
Need help with facets
Hi guys, I have a solr index whose documents have the following fields: FirstName, LastName, RecruitedDate. I update the index when any of the three fields change for that specific person. I need to get facets based on when someone was recruited. The facets are: Recruited within 1 month, Recruited within 3 months, ... So if 10 people were recruited within the past month then the count for "Recruited within 1 month" will be 10. Is there a way to calculate the facets from RecruitedDate? Or will I have to create another field (let's say RecruitedDateFacet) and store the text in there? My problem is that if I use a separate field for faceting and store a string in it, then if that person's information wasn't updated for a month he would still fall in that category (since no delta query was run). Please advise on what is the best way to accomplish this. Thanks in advance, Moazzam
Re: delete Problem..
Are you running a commit command after every delete command? I had the same problem with updates. I wasn't committing my updates. - Moazzam Khan http://moazzam-khan.com On Tue, Aug 10, 2010 at 8:52 AM, kenf_nc ken.fos...@realestate.com wrote: I'd try 2 things. First do a query q=EMAIL_HEADER_FROM:test.de and make sure some documents are found. If nothing is found, there is nothing to delete. Second, how are you testing to see if the document is deleted? The physical data isn't removed from the index until you Optimize I believe. Is it possible your delete is working, but your method of verifying isn't telling you it's marked for deletion? -- View this message in context: http://lucene.472066.n3.nabble.com/delete-Problem-tp1070347p1072581.html Sent from the Solr - User mailing list archive at Nabble.com.
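The commit step matters because deletes (and adds) stay invisible to searchers until a commit. A minimal sketch of the two XML update bodies involved; the endpoint URL below is illustrative (the default example server), not taken from this thread, and in practice you would POST these bodies with any HTTP client:

```python
# Sketch of the delete-then-commit sequence described above.
# UPDATE_URL is an assumption (default Solr example endpoint).
UPDATE_URL = "http://localhost:8983/solr/update"

def delete_by_query(query: str) -> str:
    """Build the XML body that marks matching docs as deleted."""
    return f"<delete><query>{query}</query></delete>"

def commit() -> str:
    """Build the XML body that makes pending deletes/adds visible."""
    return "<commit/>"

# The two payloads to POST, in order, to UPDATE_URL:
payloads = [delete_by_query("EMAIL_HEADER_FROM:test.de"), commit()]
print(payloads[0])
print(payloads[1])
```

Sending the delete without the commit is exactly the symptom in the reply above: the delete is accepted but nothing appears to change.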
Re: Need help with facets
I have a solr index whose documents have the following fields: FirstName LastName RecruitedDate I update the index when any of the three fields change for that specific person. I need to get facets based on when someone was recruited. The facets are: Recruited within 1 month, Recruited within 3 months, ... So if 10 people were recruited within the past month then the count for "Recruited within 1 month" will be 10. Is there a way to calculate the facets from RecruitedDate? It is possible with facet.query; something like: q=*:*&facet=on&facet.query=RecruitedDate:[NOW-1MONTH TO NOW]&facet.query=RecruitedDate:[NOW-3MONTHS TO NOW]
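For readers assembling this request programmatically, the parameters can be URL-encoded, and repeating facet.query yields one count per range. This sketch only builds the query string from the reply above; nothing here talks to a server:

```python
from urllib.parse import urlencode

# A list of tuples (not a dict) so facet.query can repeat.
params = [
    ("q", "*:*"),
    ("facet", "on"),
    ("facet.query", "RecruitedDate:[NOW-1MONTH TO NOW]"),
    ("facet.query", "RecruitedDate:[NOW-3MONTHS TO NOW]"),
]
query_string = urlencode(params)
print(query_string)
```

Each facet.query comes back in the response under facet_counts/facet_queries with its own document count.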
How to compile nightly build?
I am attempting to follow the instructions located at: http://wiki.apache.org/solr/ExtractingRequestHandler#Getting_Started_with_the_Solr_Example I have downloaded the most recent clean build from Hudson. After running 'ant example' I get the following error:

C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29>ant example
Buildfile: C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\build.xml
init-forrest-entities:
compile-lucene:
BUILD FAILED
C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\common-build.xml:214: C:\solr_build\modules\analysis\common does not exist.
Total time: 0 seconds

What is the correct procedure? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-compile-nightly-build-tp1077115p1077115.html Sent from the Solr - User mailing list archive at Nabble.com.
Do we need index analyzer for query elevation component
Hello, In order to use query elevation we define a type. Do we really need an index-time analyzer for the query elevation type? Let's say we have some documents already indexed and I added only the query-time analyzer; it looks like solr reads the words in elevate.xml and maps the words to the respective documents. In that case why would we need index-time analyzers, unless I am missing something. Please let me know.

<fieldType name="elevateKeywordsType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

darniz -- View this message in context: http://lucene.472066.n3.nabble.com/Do-we-need-index-analyzer-for-query-elevation-component-tp1077130p1077130.html Sent from the Solr - User mailing list archive at Nabble.com.
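As a rough illustration of the behavior described (elevate.xml maps query words to document ids and forces those documents to the top, with the query-time analyzer normalizing the incoming query text), here is a sketch. The elevation map and result ids are made up, and real elevation is done by Solr's QueryElevationComponent, not client code; the lower-casing step plays the role of the query-time analyzer, which is why matching elevate.xml entries doesn't depend on how documents were analyzed at index time:

```python
# Hypothetical elevate.xml contents: query text -> doc ids to force to the top.
elevations = {"ipod": ["doc3", "doc1"]}

def elevate(query: str, results: list[str]) -> list[str]:
    """Move elevated ids (in their configured order) ahead of organic results."""
    boosted = elevations.get(query.lower(), [])  # lower-casing ~ query analyzer
    rest = [r for r in results if r not in boosted]
    return [b for b in boosted if b in results] + rest

print(elevate("iPod", ["doc1", "doc2", "doc3", "doc4"]))
```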
RE: PDF file
Does anyone have any experience with PDF files? I really appreciate your help! Thanks so much in advance. -Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Tuesday, August 10, 2010 10:37 AM To: 'solr-user@lucene.apache.org' Subject: PDF file I have a lot of pdf files. I am trying to import pdf files to solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. The Solr 1.4 Enterprise Search Server book uses the following command to import mccm.pdf: curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@mccm.pdf Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help!
Re: Improve Query Time For Large Index
Hi Tom, my index is around 3GB large and I am using 2GB RAM for the JVM, although some more is available. If I look into the RAM usage while a slow query runs (via jvisualvm) I see that only 750MB of the JVM RAM is used.

Can you give us some examples of the slow queries?

For example the empty query solr/select?q= takes very long, or solr/select?q=http where 'http' is the most common term.

Are you using stop words?

Yes, a lot. I stored them in stopwords.txt.

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

This looks interesting. I read through https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4. I only need to enable it via:

<filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>

right? Do I need to reindex? Regards, Peter.

Hi Peter, A few more details about your setup would help list members to answer your questions. How large is your index? How much memory is on the machine and how much is allocated to the JVM? Besides the Solr caches, Solr and Lucene depend on the operating system's disk caching for caching of postings lists, so you need to leave some memory for the OS. On the other hand, if you are optimizing and refreshing every 10-15 minutes, that will invalidate all the caches, since an optimized index is essentially a set of new files. Can you give us some examples of the slow queries? Are you using stop words? If your slow queries are phrase queries, then you might try either adding the most frequent terms in your index to the stopwords list, or try CommonGrams and add them to the common words list.
(Details on CommonGrams here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2) Tom Burton-West

-Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 9:54 AM To: solr-user@lucene.apache.org Subject: Improve Query Time For Large Index

Hi, I have 5 million small documents/tweets (= ~3GB) and the slave index replicates itself from the master every 10-15 minutes, so the index is optimized before querying. We are using solr 1.4.1 (patched with SOLR-1624) via SolrJ. Now the search speed is slow: 2s for common terms which hit more than 2 million docs, and acceptable for others: 0.5s. For those numbers I don't use highlighting or facets. I am using the following schema [1], and from the luke handler I know that numTerms is ~20 million. The query for common terms stays slow if I retry again and again (no cache improvements). How can I improve the query time for the common terms without using Distributed Search [2]? Regards, Peter.

[1]
<field name="id" type="tlong" indexed="true" stored="true" required="true"/>
<field name="date" type="tdate" indexed="true" stored="true"/>
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

[2] http://wiki.apache.org/solr/DistributedSearch -- http://karussell.wordpress.com/
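As a rough sketch of what CommonGrams does: each common word is glued to its neighbor into a single bigram token, so a phrase query over frequent terms can match one comparatively rare bigram instead of intersecting two huge postings lists. The common-words set below is a made-up stand-in for stopwords.txt entries, and the real filter also interleaves the bigrams at the correct token positions, which this sketch omits:

```python
# Stand-in for stopwords.txt entries; not a real common-words list.
COMMON = {"the", "of", "http"}

def common_grams(tokens):
    """Emit the original tokens plus a word_word bigram whenever either side is common."""
    out = list(tokens)
    for a, b in zip(tokens, tokens[1:]):
        if a in COMMON or b in COMMON:
            out.append(f"{a}_{b}")
    return out

print(common_grams(["rights", "of", "man"]))
```

Because the bigram tokens only exist if they were written at index time, switching to CommonGrams does require reindexing, which answers Peter's last question.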
RE: PDF file
Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server. (path inside the solr 1.4.1 download) /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon From: Ma, Xiaohui (NIH/NLM/LHC) [C] [xiao...@mail.nlm.nih.gov] Sent: Tuesday, August 10, 2010 11:57 AM To: 'solr-user@lucene.apache.org' Subject: RE: PDF file Does anyone have any experience with PDF files? I really appreciate your help! Thanks so much in advance. -Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Tuesday, August 10, 2010 10:37 AM To: 'solr-user@lucene.apache.org' Subject: PDF file I have a lot of pdf files. I am trying to import pdf files to solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. The Solr 1.4 Enterprise Search Server book uses the following command to import mccm.pdf: curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@mccm.pdf Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help! - SECURITY/CONFIDENTIALITY WARNING: This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited.
If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to receive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. -
RE: PDF file
Thanks so much for your help! I tried to index a pdf file and got the following. The command I used is:

curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@pub2009001.pdf

Did I do something wrong? Do I need to modify anything in schema.xml or another configuration file?

[xiao...@lhcinternal lhc]$ curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@pub2009001.pdf
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 404</title>
</head>
<body><h2>HTTP ERROR: 404</h2><pre>NOT_FOUND</pre>
<p>RequestURI=/solr/lhc/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
</body>
</html>

-Original Message- From: Sharp, Jonathan [mailto:jsh...@coh.org] Sent: Tuesday, August 10, 2010 4:37 PM To: solr-user@lucene.apache.org Subject: RE: PDF file Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server. (path inside the solr 1.4.1 download) /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon
Re: Need help with facets
Thanks Ahmet, that worked! Here's another issue I have. Like I said before, I have these fields in Solr documents: FirstName, LastName, RecruitedDate, VolumeDate (just added this in this email), VolumeDone (just added this in this email). Now I have to get the sum of all VolumeDone (integer field) for this month by everyone, then take 25% of that number and get all people whose volume was more than that. Is there a way to do this? :D I did some research but I wasn't able to come up with an answer. Thanks, Moazzam On Tue, Aug 10, 2010 at 1:42 PM, Ahmet Arslan iori...@yahoo.com wrote: I have a solr index whose documents have the following fields: FirstName LastName RecruitedDate I update the index when any of the three fields change for that specific person. I need to get facets based on when someone was recruited. The facets are: Recruited within 1 month, Recruited within 3 months, ... So if 10 people were recruited within the past month then the count for "Recruited within 1 month" will be 10. Is there a way to calculate the facets from RecruitedDate? It is possible with facet.query; something like: q=*:*&facet=on&facet.query=RecruitedDate:[NOW-1MONTH TO NOW]&facet.query=RecruitedDate:[NOW-3MONTHS TO NOW]
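Solr 1.4 has no aggregate query that sums a field and then filters by a percentage of that sum, so one pragmatic option is to fetch the month's documents (e.g. with a range query on VolumeDate) and compute the threshold client-side. A sketch with made-up data; in practice the dicts would come from the Solr response:

```python
# Made-up recruits; in practice these come from a query like
# VolumeDate:[NOW/MONTH TO NOW] returning name and VolumeDone fields.
people = [
    {"name": "A", "VolumeDone": 10},
    {"name": "B", "VolumeDone": 30},
    {"name": "C", "VolumeDone": 60},
]

total = sum(p["VolumeDone"] for p in people)      # sum for the month
threshold = 0.25 * total                          # 25% of the total
above = [p["name"] for p in people if p["VolumeDone"] > threshold]
print(total, threshold, above)
```

The trade-off is having to pull all of the month's documents; for large result sets you would page through them rather than fetch everything at once.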
Re: How to compile nightly build?
You don't have to download the source. You can just download the binary distribution from their site and run it without compiling it. - Moazzam On Tue, Aug 10, 2010 at 1:48 PM, harrysmith harrysmith...@gmail.com wrote: I am attempting to follow the instructions located at: http://wiki.apache.org/solr/ExtractingRequestHandler#Getting_Started_with_the_Solr_Example I have downloaded the most recent clean build from Hudson. After running 'ant example' I get the following error:

C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29>ant example
Buildfile: C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\build.xml
init-forrest-entities:
compile-lucene:
BUILD FAILED
C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\common-build.xml:214: C:\solr_build\modules\analysis\common does not exist.
Total time: 0 seconds

What is the correct procedure? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-compile-nightly-build-tp1077115p1077115.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Modifications to AbstractSubTypeFieldType
Compound types are young and will probably mutate. I will do my own hack until things settle down. Lance On Mon, Jul 12, 2010 at 12:47 AM, Mark Allan mark.al...@ed.ac.uk wrote: On 7 Jul 2010, at 6:24 pm, Yonik Seeley wrote: On Wed, Jul 7, 2010 at 8:15 AM, Grant Ingersoll gsing...@apache.org wrote: Originally, I had intended that it was just for one Field Sub Type, thinking that if we ever wanted multiple sub types, that a new, separate class would be needed Right - this was my original thinking too. AbstractSubTypeFieldType is only a convenience class to create compound types... people can do it other ways. Just for clarification, does that mean my modifications won't be included? If so, can you let me know so that I can extract the changes and maintain them in a different package structure from the main Solr code please. Cheers Mark -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. -- Lance Norskog goks...@gmail.com
Re: How to compile nightly build?
In this particular case I would like to get the trunk. Is there a different link for binary distributions of nightly builds? I had been downloading from here: http://hudson.zones.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/ In the case I did want to compile from the source, am I missing a step? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-compile-nightly-build-tp1077115p1080266.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 1.4 - stats page slow
Apologies if this was resolved, but we just deployed Solr 1.4.1 and the stats page takes over a minute to load for us as well and began causing OutOfMemory errors so we've had to refrain from hitting the page. From what I gather, it is the fieldCache part that's causing it. Was there ever an official fix or recommendation on how to disable the stats page from calculating the fieldCache entries? If we could just ignore it, I think we'd be good to go since I find this page very useful otherwise. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-1-4-stats-page-slow-tp498810p1081193.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: PDF file
Try:

curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?stream.file=Full_Path_of_File/pub2009001.pdf&literal.id=777045&commit=true'

stream.file - specify the full path
literal.* params - specify any extra params if needed

Regards, Jayendra

On Tue, Aug 10, 2010 at 4:49 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov wrote: Thanks so much for your help! I tried to index a pdf file and got the following. The command I used is curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@pub2009001.pdf Did I do something wrong? Do I need to modify anything in schema.xml or another configuration file? The server returned Jetty's "HTTP ERROR: 404 NOT_FOUND" page for RequestURI=/solr/lhc/update/extract. -Original Message- From: Sharp, Jonathan [mailto:jsh...@coh.org] Sent: Tuesday, August 10, 2010 4:37 PM To: solr-user@lucene.apache.org Subject: RE: PDF file Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server. (path inside the solr 1.4.1 download) /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon
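To index a whole directory of PDFs (the earlier question in this thread), one option is a small script that issues one extract request per file. This sketch only builds the request URLs and does not send them; the endpoint is the one from the thread, and using the file name stem as literal.id is an assumption for illustration:

```python
from pathlib import Path
from urllib.parse import urlencode

# Endpoint taken from the thread; adjust for your own server/core.
BASE = "http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract"

def extract_urls(directory: str) -> list[str]:
    """Build one extract request URL per PDF, using the file name stem as the id."""
    urls = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        qs = urlencode({
            "stream.file": str(pdf.resolve()),  # stream.file wants a full path
            "literal.id": pdf.stem,             # assumed id scheme, not from the thread
            "commit": "true",
        })
        urls.append(f"{BASE}?{qs}")
    return urls
```

Committing once after the loop (rather than commit=true on every file) would be cheaper for large directories.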
Re: DIH and multivariable fields problems
Glad I could help. I also would have thought it was a very common issue. Personally my schema is almost all dynamic fields. I have unique_id, content, last_update_date and maybe one other field specifically defined; the rest are all dynamic. This lets me accept an almost endless variety of document types into the same schema. So if I planned on using DIH I had to come up with a way, and stitching together solutions to a couple of related issues got me to my script transform. Mine is more convoluted than the one I gave here, but obviously you got the gist of the idea. -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1081738.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
Thanks for your response, Jan. I just learned that post.jar is only an example tool, so what should I use instead of post.jar for production? Btw, I already tried using this command:

java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml

and IT WORKS!! The cat_817.xml is reflected directly in the solr query after I commit it; this is the url:

http://localhost:8983/search/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on

The problem is that it only works if the old xml contains fewer docs than the new xml. For example, if the old cat_817.xml contains 2 docs and the new cat_817.xml contains 10 docs, then I just have to re-index (java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml) and the query result will be correct (10 docs), but it doesn't work vice versa. If the old cat_817.xml contains 10 docs and the new cat_817.xml contains 2 docs, then I have to delete the index first (java -Ddata=args -Dcommit=yes -jar post.jar "<delete><query>ITEM_CAT:817</query></delete>") and re-index it (java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml) to make the query result updated (2 docs). Is this a normal process or is something wrong with my solr? Once again thanks, Jan; your help really makes my day brighter :) and I believe your answer will help many solr newbies, especially me -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1081802.html Sent from the Solr - User mailing list archive at Nabble.com.
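The asymmetry described above is expected behavior rather than a bug: posting a document whose uniqueKey already exists overwrites it, but documents present only in the old file are never touched, so when the new file has fewer docs the leftovers linger until explicitly deleted. A dict-based sketch of that behavior (the doc ids and category are made up to mirror the cat_817 example):

```python
# Simulate a Solr index as a dict keyed by uniqueKey.
index = {}

def post(docs):
    """Re-posting overwrites by uniqueKey but never removes absent docs."""
    for doc in docs:
        index[doc["id"]] = doc

def delete_by_cat(cat):
    """Analogue of <delete><query>ITEM_CAT:...</query></delete>."""
    for k in [k for k, d in index.items() if d["cat"] == cat]:
        del index[k]

post([{"id": i, "cat": 817} for i in range(10)])  # old file: 10 docs
post([{"id": i, "cat": 817} for i in range(2)])   # new file: 2 docs
print(len(index))  # still 10: the 8 docs missing from the new file remain

delete_by_cat(817)
post([{"id": i, "cat": 817} for i in range(2)])
print(len(index))  # now 2, matching the new file
```

So the delete-then-reindex step for the shrinking-file case is the normal process, not a misconfiguration.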