Re: performance crossover between single index and sharding
Hi Shawn, the 0.05 seconds for search time at peak times (3 qps) is my target for Solr. The numbers for Solr are from Solr's statistics report page. So 39.5 seconds average per request is definitely too long and I have to change to sharding.

For the FAST system the numbers for the search dispatcher are:
- 0.042 sec elapsed per normal search, on avg.
- 0.053 sec average uncached normal search time (last 100 queries)
- 99.898% of searches within 1 sec
- 99.999% of searches within 3 sec
- 0.000% of all requests timed out
- 22454567.577 sec time up (that is 259 days)
Is there a report page for those numbers for Solr?

About the RAM, the 32GB RAM are physical for each VM and the 20GB RAM are -Xmx for Java. Yesterday I noticed that we are running out of heap during replication, so I have to increase -Xmx to about 22g.

The reported 0.6 average requests per second seems right to me because the Solr system isn't under full load yet. The FAST system is still taking most of the load. I plan to switch completely to Solr after sharding is up and running stably. So there will be an additional 3 qps to Solr at peak times.

I don't know if a controlling master like FAST makes any sense for Solr. The small VMs with heartbeat and haproxy sound great; that must go on my todo list. But the biggest problem currently is how to configure the DIH to split up the content to several indexers. Is there an indexing distributor? Regards, Bernd

On 03.08.2011 16:33, Shawn Heisey wrote: Replies inline. On 8/3/2011 2:24 AM, Bernd Fehling wrote: To show that I compare apples and oranges, here is my previous FAST Search setup:
- one master server (controlling, logging, search dispatcher)
- six index servers (4.25 mio docs per server, 5 slices per index) (searching and indexing at the same time, indexing once per week during the weekend)
- each server has 4GB RAM, all servers are physical on separate machines
- RAM usage controlled by the processes
- total of 25.5 mio docs (mainly metadata) from 1500 databases worldwide
- index size is about 67GB per indexer -- about 402GB total
- about 3 qps at peak times
- with average search time of 0.05 seconds at peak times

An average query time of 50 milliseconds isn't too bad. If the number from your Solr setup below (39.5) is the QTime, then Solr thinks it is performing better, but Solr's QTime does not include absolutely everything that has to happen. Do you by chance have 95th and 99th percentile query times for either system?

And here is now my current Solr setup:
- one master server (indexing only)
- two slave servers (search only), but only one is online; the second is fallback
- each server has 32GB RAM, all servers are virtual (master on a separate physical machine, both slaves together on a physical machine)
- RAM usage is currently 20GB for the Java heap
- total of 31 mio docs (all metadata) from 2000 databases worldwide
- index size is 156GB total
- search handler statistics report 0.6 average requests per second
- average time per request 39.5 (is that seconds?)
- building the index from scratch takes about 20 hours

I can't tell whether you mean that each physical host has 32GB or each VM has 32GB. You want to be sure that you are not oversubscribing your memory. If you can get more memory in your machines, you really should. Do you know whether that 0.6 seconds is most of the delay that a user sees when making a search request, or are there other things going on that contribute more delay?
In our webapp, the Solr request time is usually small compared with everything else the server and the user's browser are doing to render the results page. As much as I hate being the tall pole in the tent, I look forward to the day when the developers can change that balance.

The good thing is I have the ability to compare a commercial product and enterprise system to open source. I started with my simple Solr setup because of KISS (keep it simple and stupid). Actually it is doing excellently as a single index on a single virtual server. But the average time per request should be reduced now; that's why I started this discussion. While searches with a smaller Solr index size (3 mio docs) showed that it can stand with FAST Search, it now shows that it's time to go with sharding. I think we are already far behind the point of search performance crossover.

What I hope to get with sharding:
- reduce time for building the index
- reduce average time per request

You will probably achieve both of these things by sharding, especially if you have a lot of CPU cores available. Like mine, your query volume is very low, so the CPU cores are better utilized distributing the search.

What I fear with sharding:
- I currently have master/slave; do I then have e.g. 3 masters and 3 slaves?
- the query changes because of sharding (is there a search distributor?)
- how to distribute the content to the indexers with DIH on 3 servers?
- anything else to think about while changing to sharding?
Re: segment.gen file is not replicated
I have now updated to Solr 3.3 but segments.gen is still not replicated. Any idea why? Is it a bug or a feature? Should I write a Jira issue for it? Regards, Bernd

On 29.07.2011 14:10, Bernd Fehling wrote: Dear list, is there a deeper logic behind why the segments.gen file is not replicated with Solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex
Hi Erick, thanks a lot! This looks like a good idea: our queries with the changeable fields fit the join idea from https://issues.apache.org/jira/browse/SOLR-2272 because
- we do not need relevance ranking
- we can separate into a conjunction of a query with the changeable fields and our other stable fields

So we can use something like

  q=stablefields:query1&fq={!join from=changeable_fields_doc_id to=stable_fields_doc_id}changeablefields:query2

The only drawback of the solution with ParallelReader is that our stored fields and term vectors will be divided over two Lucene docs, which is OK in our use case. Best regards, Karsten

In context: http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

Original message: Date: Wed, 3 Aug 2011 22:11:08 -0400; From: Erick Erickson erickerick...@gmail.com; To: solr-user@lucene.apache.org; Subject: Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader & FilterIndex

Hmmm, the only thing that comes to mind is the join feature being added to Solr 4.x, but I confess I'm not entirely familiar with that functionality so can't tell if it really solves your problem. Other than that I'm out of ideas, but then again it's late and I'm tired so maybe I'm not being very creative G... Best, Erick. On Aug 3, 2011 11:40 AM, karsten-s...@gmx.de wrote:
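Spelled out as a full request, for reference (the host and the placeholder field/term names are illustrative, and the fq local-params value would be URL-encoded in practice):

  http://localhost:8983/solr/select?q=stablefields:query1&fq={!join from=changeable_fields_doc_id to=stable_fields_doc_id}changeablefields:query2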
Re: segment.gen file is not replicated
This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. I.e., normally we list the directory to find the latest segments_N file, but if this is wrong (e.g. the filesystem might have a stale cache) then we fall back to reading the segments.gen file. For example, this is sometimes needed for NFS. Likely replication is just skipping it? Mike McCandless http://blog.mikemccandless.com

On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: I have now updated to Solr 3.3 but segments.gen is still not replicated. Any idea why? Is it a bug or a feature? Should I write a Jira issue for it? Regards, Bernd

On 29.07.2011 14:10, Bernd Fehling wrote: Dear list, is there a deeper logic behind why the segments.gen file is not replicated with Solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: A rant about field collapsing
The development of the field collapsing feature is a long and confusing story. The main point is that SOLR-236 was never going to scale and the performance in general was bad. A new approach was needed. This was implemented in SOLR-1682 and added to the trunk (4.0-dev) around September last year. Later in LUCENE-1421 the code was moved from Solr to Lucene as a module / contrib. After that the grouping module and contrib were wired into Solr 3.3 and 4.0-dev in SOLR-2524 and SOLR-2564.

Field collapsing is not gone; it is just a form of result grouping. The core SOLR-236 feature is in 3.3 / 4.0-dev. Other features that SOLR-236 offered will eventually get in, like for example post-grouping facets. The HTTP parameters and response have changed without keeping them compatible with the SOLR-236 patches. I think that isn't a problem, since SOLR-236 was never a committed feature. But a widely used feature should never be attached as a patch to a Jira issue for 3+ years.

On 3 August 2011 18:33, baronDodd barond...@googlemail.com wrote: I am working on an implementation of search within our application using Solr. About 2 months ago we had the need to group results by a certain field. After some searching I came across the JIRA in progress for this, field collapsing: https://issues.apache.org/jira/browse/SOLR-236 It was scheduled for the next Solr release and had a full set of proper JIRA subtasks and patch files of almost complete implementations attached. So as you can imagine I was happy to apply this patch and build it into our application and await the next release, when it would be part of the main trunk. Now imagine my surprise, when we came around to upgrading, to see that suddenly field collapsing has been thrown away in favour of a totally different grouping implementation: https://issues.apache.org/jira/browse/SOLR-2524 How was it decided that this would be used instead? It was not made very clear that LUCENE-1421 was in progress, which would effectively make the field collapsing work irrelevant by fixing the problem in Lucene rather than primarily in Solr. This has cost me days of work to now merge our custom changes somehow into the new implementation. I guess it is my own fault for basing our custom changes around an unresolved enhancement, but as SOLR-236 had been 3-4 years in progress and SOLR-2524 did not exist at the time, it seemed pretty safe to assume that the same problem was not being fixed in 2 totally different ways!

-- View this message in context: http://lucene.472066.n3.nabble.com/A-rant-about-field-collapsing-tp3222798p3222798.html Sent from the Solr - User mailing list archive at Nabble.com. -- Kind regards, Martijn van Groningen
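For reference, the committed result grouping in 3.3 / 4.0-dev is requested with the group.* parameters rather than the old patch's collapse parameters. A minimal illustrative request (field name hypothetical) looks like:

  http://localhost:8983/solr/select?q=*:*&group=true&group.field=manufacturer&group.limit=3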
Re: Solr 3.3 crashes after ~18 hours?
Thank you for the many replies! Like I said, I couldn't find anything in the logs created by Solr. I just had a look at /var/log/messages and there wasn't anything either. What I mean by crash is that the process is still there and HTTP GET pings would return 200, but when I try visiting /solr/admin, I get a blank page! The server ignores any incoming updates or commits, thus throwing no errors, no 503s. It's like the server has a blackout and stares blankly into space. I have now allocated more memory as proposed and will keep an eye on it to see if the problem still persists. Thank you guys, you are awesome.

On 02.08.2011 15:23, François Schiettecatte wrote: Assuming you are running on Linux, you might want to check /var/log/messages too (the location might vary); I think the kernel logs forced process terminations there. I recall that the kernel will usually pick the process consuming the most memory; there may be other factors involved too. François

On Aug 2, 2011, at 9:04 AM, wakemaster 39 wrote: Monitor your memory usage. I used to encounter a problem like this before, where nothing was in the logs and the process was just gone. It turned out my system was out of memory and swap got used up because of another process, which then forced the kernel to start killing off processes. Google OOM Linux and you will find plenty of other programs and people with a similar problem. Cameron

On Aug 2, 2011 6:02 AM, alexander sulz a.s...@digiconcept.net wrote: Hello folks, I'm using the latest stable Solr release, 3.3, and I encounter a strange phenomenon with it. After about 19 hours it just crashes, but I can't find anything in the logs: no exceptions, no warnings, no suspicious info entries. I have an index job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize job is done twice a day, at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong, or where to look for further debug info? Regards and thank you, alex
Re: segment.gen file is not replicated
On 04.08.2011 12:52, Michael McCandless wrote: This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. I.e., normally we list the directory to find the latest segments_N file, but if this is wrong (e.g. the filesystem might have a stale cache) then we fall back to reading the segments.gen file. For example, this is sometimes needed for NFS. Likely replication is just skipping it?

That was my first idea: if it is not changed and touched then it will be skipped. While being smart I deleted it on the slave from the index dir and then replicated, but segments.gen was not replicated. Going by your explanation, NFS could then not be reliable any more. So my conclusion is: either a bug or a feature, and the experts will know :-) Regards, Bernd

Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: I have now updated to Solr 3.3 but segments.gen is still not replicated. Any idea why? Is it a bug or a feature? Should I write a Jira issue for it? Regards, Bernd On 29.07.2011 14:10, Bernd Fehling wrote: Dear list, is there a deeper logic behind why the segments.gen file is not replicated with Solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)
Concerning the downtime, we found a solution that works well for us. We already implemented an update mechanism so that when authors change some content in the CMS, the index for that piece of content gets updated (delete, then index again) as well. All we had to do is:
1. Change the schema.xml to support the PhoneticFilter in certain fieldtypes.
2. Write a script that finds all individual content items.
3. Start the update mechanism for each piece of content, one after another.

So the index slowly migrates from the old to the new phonetic state without any noticeable downtime for users of the search function. It's just that they get somewhat mixed results for the duration of the transition. Sure, it takes some time, but we can have CMS users working with content the whole time. If they create or update content during the transition it will be indexed, or reindexed, following the new schema.xml anyway. If we need to roll back, we just replace the schema.xml with the old version and start the update process again. So far this is working; thanks for your support!

-- View this message in context: http://lucene.472066.n3.nabble.com/German-language-specific-problem-automatic-Spelling-correction-automatic-Synonyms-tp3216278p3225223.html Sent from the Solr - User mailing list archive at Nabble.com.
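For step 1, a minimal sketch of such a schema.xml change, assuming the DoubleMetaphone encoder; the fieldtype name and encoder choice here are illustrative, not the poster's actual configuration:

  <fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- inject="true" keeps the original tokens alongside the phonetic codes -->
      <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    </analyzer>
  </fieldType>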
Unbuffered entity enclosing request can not be repeated / Invalid chunk header
Hello folks, I use Solr 1.4.1, and every 2 to 6 hours I have indexing errors in my log files.

On the client side: 2011-08-04 12:01:18,966 ERROR [Worker-242] IndexServiceImpl - Indexing failed with SolrServerException. Details: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.: Stacktrace: org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:469) . .

On the server side: INFO: [] webapp=/solr path=/update params={wt=javabin&version=1} status=0 QTime=3 04.08.2011 12:01:18 org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {} 0 0 04.08.2011 12:01:18 org.apache.solr.common.SolrException log SCHWERWIEGEND: org.apache.solr.common.SolrException: java.io.IOException: Invalid chunk header . . .

I'm indexing ONE document per call, 15-20 documents per second, 24/7. What may be the problem? Best regards, Vadim
Re: performance crossover between single index and sharding
We have 16 shards on 4 physical servers. Shard size was determined by measuring query response times as a function of doc count. Multiple shards per server provide parallelism. In a VM environment, I would lean towards 1 shard per VM (with 1/4 the RAM). We implemented our own distributed search (pre-Solr) and the extra sort/merge processing is not a performance issue. Peter

On Tue, Aug 2, 2011 at 2:35 PM, Burton-West, Tom tburt...@umich.edu wrote: Hi Jonathan and Markus, "Why 3 shards on one machine instead of one larger shard per machine?" Good question! We made this architectural decision several years ago and I'm not remembering the rationale at the moment. I believe we originally made the decision due to some tests showing a sweet spot for I/O performance for shards with 500,000-600,000 documents, but those tests were made before we implemented CommonGrams and when we were still using attached storage. I think we also might have had concerns about Java OOM errors with a really large shard/index, but we now know that we can keep memory usage under control by tweaking the amount of the terms index that gets read into memory. We should probably do some tests and revisit the question. The reason we don't have 12 shards on 12 machines is that current performance is good enough that we can't justify buying 8 more machines. :) Tom

-----Original Message----- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, August 02, 2011 2:12 PM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding Hi Tom, Very interesting indeed! But I keep wondering why some engineers choose to store multiple shards of the same index on the same machine; there must be significant overhead. The only reason I can think of is ease of maintenance in moving shards to a separate physical machine. I know that rearranging the shard topology can be a real pain in a large existing cluster (e.g. consistent hashing is not consistent anymore, and docs have to be shuffled to their new shards); is this the reason you chose this approach? Cheers
RE: performance crossover between single index and sharding
Dumb question time: you are using a 64-bit Java, and not a 32-bit Java? Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com

-----Original Message----- From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] Sent: Thursday, August 04, 2011 2:39 AM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding
Re: performance crossover between single index and sharding
java version 1.6.0_21 Java(TM) SE Runtime Environment (build 1.6.0_21-b06) Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode) java: file format elf64-x86-64 Including the -d64 switch.

On 04.08.2011 14:40, Bob Sandiford wrote: Dumb question time: you are using a 64-bit Java, and not a 32-bit Java?
Re: A rant about field collapsing
Ok, thank you very much for clearing that up a little. I think another reason I was confused was that the wiki page for grouping was based around the original field collapsing plan at the time, which led me to the Jira and hence the patch files. Rant over! Perhaps you can help to clarify whether the current grouping changes work with SolrJ? In QueryResponse.setResponse() there is a loop which builds up the results object, but it has no check at present for "grouped" in the NamedList, so the SolrJ client gets no results back when searching with grouping parameters. I assume I can add this in my local working copy and all will be well?

-- View this message in context: http://lucene.472066.n3.nabble.com/A-rant-about-field-collapsing-tp3222798p3225361.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: A rant about field collapsing
Well, the original page moved to: http://wiki.apache.org/solr/FieldCollapsingUncommitted

Assuming that you're using Solr 3.3, you can't get the grouped result (<lst name="grouped">) with SolrJ. I added grouping support to SolrJ some time ago and it will be in Solr 3.4; you can use a nightly 3.x build to get the grouping support now. You can also use the group.main=true option, which returns a response that is compatible with the normal search response. However, you can then only use one group command per request (group.field, group.func or group.query). Also, there were some bugs with this response format in Solr 3.3 that have been fixed and will be included when Solr 3.4 is released. Martijn

On 4 August 2011 15:24, baronDodd barond...@googlemail.com wrote: Ok, thank you very much for clearing that up a little. -- Kind regards, Martijn van Groningen
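As an illustration of the group.main=true style Martijn describes (field name hypothetical), the grouped hits come back in the ordinary response format that any Solr 3.3 SolrJ client can already parse:

  http://localhost:8983/solr/select?q=foo&group=true&group.field=category&group.main=true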
Re: Solr 3.3 crashes after ~18 hours?
On Thu, Aug 4, 2011 at 8:09 AM, alexander sulz a.s...@digiconcept.net wrote: Thank you for the many replies! Like I said, I couldn't find anything in the logs created by Solr. I just had a look at /var/log/messages and there wasn't anything either. What I mean by crash is that the process is still there and HTTP GET pings would return 200, but when I try visiting /solr/admin, I get a blank page! The server ignores any incoming updates or commits,

"Ignores" means what? The request hangs? If so, could you get a thread dump? Do queries work (like /solr/select?q=*:*)?

thus throwing no errors, no 503s. It's like the server has a blackout and stares blankly into space.

Are you using a different servlet container than what is shipped with Solr? If you did start with the Solr example server, what Jetty configuration changes have you made? -Yonik http://www.lucidimagination.com
RE: Strategies for sorting by array, when you can't sort by array?
For anyone who comes across this topic in the future, I solved the problem this way: by agreement with the stakeholders, on the presumption that no one would look at more than 5000 records, I modified my search code so that, if the user selects to sort by the name, I set the row count to return (query.setRows) to 5000. I then put all the result records into a list, sort it, and then, depending on what page they're on, extract that subset of the 5000 and return it. There is a small performance hit on the initial search for common names (e.g. Smith, Jones, etc.), but the performance is still far more acceptable than the legacy system Solr is meant to replace (a few seconds as opposed to twenty(!) minutes). Most certainly there are better ways, but this one worked for me, and I wanted to make sure it was added to the pool of options for anyone who comes across this problem in the future. Thanks to everyone who offered suggestions! Ron

-----Original Message----- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, August 03, 2011 11:36 AM To: solr-user@lucene.apache.org Cc: Olson, Ron Subject: Re: Strategies for sorting by array, when you can't sort by array? Not so much that it's a corner case in the sense of being unusual necessarily (I'm not sure); it's just something that fundamentally doesn't fit well into Lucene's architecture. I'm not sure that filing a JIRA will be of much use; it's really unclear how one would get Lucene to do this, it would be significant work to do, and it's unlikely any Solr developer is going to decide to spend significant time on it unless they need it for their own clients.

On 8/3/2011 11:40 AM, Olson, Ron wrote: *Sigh*... I had thought maybe reversing it would work, but that would require creating a whole new index, on a separate core, as the existing index is used for other purposes. Plus, given the volume of data, that would be a big deal, update-wise. What would be better would be to remove that particular sort option button on the webpage. ;) I'll create a Jira issue, but in the meanwhile I'll have to come up with something else. I guess I didn't realize how much of a corner case this problem is. :) Thanks for the suggestions! Ron

-----Original Message----- From: Smiley, David W. [mailto:dsmi...@mitre.org] Sent: Wednesday, August 03, 2011 10:26 AM To: solr-user@lucene.apache.org Subject: Re: Strategies for sorting by array, when you can't sort by array? Hi Ron. This is an interesting problem you have. One idea would be to create an index with the entity relationship going in the other direction. So instead of one-to-many, go many-to-one. You would end up with multiple documents with varying names but repeated parent entity information, perhaps simply using just an ID which is used as a lookup. Do a search on this name field, sorting by a non-tokenized variant of the name field. Use result grouping to consolidate multiple matches of a name to the same parent document. This whole idea might very well be academic, since duplicating all the parent entity information for searching might be more than you care to bother with. And I don't think Solr 4's join feature addresses this use case. In the end, I think Solr could be modified to support this, with some work. It would make a good feature request in JIRA. ~ David Smiley

On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote: Hi all. Well, this is a problem. I have a list of names as a multi-valued field, I am searching on this field, and I need to return the results sorted. I know from searching and reading the documentation (and getting the error) that sorting on a multi-valued field isn't possible. Okay, so, what I haven't found is any real good solution/workaround to the problem. I was wondering what strategies others have used to overcome this particular situation; collapsing the individual names into a single field with copyField doesn't work because the name searched may not be the first name in the field. Thanks for any hints/tips/tricks. Ron
Re: performance crossover between single index and sharding
On 8/4/2011 12:38 AM, Bernd Fehling wrote: Hi Shawn, the 0.05 seconds for search time at peak times (3 qps) is my target for Solr. The numbers for Solr are from Solr's statistics report page. So 39.5 seconds average per request is definitely too long and I have to change to sharding.

Solr reports all query times in milliseconds. 39.5 would be 0.0395 seconds.

For the FAST system the numbers for the search dispatcher are: 0.042 sec elapsed per normal search, on avg.; 0.053 sec average uncached normal search time (last 100 queries); 99.898% of searches within 1 sec; 99.999% of searches within 3 sec; 0.000% of all requests timed out; 22454567.577 sec time up (that is 259 days). Is there a report page for those numbers for Solr?

The Solr statistics page normally reports averages, but not percentile statistics. You can add percentile-based statistics (on a limited subset of your queries) to a 3.x or trunk (4.0) version with SOLR-1972. I am using this patch in production. Alternatively, you can use INFO logging in Solr and crawl the logfiles to gather statistics. In the list below (the standard section on the stats page), the ones that start with "rolling" are provided by the patch; the others are included by default. Remember that all these times are in milliseconds.

handlerStart : 1312433464327
requests : 24112
errors : 547
timeouts : 0
totalTime : 2565584
avgTimePerRequest : 106.40279
avgRequestsPerSecond : 0.7097045
rollingRequests : 16384
rollingTotalTime : 1594420
rollingAvgTimePerRequest : 97.315674
rollingAvgRequestsPerSecond : 0.74394274
rollingMedian : 16
rolling75thPercentile : 35
rolling95thPercentile : 225
rolling99thPercentile : 2202
rollingMax : 9397

About the RAM, the 32GB RAM are physical for each VM and the 20GB RAM are -Xmx for Java. Yesterday I noticed that we are running out of heap during replication, so I have to increase -Xmx to about 22g.

That doesn't leave much RAM for the OS disk cache, the primary way to speed things up with Solr. You should check how long it takes to warm your caches when you commit; you can find that on the stats page. It's probably a good idea to lower your autowarmCount values. If you sharded, you could drop your Java heap size and get more of your index into RAM. I have a heap size of 3GB for an 18.25GB index (total of all shards is about 110GB) and do not expect to be increasing that unless we have problems when we start using faceting, spellchecking, and suggestions. I have made particular tweaks to garbage collection and wrote about my experiences on this list. My memory-related Java parameters: -Xms3072M -Xmx3072M -XX:NewSize=2048M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled

The reported 0.6 average requests per second seems right to me because the Solr system isn't under full load yet. The FAST system is still taking most of the load. I plan to switch completely to Solr after sharding is up and running stably. So there will be an additional 3 qps to Solr at peak times.

When I read that before, I thought you were saying it was 0.6 seconds per request, not requests per second. My apologies. A qps of 3 is quite low. I've seen numbers mentioned here above 3 qps, and I'm sure some of the list veterans have seen much higher.

I don't know if a controlling master like FAST makes any sense for Solr. The small VMs with heartbeat and haproxy sound great; that must go on my todo list.
If you don't create a core to automatically add the shards parameter (the master server idea), your application will have to include the parameter on every request, which means it must be aware of how you have sharded your index. If that's acceptable to you, there's no problem. In my case, every single Solr instance has a copy of this broker core. I only use it on two of them, the two that the load balancer knows about.

But the biggest problem currently is how to configure the DIH to split up the content to several indexers. Is there an indexing distributor?

There is currently no way to have Solr figure out distributed indexing. Solr doesn't know how you have sharded your data, and it cannot keep track of primary/secondary indexers. Your build system must figure these things out. My dih-config.xml accepts variables via the URL, which I use to tailor my SQL queries:

SELECT * FROM ${dataimporter.request.dataView}
WHERE (
  (
    did > ${dataimporter.request.minDid}
    AND did <= ${dataimporter.request.maxDid}
  )
  ${dataimporter.request.extraWhere}
)
AND (crc32(did) % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})

I index all new content to a smaller index which I have called the incremental. The updates that run every two minutes include a modVal for the above query of 0,1,2,3,4,5. Once a night, I figure out which content is older than one week.
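Returning to the broker core idea above: a minimal sketch of what such a core's request handler could look like in solrconfig.xml, with purely hypothetical shard host and core names (the real list must match your own topology):

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <!-- every query through this core fans out to the listed shards -->
      <str name="shards">idx1:8983/solr/s0,idx2:8983/solr/s1,idx3:8983/solr/s2</str>
    </lst>
  </requestHandler>

With this in place, clients query the broker core as if it were a single index, and only the load balancer needs to know which hosts carry it.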
Re: Solr 3.3 crashes after ~18 hours?
Check out your physical memory/virtual memory usage. RAM usage might be less, but physical memory usage goes up as you index more documents. It might be because of MMapDirectory, which uses MappedByteBuffer.

On Thu, Aug 4, 2011 at 7:38 PM, Yonik Seeley yo...@lucidimagination.com wrote: "Ignores" means what? The request hangs? If so, could you get a thread dump? Do queries work (like /solr/select?q=*:*)? Are you using a different servlet container than what is shipped with Solr? If you did start with the Solr example server, what Jetty configuration changes have you made? -Yonik http://www.lucidimagination.com
RE: Joining on multi valued fields
Hi Yonik, so I tested the join using the sample data below and the latest trunk. I still got the same behaviour. HOWEVER! In this case it had nothing to do with the patch or the Solr version: it was the tokeniser splitting G1 into G and 1. So thank you for a nice patch and your suggestions. I do have a couple of questions for you: at what level does the join happen, and what do you expect the performance penalty to be? We might use this extensively if the performance penalty isn't great. Thanks again, Matt

-----Original Message----- From: Fowler, Matthew (Markets Eikon) Sent: 03 August 2011 15:04 To: yo...@lucidimagination.com Cc: solr-user@lucene.apache.org Subject: RE: Joining on multi valued fields No, I haven't. I will get the latest out of the trunk and report back. Cheers again, Matt

-----Original Message----- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: 03 August 2011 14:51 To: Fowler, Matthew (Markets Eikon) Cc: solr-user@lucene.apache.org Subject: Re: Joining on multi valued fields Hmmm, if these are real responses from a Solr server at rest (i.e. documents not being changed between queries) then what you show definitely looks like a bug. That's interesting, since TestJoin implements a random test that should cover cases like this pretty well. I assume you are using a version of trunk (4.0-dev) and not just the actual patch attached to the JIRA issue (which IIRC had at least one bug... SOLR-2521). Have you tried a more recent version of trunk? -Yonik http://www.lucidimagination.com

On Wed, Aug 3, 2011 at 7:00 AM, matthew.fow...@thomsonreuters.com wrote: Hi Yonik, sorry for my late reply. I have been trying to get to the bottom of this but I'm getting inconsistent behaviour. Here's an example:

Query = pi:rcs100 (here I am going to use pid_rcs as the join value)

<result name="response" numFound="1" start="0">
  <doc>
    <str name="pi">rcs100</str>
    <str name="ct">rcs</str>
    <str name="pid_rcs">G1</str>
    <str name="name_rcs">Emerging Market Countries</str>
    <str name="definition_rcs">All business events relating to companies and other issuers of securities.</str>
  </doc>
</result>

Query = code:G1 (to see how many docs have G1 in their code field; notice that code is multi-valued)

<result name="response" numFound="2" start="0">
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF3wGpXk+1029782</str>
    <arr name="code">
      <str>G1</str> <str>G7U</str> <str>GK</str> <str>ME7</str> <str>ME8</str> <str>MN</str> <str>MR</str>
    </arr>
  </doc>
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF7YcLP+1029782</str>
    <arr name="code">
      <str>G1</str> <str>G7U</str> <str>GK</str> <str>ME7</str> <str>ME8</str> <str>MN</str> <str>MR</str>
    </arr>
  </doc>
</result>

Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join from=pid_rcs to=code}pi:rcs100

<result name="response" numFound="3" start="0">
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF3wGpXk+1029782</str>
    <arr name="code">
      <str>G1</str> <str>G7U</str> <str>GK</str> <str>ME7</str> <str>ME8</str> <str>MN</str> <str>MR</str>
    </arr>
  </doc>
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:57Z</date>
    <str name="pin">CIF7YcLP+1029782</str>
    <arr name="code">
      <str>G1</str> <str>G7U</str> <str>GK</str> <str>ME7</str> <str>ME8</str> <str>MN</str> <str>MR</str>
    </arr>
  </doc>
  <doc>
    <str name="ct">cat</str>
    <date name="maindocdate">2011-04-22T05:48:58Z</date>
    <str name="pin">CN1763203+1029782</str>
    <arr name="code">
      <str>A2</str> <str>A5</str> <str>A9</str> <str>AN</str> <str>B125</str> <str>B126</str> <str>B130</str> <str>BL63</str> <str>G41</str> <str>GK</str> <str>MZ</str>
    </arr>
  </doc>
</result>

So as you can see, I get back 3 results when only 2 match the criteria, i.e. docs where G1 is present in the multi-valued code field. Why should the last document be included in the result of the join? Thank you, Matt

-----Original Message----- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: 01 August 2011 18:28 To: solr-user@lucene.apache.org Subject: Re: Joining on multi valued fields On Mon, Aug 1, 2011 at 12:58 PM, matthew.fow...@thomsonreuters.com wrote: I have been using the JOIN patch https://issues.apache.org/jira/browse/SOLR-2272 with great success. However I have hit a case where it doesn't seem to be working: it doesn't seem to work when joining to a multi-valued field. That should work (and the unit tests do test with multi-valued fields). Can you come up with a simple example where you are not getting the expected results? -Yonik http://www.lucidimagination.com
Indexing tweet and searching @keyword OR #keyword
I have indexed around 1 million tweets (using the text dataType). When I search tweets with # or @ I don't get the exact result; e.g. when I search for #ipad or @ipad I get results where ipad is mentioned, skipping the # and @. Please suggest how to tune this, or which filter factories to use, to get the desired result. I am indexing the tweets as text; below is the text fieldtype from my schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>

-- Thanks and Regards, Mohammad Shariq
Re: Joining on multi valued fields
On Thu, Aug 4, 2011 at 11:21 AM, matthew.fow...@thomsonreuters.com wrote: Hi Yonik, so I tested the join using the sample data below and the latest trunk. I still got the same behaviour. HOWEVER! In this case it had nothing to do with the patch or the Solr version: it was the tokeniser splitting G1 into G and 1.

Ah, glad you figured it out!

So thank you for a nice patch and your suggestions. I do have a couple of questions for you: at what level does the join happen, and what do you expect the performance penalty to be? We might use this extensively if the performance penalty isn't great.

With the current implementation, the performance is proportional to the number of unique terms in the fields being joined. -Yonik http://www.lucidimagination.com
Re: Indexing tweet and searching @keyword OR #keyword
It's the WordDelimiterFilterFactory in your filter chain that's removing the punctuation entirely from your index, I think. Read up on what the WordDelimiter filter does and what its settings are; decide how you want things to be tokenized in your index to get the behavior you want; either get WordDelimiter to do it that way by passing it different arguments, or stop using WordDelimiter; come back with any questions after trying that!

On 8/4/2011 11:22 AM, Mohammad Shariq wrote: I have indexed around 1 million tweets (using the text dataType). When I search tweets with # or @ I don't get the exact result; e.g. when I search for #ipad or @ipad I get results where ipad is mentioned, skipping the # and @. Please suggest how to tune this, or which filter factories to use, to get the desired result.
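As a sketch of the "different arguments" route: more recent WordDelimiterFilterFactory versions accept a types attribute pointing at a character-mapping file, so # and @ can be declared ALPHA and survive tokenization. The file name is illustrative, and you should check that your Solr version supports this attribute:

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" types="wdfftypes.txt"/>

with a wdfftypes.txt such as:

  \u0023 => ALPHA
  @ => ALPHA

(\u0023 is '#', escaped because a literal # would start a comment line in that file.)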
How can I create a good autosuggest list with phrases?
I'm at the point in my Solr deployment where I want to start using it for autosuggest, but I've run into a snag. Because the fields that I want to use for autosuggest are tokenized, I can only get single terms out of them. I would like to have it find common phrases that are between two and five words long, so that if someone starts typing "ang", their autosuggest list will include "Angelina Jolie" as well as possibly "Brad Pitt and Angelina Jolie". My index is already quite large, so I do not want to add shingles.

I tried to use the clustering component, but that will only give you halfway decent results if you make the rows parameter absolutely huge, and then things run very slowly. Also, it only works against stored fields, so I can only run it against the field where we retrieve captions, not the full description. It's impractical to get results based on an entire index, much less all seven shards.

I'm OK with offline analysis to generate a list of suggestions, and I'm also OK with doing that analysis against the MySQL data source rather than Solr. I just need some pointers about what software and/or techniques I can use to generate a good list, and then some idea of how to configure Solr to use that list. Can anyone help? Thanks, Shawn
Re: How can I create a good autosuggest list with phrases?
We handled a similar requirement in our product kitchendaily.com by creating a list of search terms which were frequently searched over a period of time and then building an auto-suggestion index from this data. Constantly updating this list will allow you to support a well-formed auto-suggest feature. This is a good and fast solution if you have application logs to start with and not a very high volume of data.

Or you can search Solr with the user-entered data, which returns all the matching results, boost by the field which will be used in the auto-suggest box, and use the top 5 items in the dynamic div.

Hope it helps. -param

On 8/4/11 11:42 AM, Shawn Heisey s...@elyograg.org wrote: I'm at the point in my Solr deployment where I want to start using it for autosuggest, but I've run into a snag.
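As a sketch of such a dedicated suggestion index, a fieldtype like the following (names and gram sizes illustrative, not kitchendaily.com's actual setup) lets typed prefixes match whole stored phrases:

  <fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index prefixes of the whole phrase, so "ang" matches "angelina jolie" -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>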
Re: How can I create a good autosuggest list with phrases?
On 8/4/2011 10:04 AM, Sethi, Parampreet wrote: We handled a similar requirement in our product kitchendaily.com by creating a list of search terms which were frequently searched over a period of time and then building an auto-suggestion index from this data. Constantly updating this list will allow you to support a well-formed auto-suggest feature. This is a good and fast solution if you have application logs to start with and not a very high volume of data.

I do have some separate plans to include data from our query logs, but I'd also like to get data from the index itself, more than one term at a time. Thanks, Shawn
merge factor performance
Hi, we have a requirement where we have almost 100,000 documents to be indexed (at least 20 fields each). None of these fields is longer than 10 KB. We are also running parallel searches against the same index. We found that it is taking almost 3 min to index the entire set of documents.

Our strategy so far:
- we make a commit after 15000 docs (one single large xml doc)
- we have a merge factor of 10 as of now

I am wondering if increasing the merge factor to 25 or 50 would increase the performance. Also, what about the RAM size (default is 32 MB)? Which other factors do we need to consider? When should we run optimize? Would any other deviation from the defaults help us in achieving the target? We are allocating a JVM max heap size of 512 MB; default concurrent mark sweep is set for garbage collection. Thanks, Naveen
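For reference, the two knobs in question live in solrconfig.xml; a sketch with illustrative values only (whether raising them actually helps depends on your hardware and must be measured):

  <indexDefaults>
    <!-- buffer more documents in RAM before flushing a segment (default 32) -->
    <ramBufferSizeMB>128</ramBufferSizeMB>
    <!-- allow more segments to accumulate before merging (default 10): faster indexing, slower searching -->
    <mergeFactor>25</mergeFactor>
  </indexDefaults>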
Re: merge factor performance
Sorry, a correction: it is taking 3 mins for 15k docs.

On Thu, Aug 4, 2011 at 10:07 PM, Naveen Gupta nkgiit...@gmail.com wrote: Hi, we have a requirement where we have almost 100,000 documents to be indexed (at least 20 fields each).
Re: segment.gen file is not replicated
I think we should fix replication to copy it? Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2011 at 8:16 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: On 04.08.2011 12:52, Michael McCandless wrote: This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. I.e., normally we list the directory to find the latest segments_N file, but if this is wrong (e.g. the file system might have a stale cache) then we fall back to reading the segments.gen file. For example, this is sometimes needed for NFS. Likely replication is just skipping it? That was my first idea. If not changed and touched then it will be skipped. Trying to be smart, I deleted it from the index dir on the slave and then replicated, but segments.gen was not replicated. Going by your explanation, NFS might then not be reliable any more. So my guess: either a bug or a feature, and the experts will know :-) Regards Bernd Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: I have now updated to Solr 3.3 but segments.gen is still not replicated. Any idea why - is it a bug or a feature? Should I write a Jira issue for it? Regards Bernd On 29.07.2011 14:10, Bernd Fehling wrote: Dear list, is there a deeper logic behind why the segments.gen file is not replicated with Solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: using distributed search with the suggest component
Hi Tobias, sadly, it seems you are right. After a bit of investigation we also noticed that some names (we use it for auto-completing author names) are missing. And since it is a distributed setup ... But I am almost sure it worked with Solr 3.2. Best regards, Sebastian
Re: Solr 3.3 crashes after ~18 hours?
On Thu, Aug 4, 2011 at 10:08 AM, Yonik Seeley yo...@lucidimagination.com wrote: "ignores" means what? The request hangs? If so, could you get a thread dump? Do queries work (like /solr/select?q=*:*)? though throwing no errors, no 503's.. It's like the server has a blackout and stares blankly into space. Are you using a different servlet container than what is shipped with solr? If you did start with the solr example server, what jetty configuration changes have you made? -Yonik http://www.lucidimagination.com We're seeing something similar here. Not sure exactly what the circumstances are, but occasionally our Solr 3.3 test instance hangs and nothing seems to happen for several minutes. It does seem to happen while data is being added and continuous queries are being sent. It may also be related to an optimize happening (we attempt to optimize after adding all the new data from our database). The last log message is:

2011-08-04 13:46:56,418 [qtp30604342-451] INFO org.apache.solr.core.SolrCore - [report] webapp= path=/update params={optimize=true&waitSearcher=true&maxSegments=1&waitFlush=true&wt=javabin&version=2} status=0 QTime=109109

Here is our thread dump:

2011-08-04 13:47:16
Full thread dump Java HotSpot(TM) Client VM (20.1-b02 mixed mode):

"RMI TCP Connection(13)-172.16.10.102" daemon prio=6 tid=0x47a4a400 nid=0x1384 runnable [0x4861f000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        - locked <0x183a55a0> (a java.io.BufferedInputStream)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:517)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
        - <0x183a7c68> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"qtp30604342-451" prio=6 tid=0x475c4800 nid=0x1a58 waiting on condition [0x4897f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x18214c08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:320)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:512)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:38)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:558)
        at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
        - None

"qtp30604342-450" prio=6 tid=0x47ad1c00 nid=0x1ca4 waiting on condition [0x49d2f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x18214c08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:320)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:512)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:38)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:558)
        at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
        - None

"qtp30604342-449" prio=6 tid=0x47a57c00 nid=0xb2c waiting on condition [0x49c2f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x18214c08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
        at
Re: MultiSearcher/ParallelSearcher - searching over multiple cores?
Hi Erik, I have several types with different properties, but they are supposed to be combined into one search. Imagine a book with the property "title" and a journal with the property "name" (the types in my project of course have more complex properties). So I created a new core with combined search fields: "name" is indexed, "title" is indexed, and some shared properties like "id" are indexed. An additional Solr field "type" is also created. There are several indexers, one per type. A type-specific indexer stores only the fields of that type and additionally stores the type information, e.g. "book". After indexing, all types are in the same core. To search over all types, the query has to look like this: ((title: bla) and (type: book)) or ((name: bla) and (type: journal)). In the end you get books or journals sorted by boost factor - and you have the type information as a return field to differentiate the search results. I hope it is coherent. Thanks for your answer, Best Ralf
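A sketch of the schema fields this setup implies, following the book/journal example above (the exact field types are assumptions):

<field name="id"    type="string" indexed="true" stored="true"/>
<field name="type"  type="string" indexed="true" stored="true"/>
<field name="title" type="text"   indexed="true" stored="true"/>
<field name="name"  type="text"   indexed="true" stored="true"/>

Each type-specific indexer fills only its own fields plus "type", and the combined query from the message above then selects the right field per type.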
Re: Is there anyway to sort differently for facet values?
Thanks Erick for your reply. I am aware of facet.sort, but I haven't used it. I will try that though. Can it handle the values below in the correct order?

Under 10
10 - 20
20 - 30
Above 30

Or:

Small
Medium
Large
XL
...

My second question is: if Solr can't do that for the values above by using facet.sort, are there any other ways in Solr? Thanks in advance, YH On Wed, Aug 3, 2011 at 8:35 PM, Erick Erickson erickerick...@gmail.com wrote: have you looked at the facet.sort parameter? The index value is what I think you want. Best Erick On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote: Hi, guys, Is there any way to sort facet values differently? For example, sometimes I want to sort facet values by their values instead of # of docs, and I want to be able to have a predefined order for certain facets as well. Is it possible to do that in Solr? Thanks, YH
What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?
Hi, guys, What's the best way (practice) to do index distribution at this moment? Hadoop? Or rsyncd (back to 3 years ago ;-))? Thanks, Yugang
Re: Is there anyway to sort differently for facet values?
No, it can not. It just sorts alphabetically, actually by raw byte-order. No other facet sorting functionality is available, and it would be tricky to implement in a performant way because of the way Lucene works. But it would certainly be useful to me too if someone could figure out a way to do it. On 8/4/2011 2:43 PM, Way Cool wrote: Thanks Erick for your reply. I am aware of facet.sort, but I haven't used it. I will try that though. Can it handle the values below in the correct order? Under 10, 10 - 20, 20 - 30, Above 30. Or: Small, Medium, Large, XL ... My second question is: if Solr can't do that for the values above by using facet.sort, are there any other ways in Solr? Thanks in advance, YH On Wed, Aug 3, 2011 at 8:35 PM, Erick Erickson erickerick...@gmail.com wrote: have you looked at the facet.sort parameter? The index value is what I think you want. Best Erick On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote: Hi, guys, Is there any way to sort facet values differently? For example, sometimes I want to sort facet values by their values instead of # of docs, and I want to be able to have a predefined order for certain facets as well. Is it possible to do that in Solr? Thanks, YH
Re: lucene/solr, raw indexing/searching
I have decided to use solr for indexing as well. The types of searches I'm doing are professional/academic. So for example, I need to match all of the following exactly from my source data: 3.56, 4 harv. l. rev. 45, 187-532, 3 llm 56, 5 unts 8, 6 u.n.t.s. 78, father's obligation. I seem to keep running into issues getting this to work. The searching is being done on a text field that is not stored.
Re: What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?
I'm not sure what you mean by index distribution; that could mean several things. But Solr has had a replication feature built into it since 1.4 that can probably handle the same use cases as rsync, but better. So that may be what you want. There are certainly other experiments going on involving various kinds of scaling and distribution, including the sharding feature, that I'm not very familiar with. I don't know if anyone's tried to do anything with Hadoop. On 8/4/2011 2:52 PM, Way Cool wrote: Hi, guys, What's the best way (practice) to do index distribution at this moment? Hadoop? Or rsyncd (back to 3 years ago ;-))? Thanks, Yugang
Re: lucene/solr, raw indexing/searching
It depends. Okay, the source contains "4 harv. l. rev. 45". Do you want a user-entered "harv." to ALSO match "harv" (without the period) in the source, and vice versa? Or do you require it NOT match? Or do you not care? The default filter analysis chain will index "4 harv. l. rev. 45" essentially as 4;harv;l;rev;45. A phrase search for "4 harv. l. rev. 45" will match it, but so will a phrase search for "4 harv l rev 45", and in fact so will a phrase search for "4 harv. l. rev45". That could be good, or it could be bad. The point of the Solr analysis chain is to apply tokenization and transformation at both index time and query time, so queries will match source text in the way you want. You can customize this analysis chain however you want, in extreme cases even writing your own analyzers in Java. If the out-of-the-box default isn't doing what you want, you'll have to spend some time thinking about how an inverted index like Lucene works, and about what you want. You would need to provide a lot more specifications/details for someone else to figure out what analysis chain will do what you want, but I bet you can figure it out yourself after reading up a bit and thinking it through. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters On 8/4/2011 4:30 PM, dhastings wrote: I have decided to use solr for indexing as well. The types of searches I'm doing are professional/academic. So for example, I need to match all of the following exactly from my source data: 3.56, 4 harv. l. rev. 45, 187-532, 3 llm 56, 5 unts 8, 6 u.n.t.s. 78, father's obligation. I seem to keep running into issues getting this to work. The searching is being done on a text field that is not stored.
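If the goal is that punctuation must match exactly, one option is an analysis chain that does almost nothing. A minimal sketch for Solr 3.x (the fieldType name is made up):

<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split only on whitespace; periods and apostrophes survive as part of the token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this chain, a phrase query for "4 harv. l. rev. 45" matches only text that contains the periods, and "harv" no longer matches "harv." - which may or may not be the desired behavior, per the questions above.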
Re: Is there anyway to sort differently for facet values?
It could be achieved by creating your own (app-specific) custom comparators for fields defined in schema.xml, with an extra attribute to specify the comparator class in the field tag itself. But it would require changes in Solr to support this feature. (Not sure if it's feasible though, just throwing out an idea.) -param On 8/4/11 4:29 PM, Jonathan Rochkind rochk...@jhu.edu wrote: No, it can not. It just sorts alphabetically, actually by raw byte-order. No other facet sorting functionality is available, and it would be tricky to implement in a performant way because of the way Lucene works. But it would certainly be useful to me too if someone could figure out a way to do it. On 8/4/2011 2:43 PM, Way Cool wrote: Thanks Erick for your reply. I am aware of facet.sort, but I haven't used it. I will try that though. Can it handle the values below in the correct order? Under 10, 10 - 20, 20 - 30, Above 30. Or: Small, Medium, Large, XL ... My second question is: if Solr can't do that for the values above by using facet.sort, are there any other ways in Solr? Thanks in advance, YH On Wed, Aug 3, 2011 at 8:35 PM, Erick Erickson erickerick...@gmail.com wrote: have you looked at the facet.sort parameter? The index value is what I think you want. Best Erick On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote: Hi, guys, Is there any way to sort facet values differently? For example, sometimes I want to sort facet values by their values instead of # of docs, and I want to be able to have a predefined order for certain facets as well. Is it possible to do that in Solr? Thanks, YH
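A common workaround that requires no Solr changes is to encode the desired order into the indexed facet value itself and strip the prefix in the application. A sketch using the values from the question (the prefix scheme and field name are assumptions):

<field name="price_range" type="string" indexed="true" stored="true"/>

indexed value    displayed as
1|Under 10       Under 10
2|10 - 20        10 - 20
3|20 - 30        20 - 30
4|Above 30       Above 30

With facet.sort=index, the raw byte-order of the prefix then produces exactly the intended ordering.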
Re: What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?
Yes, I am talking about the replication feature. I remember I tried rsync 3 years ago with Solr 1.2. Just not sure if someone else has done anything better than that during the last 3 years. ;-) Personally I am thinking about using Hadoop and ZooKeeper. Has anyone tried those features? I found a couple of links below, but no success with them yet. http://wiki.apache.org/solr/SolrCloud http://wiki.apache.org/solr/DeploymentofSolrCoreswithZookeeper Thanks for your reply Jonathan. On Thu, Aug 4, 2011 at 2:31 PM, Jonathan Rochkind rochk...@jhu.edu wrote: I'm not sure what you mean by index distribution; that could mean several things. But Solr has had a replication feature built into it since 1.4 that can probably handle the same use cases as rsync, but better. So that may be what you want. There are certainly other experiments going on involving various kinds of scaling and distribution, including the sharding feature, that I'm not very familiar with. I don't know if anyone's tried to do anything with Hadoop. On 8/4/2011 2:52 PM, Way Cool wrote: Hi, guys, What's the best way (practice) to do index distribution at this moment? Hadoop? Or rsyncd (back to 3 years ago ;-))? Thanks, Yugang
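For the record, the built-in Java replication Jonathan mentions is configured via the ReplicationHandler in solrconfig.xml on both sides; a minimal sketch (host name, conf files, and poll interval are placeholders):

On the master:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On each slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>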
deleting index directory/files
Hello all, I'm using multiple cores. I see there's a directory named for the core, and it contains a subdir named data, which contains a subdir named index, which contains a bunch of files that hold the data for my index. Let's say I want to completely rebuild the index from scratch. Can I delete the dir named index? I know the next thing I'd have to do is a full data import, and that's ok. I want to blow away any traces of the core's previous existence. Mark
RE: deleting index directory/files
I ran into a problem when I deleted just the index directory; I deleted the entire data directory instead, and it was recreated on the next load. BTW, if you're using the DIH, its default behavior is to remove all records on a full import, so you can save yourself having to remove any actual files. -Original Message- From: Mark juszczec [mailto:mark.juszc...@gmail.com] Sent: Thursday, August 04, 2011 4:01 PM To: solr-user@lucene.apache.org Subject: deleting index directory/files Hello all, I'm using multiple cores. I see there's a directory named for the core, and it contains a subdir named data, which contains a subdir named index, which contains a bunch of files that hold the data for my index. Let's say I want to completely rebuild the index from scratch. Can I delete the dir named index? I know the next thing I'd have to do is a full data import, and that's ok. I want to blow away any traces of the core's previous existence. Mark
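To expand on the DIH point: a full import deletes the existing documents first because the clean parameter defaults to true, so a request like the one below rebuilds a core without touching any files on disk (host, port, and core name are the usual defaults and just placeholders):

http://localhost:8983/solr/corename/dataimport?command=full-import&clean=true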
Json update using HttpURLConnection
I am trying to post a JSON update request using java.net.HttpURLConnection. Parameters I am using: url: http://localhost:8983/solr/update/json?commit=true String data = "[{\"id\" : \"TestDoc7\", \"title\" : \"test 7\"}, {\"id\" : \"TestDoc8\", \"title\" : \"another test 8\"}]"; uri += data; String requestType = "POST"; Header info: Content-Type, application/json; Content-Length, data.length(). I see the request going to the Solr server but the data is not being persisted. I was able to add the same documents with curl, so I am sure I am doing something stupid. Any pointers on what might be the mistake? Thanks, Sharath
Re: Minimum Score
Hi, I am using Solr 3.1 with the SolrJ client library. I can see that it is possible to get the maximum score for your search by using the following: response.getResults().getMaxScore(). I am wondering, is there some simple solution to get the minimum score? Many thanks.
SOLR Support for Span Queries
How does one issue span queries in Solr (Span, SpanNear, etc.)? I've done a bit of research and it seems that these are not supported. It would seem that I need to implement a QueryParserPlugin to accomplish this. Is this the correct path? Surely this has been done before. Does anybody have links to examples? I had trouble finding anything. Thanks! Josh Harness
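If one does end up writing a query parser plugin for this, it gets registered in solrconfig.xml and selected per-query via local params. A sketch with an entirely hypothetical class name and parameter (nothing like this ships with Solr 3.x):

<queryParser name="span" class="com.example.SpanNearQParserPlugin"/>

A query would then select it with something like q={!span slop=3}some phrase, where the plugin itself defines what the parameters and query syntax mean.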
Re: Records skipped when using DataImportHandler
Ok. After analysis, I narrowed the reduced result set down to the fact that the zipcode field is not indexed 'as is', i.e. the zipcode field values are broken down into tokens and then indexed. Hence, if there are 10 documents with zipcode fields varying from 91000-91009, then the zipcode fields are not indexed as 91000, 91001, etc.; instead, the most common recurrences are grouped together and stored as tokens, hence resulting in a reduced result set. The net effect is that I cannot search for a value like 91000, since it's not indexed as-is. I suspect this has to do with the type of field the zipcode is associated with. Right now, zipcode is a field of type text_general, where the StandardTokenizerFactory may be breaking the values into tokens. However, I want to store them without tokenizing. What's the best field type to do this? I already explored the string field type, which is supposed to store the values as-is, but I see that the values are still being tokenized. Thanks, Anand On Wed, Aug 3, 2011 at 7:24 PM, Erick Erickson erickerick...@gmail.com wrote: Sorry, I'm on a restricted machine so can't get the precise URL. But there's a debug page for DIH that might allow you to see what the query actually returns. I'd guess one of two things: 1) you aren't getting the number of rows you think; 2) you aren't committing the documents you add. But that's just a guess. Best Erick On Aug 3, 2011 2:15 PM, anand sridhar anand.for...@gmail.com wrote: Hi, I am a newbie to Solr and have been trying to learn to use DataImportHandler. I have a query in data-config.xml that fetches about 5 records when I fire it in SQL Query manager. However, when Solr does a full import, it is skipping 4 records and only importing 1 record. What could be the reason for that? My data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource" name="GeoService"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://10.168.50.104/ZipCodeLookup"
              user="sa" password="psiuser"/>
  <document>
    <entity name="city" dataSource="GeoService"
            query="select ll.cityId as id, ll.zip as zipCode, c.cityName as cityName, st.stateName as state, ct.countryName as country from latlonginfo ll, city c, state st, country ct where ll.cityId = c.cityID and c.stateID = st.stateID and st.countryID = ct.countryID order by ll.areacode">
      <field column="zipCode" name="zipCode"/>
      <field column="cityName" name="cityName"/>
      <field column="state" name="state"/>
      <field column="country" name="country"/>
    </entity>
  </document>
</dataConfig>

My fields definition in schema.xml looks as below:

<field name="CityName" type="text_general" indexed="true" stored="true"/>
<field name="zipCode" type="text_general" indexed="true" stored="true"/>
<field name="state" type="text_general" indexed="true" stored="true"/>
<field name="country" type="text_general" indexed="true" stored="true"/>

One observation I made is that the 1 record being indexed is the last record in the result set. I have verified that there are no duplicate records being retrieved. For example, if the result set from the database is:

zipcode  CityName    state  country
-------  ----------  -----  -------
91324    Northridge  CA     USA
91325    Northridge  CA     USA
91327    Northridge  CA     USA
91328    Northridge  CA     USA
91329    Northridge  CA     USA
91330    Northridge  CA     USA

The record being indexed is the last record all the time. Any suggestions are welcome. Thanks, Anand
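For reference, a string field really is stored and indexed untokenized, so a definition like the one below should behave the way Anand wants; the catch is that a schema change only affects newly indexed documents, so the old index has to be wiped (or re-imported with clean=true) before the change becomes visible:

<field name="zipCode" type="string" indexed="true" stored="true"/>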
Re: Minimum Score
Off the top of my head, maybe you can get the number of results and then look at the last document and check its score. I believe the results will be ordered by score? On 08/04/2011 05:44 PM, Kissue Kissue wrote: Hi, I am using Solr 3.1 with the SolrJ client library. I can see that it is possible to get the maximum score for your search by using the following: response.getResults().getMaxScore(). I am wondering, is there some simple solution to get the minimum score? Many thanks.
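Another option: Solr accepts ascending sort on score, so a request like the following returns the lowest-scoring document for the query first (host, core, and query string are placeholders):

http://localhost:8983/solr/select?q=foo&sort=score+asc&rows=1&fl=*,score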
Re: Json update using HttpURLConnection
Never mind, it was some stupid bug. Figured it out. Cheers, Sharath On Thu, Aug 4, 2011 at 2:35 PM, Sharath Jagannath shotsonclo...@gmail.com wrote: I am trying to post a JSON update request using java.net.HttpURLConnection. Parameters I am using: url: http://localhost:8983/solr/update/json?commit=true String data = "[{\"id\" : \"TestDoc7\", \"title\" : \"test 7\"}, {\"id\" : \"TestDoc8\", \"title\" : \"another test 8\"}]"; uri += data; String requestType = "POST"; Header info: Content-Type, application/json; Content-Length, data.length(). I see the request going to the Solr server but the data is not being persisted. I was able to add the same documents with curl, so I am sure I am doing something stupid. Any pointers on what might be the mistake? Thanks, Sharath
Loading huge synonym list in Solr
Hello, I would like to know the best way to load a huge synonym list into Solr. I would like to do concept indexing (a.k.a. category indexing) with Solr. For example, I want to be able to index all cities and be able to search for all of them using a special keyword, say 'CONCEPTcity', where 'CONCEPTcity' will match anything that IS-A city, as specified in the index_synonyms.txt file. I believe the best way to do this is via the SynonymFilterFactory and index-time synonym expansion. Or is there a better alternative? I would still like to keep the original city names and do not want to replace them with 'CONCEPTcity', so if someone searches for 'Lake', the city name 'Salt Lake City' still matches. Also, obviously, I do not want two different city names to be synonyms of each other. Is the correct way to specify the index_synonyms.txt file like this?

CONCEPTcity, Salt Lake City
CONCEPTcity, New York
CONCEPTcity, San Jose
...

and then keep expand=true for SynonymFilterFactory? I tried to load a synonym file with 10K entries like this, and Solr/Jetty took a few seconds to start, but if I try to load a synonym file with 1M+ entries, then it takes a long time. What is the best way to do this? Thanks, Arun.
Re: Loading huge synonym list in Solr
https://issues.apache.org/jira/browse/LUCENE-3233 On Thu, Aug 4, 2011 at 7:24 PM, Arun Atreya my.2.pai...@gmail.com wrote: Hello, I would like to know the best way to load a huge synonym list into Solr. I would like to do concept indexing (a.k.a. category indexing) with Solr. For example, I want to be able to index all cities and be able to search for all of them using a special keyword, say 'CONCEPTcity', where 'CONCEPTcity' will match anything that IS-A city, as specified in the index_synonyms.txt file. I believe the best way to do this is via the SynonymFilterFactory and index-time synonym expansion. Or is there a better alternative? I would still like to keep the original city names and do not want to replace them with 'CONCEPTcity', so if someone searches for 'Lake', the city name 'Salt Lake City' still matches. Also, obviously, I do not want two different city names to be synonyms of each other. Is the correct way to specify the index_synonyms.txt file like this?

CONCEPTcity, Salt Lake City
CONCEPTcity, New York
CONCEPTcity, San Jose
...

and then keep expand=true for SynonymFilterFactory? I tried to load a synonym file with 10K entries like this, and Solr/Jetty took a few seconds to start, but if I try to load a synonym file with 1M+ entries, then it takes a long time. What is the best way to do this? Thanks, Arun. -- lucidimagination.com
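For context, LUCENE-3233 is about an FST-based synonym filter that loads very large synonym maps far faster and in much less memory, which is exactly the pain point described above. The setup itself would look something like this sketch on Solr 3.x (the fieldType name is illustrative; only the index-time analyzer applies the synonyms, so the original tokens are kept and CONCEPTcity is added alongside them):

<fieldType name="text_concept" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With expand=true, each line is its own group, so different city names never become synonyms of each other; multi-word entries like "Salt Lake City" interact with tokenization, though, so the behavior is worth verifying in the analysis page.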
Copy Fields while Replication
Hi, I would like to know whether I can add new fields while replicating an index to a slave. E.g. my master has an index with field F1, which is of type string. Now, I don't want F1 as type string, but I have the limitation that I cannot change the field type at the schema level. If I replicate that index to the slave, can I use the copyField directive to create a copy of field F1 in a field F2? F2 would be my field of type text, which I can use as per my requirement. Please suggest. Thanks Pawan
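Worth knowing here: Java replication copies the master's index files verbatim, and copyField is applied only at the moment a document is indexed, so a copy directive on the slave cannot derive F2 from already-replicated data. The directive would have to live in the master's schema, followed by a reindex; it looks like this:

<copyField source="F1" dest="F2"/>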
Solr DIH import - Date Question
This is perhaps a 'truly newbie' question. I am processing some files via the DIH handler/XPath processor. Some of the date fields in the XML are in 'Java long format', i.e. just a big long number. I am wondering how to map them to a Solr date field. I used the DIH DateFormatTransformer for some other date fields that were written out in a regular date format. However, I am stumped on this - I thought it would be simple, but I was not able to find a solution. Any help would be much appreciated. -g
Re: Solr DIH import - Date Question
You might have to do this with an external script. The DIH lets you process fields with JavaScript or Groovy. Also, somewhere in the DIH you can give an XSL stylesheet instead of just an XPath. On Thu, Aug 4, 2011 at 10:31 PM, solruser@9913 gunaranj...@yahoo.com wrote: This is perhaps a 'truly newbie' question. I am processing some files via the DIH handler/XPath processor. Some of the date fields in the XML are in 'Java long format', i.e. just a big long number. I am wondering how to map them to a Solr date field. I used the DIH DateFormatTransformer for some other date fields that were written out in a regular date format. However, I am stumped on this - I thought it would be simple, but I was not able to find a solution. Any help would be much appreciated. -g -- Lance Norskog goks...@gmail.com
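A sketch of the ScriptTransformer approach Lance suggests, assuming the epoch-milliseconds value arrives in a hypothetical column named lastModified (the file path and XPaths are placeholders too):

<dataConfig>
  <dataSource type="FileDataSource"/>
  <script><![CDATA[
    function epochToDate(row) {
      var ms = row.get('lastModified');  // epoch milliseconds, as read from the XML
      if (ms != null)
        row.put('lastModified', new java.util.Date(parseInt(ms)));
      return row;
    }
  ]]></script>
  <document>
    <entity name="doc" processor="XPathEntityProcessor"
            url="/path/to/file.xml" forEach="/docs/doc"
            transformer="script:epochToDate">
      <field column="lastModified" xpath="/docs/doc/lastModified"/>
    </entity>
  </document>
</dataConfig>

The function rewrites the column to a java.util.Date before the document reaches the schema, so the target field can be a normal Solr date type.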