Re: How to change tmp directory
Ah - all right, that's it! Thank you!

Erik

On 04.07.2012 at 17:59, Jack Krupansky wrote:

Solr is probably simply using Java's temp directory, which you can redefine by setting the java.io.tmpdir system property on the java command line or using a system-specific environment variable.

-- Jack Krupansky

-----Original Message----- From: Erik Fäßler Sent: Wednesday, July 04, 2012 3:56 AM To: solr-user@lucene.apache.org Subject: How to change tmp directory

Hello all, I came across an odd issue today when I wanted to add ca. 7M documents to my Solr index: I got a SolrServerException telling me "No space left on device". I had a look at the directory Solr (and its index) is installed in and there is plenty of space (~300GB). I then noticed that a file named upload_457ee97b_1385125274b__8000_0005.tmp had taken up all the space of the machine's /tmp directory. The partition holding the /tmp directory only has around 1GB of space and this file already took nearly 800MB. I had a look at it and realized that the file contained the data I was adding to Solr, in an XML format. Is there a way to change the temporary directory for this action? I use an Iterator<SolrInputDocument> with the HttpSolrServer's add(Iterator) method for performance, so I can't just do commits from time to time.

Best regards, Erik
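
For illustration, a minimal sketch of what Jack describes: java.io.tmpdir controls where Java temp files land, and it can be overridden on the JVM command line. The path below is just an example, not taken from the original setup.

    public class TmpDirCheck {
        public static void main(String[] args) throws Exception {
            // java.io.tmpdir is where File.createTempFile(...) puts files when no
            // directory is given; it can be overridden at JVM start, e.g.
            //   java -Djava.io.tmpdir=/data/solr-tmp -jar start.jar   (example path)
            System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
            java.io.File tmp = java.io.File.createTempFile("upload_", ".tmp");
            System.out.println("temp files currently land in: " + tmp.getParent());
            tmp.delete();
        }
    }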
How to change tmp directory
Hello all,

I came across an odd issue today when I wanted to add ca. 7M documents to my Solr index: I got a SolrServerException telling me "No space left on device". I had a look at the directory Solr (and its index) is installed in and there is plenty of space (~300GB). I then noticed that a file named upload_457ee97b_1385125274b__8000_0005.tmp had taken up all the space of the machine's /tmp directory. The partition holding the /tmp directory only has around 1GB of space and this file already took nearly 800MB. I had a look at it and realized that the file contained the data I was adding to Solr, in an XML format.

Is there a way to change the temporary directory for this action? I use an Iterator<SolrInputDocument> with the HttpSolrServer's add(Iterator) method for performance, so I can't just do commits from time to time.

Best regards, Erik
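
For context, a minimal sketch of the streaming add pattern described above (SolrJ 3.x; the server URL and field name are placeholders, not from the original setup):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class StreamingAdd {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 3; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                docs.add(doc);
            }
            // add(Iterator) streams all documents in a single request, which is why
            // no intermediate commits are possible during the upload
            server.add(docs.iterator());
            server.commit();
            server.shutdown();
        }
    }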
Stats Component and solrj
Hey all,

I'd like to know how many terms I have in a particular field for a search. In other words, I want to know how many facets I have in that field. I use string fields; there are no numbers. I wanted to use the Stats Component and use its count value. When trying this out in the browser, everything works as expected. However, when I want to do the same thing in my Java web app, I get an error, because FieldStatsInfo.class contains

    min = (Double)entry.getValue();

where 'entry.getValue()' is a String because I have a string field here. Thus, I get an error that String cannot be cast to Double. In the browser I just got a String returned here, probably according to lexicographical order. I switched the Stats Component on with

    query.setGetFieldStatistics(authors);

where 'authors' is a field with author names. Is it possible that solrj does not yet work with the Stats Component on string fields? I tried Solr 3.5 and 3.6 without success. Is there another easy way to get the count I want? Will solrj be fixed? Or am I just making a mistake?

Best regards, Erik
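
One possible workaround sketch, untested and an assumption rather than an official fix: request the raw response via QueryRequest so SolrJ never builds the FieldStatsInfo objects where the String-to-Double cast fails, and read the count straight from the NamedList (sketch uses the 3.6 HttpSolrServer; earlier releases use CommonsHttpSolrServer):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.util.NamedList;

    public class AuthorStatsCount {
        @SuppressWarnings("unchecked")
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(0);
            query.setGetFieldStatistics("authors");

            // server.request(...) returns the raw NamedList without parsing stats
            NamedList<Object> raw = server.request(new QueryRequest(query));
            NamedList<Object> stats = (NamedList<Object>) raw.get("stats");
            NamedList<Object> fields = (NamedList<Object>) stats.get("stats_fields");
            NamedList<Object> authors = (NamedList<Object>) fields.get("authors");
            System.out.println("count = " + authors.get("count"));
        }
    }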
Re: Field length and scoring
Ahh, that's it - I thought of something like that but couldn't find proper confirmation via Google. Thank you both for your answers. I guess I will just sort by value length myself.

Only one thing: Erick said my examples would both be one token long. But I rather think they are both one value long but three and four tokens long, as the NGramAnalyzer splits the values into smaller tokens. And as can be seen from the link given by Ahmet, field lengths of three and four are not distinguished - which is where the reason for my observation lies.

Thanks again and best regards, Erik

On 24.03.2012, at 00:02, Ahmet Arslan iori...@yahoo.com wrote:

Also, the field length is encoded in a byte (as I remember). So it's quite possible that, even if the lengths of these fields were 3 and 4 instead of both being 1, the value stored for the length norms would be the same number.

Exactly. http://search-lucene.com/m/uGKRu1pvRjw
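
For illustration, a small sketch of the single-byte norm encoding referred to above. Lucene's default Similarity computes the length norm as 1/sqrt(numTerms) and stores it through SmallFloat's compact byte format, so running this shows which field lengths end up with the same stored value (a sketch of the default behaviour only; a custom Similarity would differ):

    import org.apache.lucene.util.SmallFloat;

    public class NormEncodingDemo {
        public static void main(String[] args) {
            for (int numTerms : new int[] {1, 3, 4}) {
                float norm = (float) (1.0 / Math.sqrt(numTerms)); // default length norm
                byte encoded = SmallFloat.floatToByte315(norm);   // lossy single-byte encoding
                float stored = SmallFloat.byte315ToFloat(encoded);
                System.out.println(numTerms + " terms: raw norm=" + norm + ", stored norm=" + stored);
            }
        }
    }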
Field length and scoring
Hello there,

I have a quite basic question, but my Solr is behaving in a way I'm not quite sure about. The setup is simple: I have a field suggestionText in which single strings are indexed. Schema:

    <field name="suggestionText" type="prefixNGram" indexed="true" stored="true"/>

Since I want this field to serve a suggestion search, the input string is analyzed by an EdgeNGramFilter. Let's look at two cases:

case 1: the input string was 'il2'
case 2: the input string was 'il24'

As I can see from the Solr admin analysis page, case 1 is analyzed as "i il il2" and case 2 as "i il il2 il24", as you would expect. The point now is: when I search for 'il2', I would expect case 1 to have a higher score than case 2. I thought so because I did not omit norms, and thus I assumed the shorter field would get a (slightly) higher score. However, the scores in both cases are identical, and so it happens that 'il24' is suggested before 'il2'. Perhaps I have misunderstood norms or the notion of field length. I would be grateful if you could help me out here and give me advice on how to achieve the desired behavior.

Thanks and best regards, Erik
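
For illustration, a small Lucene 3.x sketch that reproduces the analysis described above (it assumes the contrib analyzers jar is on the classpath; package names moved in later Lucene versions, and the min/max gram values are guesses, not the actual prefixNGram configuration):

    import java.io.StringReader;
    import org.apache.lucene.analysis.KeywordTokenizer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class EdgeNGramDemo {
        public static void main(String[] args) throws Exception {
            for (String input : new String[] {"il2", "il24"}) {
                TokenStream ts = new EdgeNGramTokenFilter(
                        new KeywordTokenizer(new StringReader(input)),
                        EdgeNGramTokenFilter.Side.FRONT, 1, 10);
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                StringBuilder tokens = new StringBuilder(input + " ->");
                while (ts.incrementToken()) {
                    tokens.append(' ').append(term.toString());
                }
                ts.end();
                ts.close();
                System.out.println(tokens); // il2 -> i il il2 ; il24 -> i il il2 il24
            }
        }
    }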
Incomplete date expressions
Hi all,

I want to index MEDLINE documents which do not always contain complete dates of publication. The year is always known. Now the Solr documentation states that dates must have the format 1995-12-31T23:59:59Z, for which the month, day and even the time of day must be known. I could, of course, just complete incomplete dates with default values, 01-01 for example. But then I won't be able to distinguish between complete and incomplete dates afterwards, which is important when displaying the documents. I could just store the known information, e.g. the year, in an integer-typed field, but then I won't have date math. Is there a good solution to my problem? Probably I'm just missing the obvious, perhaps you can help me :-)

Best regards, Erik
Re: Incomplete date expressions
Hello François,

thank you for your quick reply. I thought about just storing which information I am lacking, and this would be a possibility of course. It just seemed a bit quick-and-dirty to me, and I wondered whether Solr really cannot understand dates which only consist of the year. Isn't it a common case that a date/time expression is not determined down to the hour, for example? But if there is no other possibility I will stick with your suggestion, thank you!

Best, Erik

On 29.10.2011 at 15:20, François Schiettecatte wrote:

Erik

I would complement the date with default values as you suggest and store a boolean flag indicating whether the date was complete or not, or store the original date if it is not complete, which would probably be better because the presence of that data would tell you that the original date was not complete and you would also still have it.

Cheers François

On Oct 29, 2011, at 9:12 AM, Erik Fäßler wrote:

Hi all, I want to index MEDLINE documents which do not always contain complete dates of publication. The year is always known. Now the Solr documentation states that dates must have the format 1995-12-31T23:59:59Z, for which the month, day and even the time of day must be known. I could, of course, just complete incomplete dates with default values, 01-01 for example. But then I won't be able to distinguish between complete and incomplete dates afterwards, which is important when displaying the documents. I could just store the known information, e.g. the year, in an integer-typed field, but then I won't have date math. Is there a good solution to my problem? Probably I'm just missing the obvious, perhaps you can help me :-)

Best regards, Erik
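
A minimal SolrJ-style sketch of François' suggestion: pad the missing parts for the indexed date field and keep the original, possibly partial value in a second field (the 'date_raw' field name is made up for illustration; 'pmid' and 'date' follow the schema discussed elsewhere in this thread):

    import org.apache.solr.common.SolrInputDocument;

    public class PartialDateExample {
        public static void main(String[] args) {
            String rawDate = "1995";  // only the year is known for this document
            // pad to a full ISO date so date math and range queries still work
            String padded = rawDate.length() == 4 ? rawDate + "-01-01T00:00:00Z" : rawDate;

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("pmid", "12345");
            doc.addField("date", padded);      // complete date for searching/sorting
            doc.addField("date_raw", rawDate); // original value, so display code can tell it was partial
        }
    }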
Faceting: Some questions concerning method:fc
Hey all!

I have a few questions concerning the field cache method for faceting. The wiki says for the enum method: "This was the default (and only) method for faceting multi-valued fields prior to Solr 1.4." And for the fc method: "This was the default method for single valued fields prior to Solr 1.4." I just ran into a problem using fc for a field which can have multiple terms per document. The facet counts were wrong, seemingly only counting the first term in the field of each document. I observed this in Solr 1.4.1 and in 3.1 with the same index.

Question 1: The quotes above say "prior to Solr 1.4". Has this changed? Is there another method for multi-valued faceting since Solr 1.4?

Question 2: Another observation is very weird: when faceting on another field, namely the text field holding a large variety of terms and especially a lot of different terms in one single field, the fc method seems to count everything correctly. In fact, the results of fc and enum don't seem to differ. The field in which the fc and enum faceting results do differ consists of a lot of terms which all have start and end offsets of 0, 0 and a position increment of 1. Could this be a problem?

Best regards, Erik
Re: Faceting: Some questions concerning method:fc
On 19.05.2011 16:07, Yonik Seeley wrote:

On Thu, May 19, 2011 at 9:56 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:

I have a few questions concerning the field cache method for faceting. The wiki says for the enum method: "This was the default (and only) method for faceting multi-valued fields prior to Solr 1.4." And for the fc method: "This was the default method for single valued fields prior to Solr 1.4." I just ran into a problem using fc for a field which can have multiple terms per document. The facet counts were wrong, seemingly only counting the first term in the field of each document. I observed this in Solr 1.4.1 and in 3.1 with the same index.

That doesn't sound right... the results should always be identical between facet.method=fc and facet.method=enum. Are you sure you didn't index a multi-valued field and then change the fieldType in the schema to be single valued? Are you sure the field is indexed the way you think it is? If so, is there an easy way for someone to reproduce what you are seeing?

-Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco

Thanks a lot for your help: changing the field type to multiValued did the trick. The point is, I built the index using Lucene directly (I need to for some special manipulation of offsets and position increments). So my question is what requirements a Lucene field has to fulfill so that Solr's faceting works correctly. A particular question: in Lucene terms, what exactly is denoted by a multiValued field? I thought it would result in multiple Lucene Field instances with the same name for a single document. But I think my field has only one instance per document (though I could check that).

Thanks again for your quick and helpful answer! Erik
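
To make the "multiple Lucene Field instances with the same name" idea concrete, a minimal Lucene 3.x sketch (the field name is illustrative, and whether this matches the custom indexing code in question is an assumption):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class MultiValuedDocExample {
        public static void main(String[] args) {
            // A multi-valued field at the Lucene level: several Field instances
            // sharing the same name on one Document.
            Document doc = new Document();
            doc.add(new Field("authors", "Smith J", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("authors", "Jones K", Field.Store.YES, Field.Index.NOT_ANALYZED));
            // On the Solr side, the corresponding schema field should be declared
            // multiValued="true" so faceting uses the multi-valued algorithm.
        }
    }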
Re: SOLR 1.4.1 : Indexing DateField time zone problem
Hm - but I observed this too, and I didn't do anything with SQL at all. I was parsing date strings out of XML, creating a string which could be formatted using DIH's DateFormatTransformer. But the indexed dates were a few hours too early in my case, moving the dates back to one day before. I didn't dig deeply into this; I think I was experiencing a conversion of my date strings from my time zone to UTC. My quick solution was to write another version of the DateFormatTransformer which takes a timeZone attribute. This way, the date strings shown in the indexed documents showed the correct date (which was what I wanted). But I guess doing it this way also wasn't the best solution, because when using date range math I ran into other time zone conversion problems, due to my own earlier conversions I think. But I haven't looked deeper into this yet, so I don't know the exact reasons (although I'm sure it's not really a too challenging problem) and I haven't produced a solution yet.

Best regards, Erik

On 25.11.2010 at 18:04, Erick Erickson erickerick...@gmail.com wrote:

I don't believe this is a Solr issue at all. I suspect your MySql query is doing the timezone change. Solr doesn't apply any processing to the date; it doesn't need to because times are all Zulu. There's a little-known debug console for DIH, see: http://wiki.apache.org/solr/DataImportHandler#interactive - that might help a lot. I think what you need to do is apply a transformation in your SQL statement to get times in UTC, something like CONVERT_TZ or some such, see: http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_convert-tz

Best Erick

On Thu, Nov 25, 2010 at 5:27 AM, Shanmugavel SRD srdshanmuga...@gmail.com wrote:

I am using SOLR 1.4.1. My SOLR runs on a server which is in the EST zone. I am trying to index a date field which is in MySQL as '2007-08-08T05:36:50Z', but while indexing it becomes '2007-08-08T09:36:50Z', i.e. 4 hours got added. But I want the date as is while indexing; that means, after indexing I want the value '2007-08-08T05:36:50Z' in the 'modified_d' field. Can anyone help me with this?

    <field column="post_modified" name="modified_d" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />

I searched this forum and there are discussions on this same problem but for SOLR 1.3, that's why I am posting this query again.

-- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-1-4-1-Indexing-DateField-time-zone-problem-tp1966118p1966118.html Sent from the Solr - User mailing list archive at Nabble.com.
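
Erik's custom transformer isn't shown in the thread; the following is only a rough sketch of what a DateFormatTransformer variant with a time zone attribute could look like (the 'timeZone' attribute name and the overall structure are assumptions, not the actual code):

    import java.text.SimpleDateFormat;
    import java.util.Map;
    import java.util.TimeZone;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    public class TimeZoneDateFormatTransformer extends Transformer {
        @Override
        public Object transformRow(Map<String, Object> row, Context context) {
            for (Map<String, String> field : context.getAllEntityFields()) {
                String format = field.get("dateTimeFormat");
                String timeZone = field.get("timeZone"); // custom attribute, not part of stock DIH
                String column = field.get("column");
                if (format == null || timeZone == null) continue;

                Object value = row.get(column);
                if (value == null) continue;
                try {
                    SimpleDateFormat parser = new SimpleDateFormat(format);
                    parser.setTimeZone(TimeZone.getTimeZone(timeZone));
                    row.put(column, parser.parse(value.toString()));
                } catch (Exception e) {
                    // leave the original value untouched if parsing fails
                }
            }
            return row;
        }
    }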
Re: DIH full-import failure, no real error message
Hello Erick,

I guess I'm the one asking for pardon - but surely not you! It seems your first guess could already be the correct one. Disc space IS kind of short and I believe it could have run out; since Solr performs a rollback after the failure, I didn't notice (besides the fact that this is one of our server machines, but apparently the wrong mount point...). I'm not yet absolutely sure of this, but it would explain a lot and it really looks like it. So thank you for this maybe-not-so-obvious hint :)

But you also mentioned the merging strategy. I left everything at the defaults that come with the Solr download concerning these things. Could it be that such a large index needs different treatment? Could you point me to a Wiki page or something where I can get a few tips?

Thanks a lot, I will try building the index on a partition with enough space, perhaps that will already do it.

Best regards, Erik

On 16.11.2010 14:19, Erick Erickson wrote:

Several questions. Pardon me if they're obvious, but I've spent far too much of my life overlooking the obvious...

1. Is it possible you're running out of disk? 40-50G could suck up a lot of disk, especially when merging. You may need that much again free when a merge occurs.
2. Speaking of merging, what are your merge settings? How are you triggering merges? See mergeFactor and associated settings in solrconfig.xml.
3. You might get some insight by removing the Solr indexing part; can you spin through your parsing from beginning to end? That would eliminate your questions about whether your XML parsing is the problem.

40-50G is a large index, but it's certainly within Solr's capability, so you're not hitting any built-in limits. My first guess would be that you're running out of disk, at least that's the first thing I'd check next...

Best Erick

On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:

Hey all, I'm trying to create a Solr index for the 2010 Medline baseline (www.pubmed.gov, over 18 million XML documents). My goal is to be able to retrieve single XML documents by their ID. Each document comes with a unique ID, the PubMedID. So my schema (important portions) looks like this:

    <field name="pmid" type="string" indexed="true" stored="true" required="true" />
    <field name="date" type="tdate" indexed="true" stored="true"/>
    <field name="xml" type="text" indexed="true" stored="true"/>
    <uniqueKey>pmid</uniqueKey>
    <defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML document (mostly below 5kb). I used the DataImporter to do this. I had to write some classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically the error could lie there. What happens is that indexing looks just fine at the beginning. Memory usage stays well below the maximum (max of 20g, usage below 5g, most of the time around 3g). It goes on for several hours in this manner until it suddenly stops. I tried this a few times with minor tweaks, none of which made any difference. The last time such a crash occurred, over 16.5 million documents had already been indexed (argh, so close...). It never stops at the same document, and trying to index the documents where the error occurred just runs fine. Index size on disc was between 40g and 50g the last time I had a look. This is the log from beginning to end: (I decided to just attach the log for the sake of readability ;) ). As you can see, Solr's error message is not quite complete. There are no closing brackets.
The document is cut in half in this message, and not even the error message itself is complete: the 'D' of (D)ataImporter.runCmd(DataImporter.java:389) right after the document text is missing. I have one thought concerning this: I get the input documents as an InputStream which I read buffer-wise (at most 1000 bytes per read() call). I need to deliver the documents in one large byte array to the XML parser I use (VTD XML). But I don't get the individual small XML documents on their own; I always get one larger XML blob with exactly 30,000 of these documents. I use a self-written EntityProcessor to extract the single documents from the larger blob. These blobs have a size of about 50 to 150mb. So what I do is read these large blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>. Afterwards, I create the final byte[] and use System.arraycopy to copy from the ArrayList into the byte[]. I tested this and it looks fine to me. And as I said, indexing the documents where the error occurred just works fine (that is, indexing the whole blob containing the single document). I just mention this because it kind of looks like there is this cut in the document, and the missing 'D' reminds me of char-encoding errors. But I don't know for sure; opening the error log in vi doesn't show any broken characters (the last time I had such problems, vi could identify the characters in question, other editors just wouldn't show them). Further ideas from my side: Is
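
The ArrayList<byte[]> plus System.arraycopy assembly described above can also be written with a ByteArrayOutputStream; a minimal sketch of that equivalent (not the code from the thread):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class StreamToBytes {
        // Reads an InputStream fully into one byte[], using the same 1000-byte
        // read granularity as described above.
        public static byte[] readFully(InputStream in) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[1000];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return buffer.toByteArray();
        }
    }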
Re: DIH full-import failure, no real error message
Yes, I noticed just after sending the message. My apologies!

Best, Erik

On 20.11.2010 at 00:32, Chris Hostetter hossman_luc...@fucit.org wrote:

: Subject: DIH full-import failure, no real error message
: References: aanlktinqsw22n0vj7at3nbx4=ocmdesjq=q0y+rbp...@mail.gmail.com
: In-Reply-To: aanlktinqsw22n0vj7at3nbx4=ocmdesjq=q0y+rbp...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

-Hoss
Re: DIH full-import failure, no real error message
Yes, I knew indexing and storing would pose a heavy load, but I wanted to give it a try. The storing has to be there for the goal I'd like to achieve. We use a UIMA NLP pipeline to process the Medline documents and we already have a Medline XML reader. Everything is fine with all this, except that until now we just stored every single XML document on disc and saved the paths of the exact documents we wanted to process on a particular run in a database. Then our UIMA CollectionReader would retrieve a batch of file paths from the database, read the files and process them. This worked fine and it still will - but importing into the database can take quite a long time because we have to traverse the file system tree for the correct files. We arranged the files so we can find them more easily. But still, extracting all the individual files from the larger XML blobs takes too much time and too many inodes ;)

This is why I'm building a Solr index (nice benefit here: I could implement search) and - as an alternative - storing them in a database for retrieval; I will experiment with both solutions and check which better fulfills my needs. But up to this point it is necessary to retrieve the full documents, otherwise I'd have to re-evaluate and partly rewrite our UIMA pipelines. Perhaps that will be the way to go, but it would be really time consuming and I'd only do it if there were great benefits. It seems David's solution would be ideal for us; perhaps I will have a read on the cloud branch, and HBase in particular. But - as long as Solr can handle storing the whole XML documents - of course I can switch the indexing of the XML off. I may need the whole XML for retrieval, but I can identify particular parts of the XML we'd like to search. These can be extracted easily, so this is a good idea, of course.

Thanks for all your great advice and help, I really appreciate it!

Best, Erik

On 17.11.2010 01:55, Erick Erickson wrote:

They're not mutually exclusive. Part of your index size is because you *store* the full xml, which means that a verbatim copy of the raw data is placed in the index, along with the searchable terms. Including the tags. This only makes sense if you're going to return the original data to the user AND use the index to hold it. Storing has nothing to do with searching (again, pardon me if this is obvious), which can be confusing. I claim you could reduce the size of your index dramatically without losing any search capability by simply NOT storing the XML blob, just indexing it. But that may not be what you need to do; only you know your problem space. Which brings up the question whether it makes sense to index the XML tags, but again that will be defined by your problem space. If you have a well-defined set of input tags, you could consider indexing each of the tags in a unique field, but the query then gets complicated. I've seen more than a few situations where trying to use an RDBMS's search capabilities doesn't work as the database gets larger, and yours qualifies as larger. In particular, RDBMSs don't have very sophisticated search capabilities, and the speed gets pretty bad. That's OK, because Solr doesn't have very good join capabilities; different tools for different problems.

Best Erick

On Tue, Nov 16, 2010 at 12:16 PM, Erik Fäßler erik.faess...@uni-jena.de wrote:

Thank you very much, I will have a read on your links. The full-text red flag is exactly the reason why I'm testing this with Solr.
As was said before by Dennis, I could also use a database as long as I don't need sophisticated query capabilities. To be honest, I don't know the performance gap between a Lucene index and a database in such a case. I guess I will have to test it. This is meant as a substitute for holding every single file on disc. But I need the whole file's information because it's not clear which information will be required in the future. And we don't want to re-index every time we add a new field (not yet, that is ;)).

Best regards, Erik

On 16.11.2010 16:27, Erick Erickson wrote:

The key is that Solr handles merges by copying, and only after the copy is complete does it delete the old index. So you'll need at least 2x your final index size before you start, especially if you optimize... Here's a handy matrix of what you need in your index depending upon what you want to do: http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase Leaving out what you don't use will help by shrinking your index. The thing that jumps out is that you're storing your entire XML document as well as indexing it. Are you expecting to return the document to the user? Storing the entire document is a red flag; you probably don't want to do this. If you need to return the entire document some time, one strategy is to index whatever you need
Re: DIH full-import failure, no real error message
Hi Tommaso,

I'm not sure I saw exactly that, but there was a Solr-UIMA contribution a few months ago and I had a look. I didn't go into details, because our search engine isn't upgraded to Solr yet (but that is to come). But I will keep your link, perhaps it will prove useful to me, thank you!

Best regards, Erik

On 17.11.2010 16:25, Tommaso Teofili wrote:

Hi Erik

2010/11/17 Erik Fäßler erik.faess...@uni-jena.de:

But up to this point it is necessary to retrieve the full documents, otherwise I'd have to re-evaluate and partly rewrite our UIMA pipelines.

Did you see https://issues.apache.org/jira/browse/SOLR-2129 for enhancing docs with UIMA pipelines just before they get indexed in Solr?

Cheers, Tommaso
DIH full-import failure, no real error message
Hey all,

I'm trying to create a Solr index for the 2010 Medline baseline (www.pubmed.gov, over 18 million XML documents). My goal is to be able to retrieve single XML documents by their ID. Each document comes with a unique ID, the PubMedID. So my schema (important portions) looks like this:

    <field name="pmid" type="string" indexed="true" stored="true" required="true" />
    <field name="date" type="tdate" indexed="true" stored="true"/>
    <field name="xml" type="text" indexed="true" stored="true"/>
    <uniqueKey>pmid</uniqueKey>
    <defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML document (mostly below 5kb). I used the DataImporter to do this. I had to write some classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically the error could lie there.

What happens is that indexing looks just fine at the beginning. Memory usage stays well below the maximum (max of 20g, usage below 5g, most of the time around 3g). It goes on for several hours in this manner until it suddenly stops. I tried this a few times with minor tweaks, none of which made any difference. The last time such a crash occurred, over 16.5 million documents had already been indexed (argh, so close...). It never stops at the same document, and trying to index the documents where the error occurred just runs fine. Index size on disc was between 40g and 50g the last time I had a look. This is the log from beginning to end: (I decided to just attach the log for the sake of readability ;) ). As you can see, Solr's error message is not quite complete. There are no closing brackets. The document is cut in half in this message, and not even the error message itself is complete: the 'D' of (D)ataImporter.runCmd(DataImporter.java:389) right after the document text is missing.

I have one thought concerning this: I get the input documents as an InputStream which I read buffer-wise (at most 1000 bytes per read() call). I need to deliver the documents in one large byte array to the XML parser I use (VTD XML). But I don't get the individual small XML documents on their own; I always get one larger XML blob with exactly 30,000 of these documents. I use a self-written EntityProcessor to extract the single documents from the larger blob. These blobs have a size of about 50 to 150mb. So what I do is read these large blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>. Afterwards, I create the final byte[] and use System.arraycopy to copy from the ArrayList into the byte[]. I tested this and it looks fine to me. And as I said, indexing the documents where the error occurred just works fine (that is, indexing the whole blob containing the single document). I just mention this because it kind of looks like there is this cut in the document, and the missing 'D' reminds me of char-encoding errors. But I don't know for sure; opening the error log in vi doesn't show any broken characters (the last time I had such problems, vi could identify the characters in question, other editors just wouldn't show them).

Further ideas from my side: Is the index too big? I think I read something about a large index being around 10 million documents; I aim to approximately double this number. But would this cause such an error? In the end: what exactly IS the error?

Sorry for the wall of text, just trying to describe the problem in as much detail as possible. Thanks a lot for reading and I appreciate any ideas!
:) Best regards, Erik

15.11.2010 11:08:22 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1289465394071
15.11.2010 18:16:06 org.apache.solr.handler.dataimport.SolrWriter upload
WARNUNG: Error creating document : SolrInputDocument[{pmid=pmid(1.0)={8817856}, xml=xml(1.0)={<MedlineCitation Owner="NLM" Status="MEDLINE"> <PMID>8817856</PMID> <DateCreated> <Year>1996</Year> <Month>12</Month> <Day>04</Day> </DateCreated> <DateCompleted> <Year>1996</Year> <Month>12</Month> <Day>04</Day> </DateCompleted> <DateRevised> <Year>2004</Year> <Month>11</Month> <Day>17</Day> </DateRevised> <Article PubModel="Print"> <Journal> <ISSN IssnType="Print">0042-4900</ISSN> <JournalIssue CitedMedium="Print"> <Volume>138</Volume> <Issue>26</Issue> <PubDate> <Year>1996</Year> <Month>Jun</Month> <Day>29</Day> </PubDate> </JournalIssue> <Title>The Veterinary record</Title> <ISOAbbreviation>Vet. Rec.</ISOAbbreviation> </Journal> <ArticleTitle>Restoring confidence in beef: towards a European solution.</ArticleTitle> <Pagination> <MedlinePgn>631-2</MedlinePgn> </Pagination> <Language>eng</Language> <PublicationTypeList> <PublicationType>News</PublicationType> </PublicationTypeList> </Article> <MedlineJournalInfo> <Country>ENGLAND</Country> <MedlineTA>Vet Rec</MedlineTA> <NlmUniqueID>0031164</NlmUniqueID> <ISSNLinking>0042-4900</ISSNLinking> </MedlineJournalInfo> <CitationSubset>IM</CitationSubset> <MeshHeadingList> <MeshHeading> <DescriptorName MajorTopicYN="N">Animals</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN="N">Cattle</DescriptorName> </MeshHeading> <MeshHeading> <DescriptorName MajorTopicYN="N">Commerce</DescriptorName>
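
For reference, a minimal SolrJ sketch of the stated goal - fetching one stored XML document by its PubMed ID - assuming the schema above, Solr 1.4's CommonsHttpSolrServer and a placeholder server URL (illustrative only, not code from the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FetchByPmid {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            QueryResponse rsp = server.query(new SolrQuery("pmid:8817856"));
            if (rsp.getResults().getNumFound() > 0) {
                // the stored "xml" field holds the complete MEDLINE record
                String xml = (String) rsp.getResults().get(0).getFieldValue("xml");
                System.out.println(xml);
            }
        }
    }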
Re: DIH full-import failure, no real error message
Retrieval by ID would only be one possible case; I'm still at the beginning of the project, and I imagine adding more fields for more complicated queries in the future. I imagine a WHERE ... LIKE query over all the XML documents stored in a DBMS wouldn't be too performant ;) And at a later stage I will process all these documents and add lots of metadata - then at the latest I will need a Lucene index rather than a database. So I'd be interested in solution ideas for my issue all the same.

Regards, Erik

On 16.11.2010 11:35, Dennis Gearon wrote:

Wow, if all you want is to retrieve by ID, a database would be fine, even a NO SQL database.

Dennis Gearon

Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.

----- Original Message ----- From: Erik Fäßler erik.faess...@uni-jena.de To: solr-user@lucene.apache.org Sent: Tue, November 16, 2010 12:33:28 AM Subject: DIH full-import failure, no real error message

Hey all, I'm trying to create a Solr index for the 2010 Medline baseline (www.pubmed.gov, over 18 million XML documents). My goal is to be able to retrieve single XML documents by their ID. Each document comes with a unique ID, the PubMedID. So my schema (important portions) looks like this:

    <field name="pmid" type="string" indexed="true" stored="true" required="true" />
    <field name="date" type="tdate" indexed="true" stored="true"/>
    <field name="xml" type="text" indexed="true" stored="true"/>
    <uniqueKey>pmid</uniqueKey>
    <defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML document (mostly below 5kb). I used the DataImporter to do this. I had to write some classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically the error could lie there. What happens is that indexing looks just fine at the beginning. Memory usage stays well below the maximum (max of 20g, usage below 5g, most of the time around 3g). It goes on for several hours in this manner until it suddenly stops. I tried this a few times with minor tweaks, none of which made any difference. The last time such a crash occurred, over 16.5 million documents had already been indexed (argh, so close...). It never stops at the same document, and trying to index the documents where the error occurred just runs fine. Index size on disc was between 40g and 50g the last time I had a look. This is the log from beginning to end: (I decided to just attach the log for the sake of readability ;) ). As you can see, Solr's error message is not quite complete. There are no closing brackets. The document is cut in half in this message, and not even the error message itself is complete: the 'D' of (D)ataImporter.runCmd(DataImporter.java:389) right after the document text is missing. I have one thought concerning this: I get the input documents as an InputStream which I read buffer-wise (at most 1000 bytes per read() call). I need to deliver the documents in one large byte array to the XML parser I use (VTD XML). But I don't get the individual small XML documents on their own; I always get one larger XML blob with exactly 30,000 of these documents. I use a self-written EntityProcessor to extract the single documents from the larger blob. These blobs have a size of about 50 to 150mb. So what I do is read these large blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>.
Afterwards, I create the final byte[] and use System.arraycopy to copy from the ArrayList into the byte[]. I tested this and it looks fine to me. And as I said, indexing the documents where the error occurred just works fine (that is, indexing the whole blob containing the single document). I just mention this because it kind of looks like there is this cut in the document, and the missing 'D' reminds me of char-encoding errors. But I don't know for sure; opening the error log in vi doesn't show any broken characters (the last time I had such problems, vi could identify the characters in question, other editors just wouldn't show them). Further ideas from my side: Is the index too big? I think I read something about a large index being around 10 million documents; I aim to approximately double this number. But would this cause such an error? In the end: what exactly IS the error? Sorry for the wall of text, just trying to describe the problem in as much detail as possible. Thanks a lot for reading and I appreciate any ideas! :)

Best regards, Erik
Re: DIH full-import failure, no real error message
Thank you very much, I will have a read on your links. The full-text red flag is exactly the reason why I'm testing this with Solr. As was said before by Dennis, I could also use a database as long as I don't need sophisticated query capabilities. To be honest, I don't know the performance gap between a Lucene index and a database in such a case. I guess I will have to test it. This is meant as a substitute for holding every single file on disc. But I need the whole file's information because it's not clear which information will be required in the future. And we don't want to re-index every time we add a new field (not yet, that is ;)).

Best regards, Erik

On 16.11.2010 16:27, Erick Erickson wrote:

The key is that Solr handles merges by copying, and only after the copy is complete does it delete the old index. So you'll need at least 2x your final index size before you start, especially if you optimize... Here's a handy matrix of what you need in your index depending upon what you want to do: http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase Leaving out what you don't use will help by shrinking your index. The thing that jumps out is that you're storing your entire XML document as well as indexing it. Are you expecting to return the document to the user? Storing the entire document is a red flag; you probably don't want to do this. If you need to return the entire document some time, one strategy is to index whatever you need to search, and index what you need to fetch the document from an external store. You can index the values of selected tags as fields in your documents. That would also give you far more flexibility when searching.

Best Erick

On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:

Hello Erick, I guess I'm the one asking for pardon - but surely not you! It seems your first guess could already be the correct one. Disc space IS kind of short and I believe it could have run out; since Solr performs a rollback after the failure, I didn't notice (besides the fact that this is one of our server machines, but apparently the wrong mount point...). I'm not yet absolutely sure of this, but it would explain a lot and it really looks like it. So thank you for this maybe-not-so-obvious hint :) But you also mentioned the merging strategy. I left everything at the defaults that come with the Solr download concerning these things. Could it be that such a large index needs different treatment? Could you point me to a Wiki page or something where I can get a few tips? Thanks a lot, I will try building the index on a partition with enough space, perhaps that will already do it. Best regards, Erik

On 16.11.2010 14:19, Erick Erickson wrote:

Several questions. Pardon me if they're obvious, but I've spent far too much of my life overlooking the obvious...

1. Is it possible you're running out of disk? 40-50G could suck up a lot of disk, especially when merging. You may need that much again free when a merge occurs.
2. Speaking of merging, what are your merge settings? How are you triggering merges? See mergeFactor and associated settings in solrconfig.xml.
3. You might get some insight by removing the Solr indexing part; can you spin through your parsing from beginning to end? That would eliminate your questions about whether your XML parsing is the problem.
40-50G is a large index, but it's certainly within Solr's capability, so you're not hitting any built-in limits. My first guess would be that you're running out of disk, at least that's the first thing I'd check next...

Best Erick

On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:

Hey all, I'm trying to create a Solr index for the 2010 Medline baseline (www.pubmed.gov, over 18 million XML documents). My goal is to be able to retrieve single XML documents by their ID. Each document comes with a unique ID, the PubMedID. So my schema (important portions) looks like this:

    <field name="pmid" type="string" indexed="true" stored="true" required="true" />
    <field name="date" type="tdate" indexed="true" stored="true"/>
    <field name="xml" type="text" indexed="true" stored="true"/>
    <uniqueKey>pmid</uniqueKey>
    <defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML document (mostly below 5kb). I used the DataImporter to do this. I had to write some classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically the error could lie there. What happens is that indexing looks just fine at the beginning. Memory usage stays well below the maximum (max of 20g, usage below 5g, most of the time around 3g). It goes on for several hours in this manner until it suddenly stops. I tried this a few times with minor tweaks, none of which made any difference. The last time such a crash occurred, over