Re: How to change tmp directory

2012-07-05 Thread Erik Fäßler
Ah - all right, that's it! Thank you!

Erik

On 04.07.2012 at 17:59, Jack Krupansky wrote:

 Solr is probably simply using Java's temp directory, which you can redefine 
 by setting the java.io.tmpdir system property on the java command line or 
 using a system-specific environment variable.
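 
 For example, if you start the example Jetty, something like this (the path is only an illustration - point it at a partition with enough space):
 
   java -Djava.io.tmpdir=/path/with/more/space -jar start.jar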
 
 -- Jack Krupansky
 
 -Original Message- From: Erik Fäßler
 Sent: Wednesday, July 04, 2012 3:56 AM
 To: solr-user@lucene.apache.org
 Subject: How to change tmp directory
 
 Hello all,
 
 I came across an odd issue today when I wanted to add ca. 7M documents to my 
 Solr index: I got a SolrServerException telling me "No space left on device". 
 I had a look at the directory Solr (and its index) is installed in and there 
 is plenty of space (~300 GB).
 I then noticed that a file named upload_457ee97b_1385125274b__8000_0005.tmp 
 had taken up all the space in the machine's /tmp directory. The partition holding 
 the /tmp directory only has around 1 GB of space, and this file had already taken 
 nearly 800 MB. I had a look at it and realized that it contained the 
 data I was adding to Solr in XML format.
 
 Is there a way to change the temporary directory for this action?
 
 I use an Iterator<SolrInputDocument> with HttpSolrServer's add(Iterator) 
 method for performance, so I can't just commit from time to time.
 
 Best regards,
 
 Erik 



How to change tmp directory

2012-07-04 Thread Erik Fäßler
Hello all,

I came across an odd issue today when I wanted to add ca. 7M documents to my 
Solr index: I got a SolrServerException telling me "No space left on device". I 
had a look at the directory Solr (and its index) is installed in and there is 
plenty of space (~300 GB).
I then noticed that a file named upload_457ee97b_1385125274b__8000_0005.tmp 
had taken up all the space in the machine's /tmp directory. The partition holding 
the /tmp directory only has around 1 GB of space, and this file had already taken 
nearly 800 MB. I had a look at it and realized that it contained the data I was 
adding to Solr in XML format.

Is there a way to change the temporary directory for this action?

I use an Iterator<SolrInputDocument> with HttpSolrServer's add(Iterator) 
method for performance, so I can't just commit from time to time.
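
For reference, a minimal sketch of that calling pattern (the URL and field names here are made up; the real documents come from my own source):

  import java.util.Iterator;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class StreamingAdd {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

      // A lazy iterator: documents are built one at a time while SolrJ streams
      // them out in a single request, so the full set never sits in memory.
      Iterator<SolrInputDocument> docs = new Iterator<SolrInputDocument>() {
        private int i = 0;
        public boolean hasNext() { return i < 1000; }           // demo size only
        public SolrInputDocument next() {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          doc.addField("text", "document number " + i);
          i++;
          return doc;
        }
        public void remove() { throw new UnsupportedOperationException(); }
      };

      server.add(docs);   // one streamed update request
      server.commit();
    }
  }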

Best regards,

Erik

Stats Component and solrj

2012-04-24 Thread Erik Fäßler
Hey all,

I'd like to know how many terms I have in a particular field in a search. In 
other words, I want to know how many facet values I have in that field. I use string 
fields; there are no numbers. I wanted to use the Stats Component and use its 
count value. When trying this out in the browser, everything works as 
expected.
However, when I want to do the same thing in my Java web app, I get an error 
because in FieldStatsInfo.class it says

 min = (Double)entry.getValue();

where 'entry.getValue()' is a String because I have a string field here. Thus 
I get an error that String cannot be cast to Double.
In the browser I just got a String returned here, probably the minimum relative 
to lexicographical order.

I switched the Stats Component on with

query.setGetFieldStatistics("authors");

where 'authors' is a field with author names.
Is it possible that solrj does not yet work with the Stats Component on string 
fields? I tried Solr 3.5 and 3.6 without success. Is there another easy way to 
get the count I want? Will solrj be fixed? Or am I just making a mistake?
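
One possible workaround (a sketch only, assuming SolrJ 3.6's HttpSolrServer; CommonsHttpSolrServer works the same way in 3.5) is to skip the Stats Component and count the distinct values via faceting with facet.limit=-1:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class DistinctTermCount {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

      SolrQuery query = new SolrQuery("*:*");
      query.setRows(0);              // only the facet counts are needed
      query.setFacet(true);
      query.addFacetField("authors");
      query.setFacetLimit(-1);       // return all facet values
      query.setFacetMinCount(1);     // only values that actually occur

      QueryResponse rsp = server.query(query);
      int distinctAuthors = rsp.getFacetField("authors").getValueCount();
      System.out.println("distinct terms in 'authors': " + distinctAuthors);
    }
  }

Note that requesting all facet values can get expensive on a field with very many distinct terms, so this is only practical up to a point.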

Best regards,

Erik

Re: Field length and scoring

2012-03-24 Thread Erik Fäßler
Ahh, that's it - I thought of such a thing but couldn't find a proper 
affirmation with Google.

Thank you both for your answers. I guess I will just sort by value length 
myself.

Only one thing: Erick said my examples would both be one token long. But I 
rather think they are both one value long but three and four tokens long, 
since the NGram analyzer splits the values into smaller tokens. And as can be seen 
from the link given by Ahmet, field lengths of three and four are not 
distinguished - which is where the reason for my observation lies.
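
For anyone curious, a minimal sketch of why that happens, assuming the default similarity's 1/sqrt(numTerms) length norm and the single-byte norm encoding (SmallFloat.floatToByte315):

  import org.apache.lucene.util.SmallFloat;

  public class NormByteDemo {
    public static void main(String[] args) {
      // DefaultSimilarity-style length norms (field boost 1.0)
      float norm3 = (float) (1.0 / Math.sqrt(3));   // 3 tokens -> ~0.577
      float norm4 = (float) (1.0 / Math.sqrt(4));   // 4 tokens -> 0.5

      // Norms are stored in a single byte, which loses precision
      byte b3 = SmallFloat.floatToByte315(norm3);
      byte b4 = SmallFloat.floatToByte315(norm4);

      System.out.println("encoded norm for 3 tokens: " + b3);
      System.out.println("encoded norm for 4 tokens: " + b4);
      // Both come out as the same byte, so 3- and 4-token values score alike.
    }
  }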

Thanks again and best regards,

Erik


On 24.03.2012, at 00:02, Ahmet Arslan iori...@yahoo.com wrote:

 Also, the field length is encoded in a byte (as I remember). So it's
 quite possible that, even if the lengths of these fields were 3 and 4
 instead of both being 1, the value stored for the length norms would be
 the same number.
 
 Exactly. http://search-lucene.com/m/uGKRu1pvRjw
 



Field length and scoring

2012-03-23 Thread Erik Fäßler
Hello there,

I have a quite basic question, but my Solr is behaving in a way I don't quite 
understand.

The setup is simple: I have a field suggestionText in which single strings 
are indexed. Schema:

 <field name="suggestionText" type="prefixNGram" indexed="true" stored="true"/>

Since I want this field to serve a suggestion search, the input string is 
analyzed by an EdgeNGramFilter.

Let's have a look at two cases:

case1: Input string was 'il2'
case2: Input string was 'il24'

As I can see from the Solr-admin-analysis-page, case1 is analysed as

i
il
il2

and case2 as

i
il
il2
il24

As you would expect. The point now is: when I search for 'il2' I would expect 
case1 to have a higher score than case2. I thought so because I did not 
omit norms, and thus the shorter field should get a (slightly) higher 
score. However, the scores in both cases are identical, and so it happens that 
'il24' is suggested before 'il2'.

Perhaps I misunderstood the norms or the notion of field length. I 
would be grateful if you could help me out here and advise me on how to 
accomplish the desired behavior.

Thanks and best regards,

Erik

Incomplete date expressions

2011-10-29 Thread Erik Fäßler
Hi all,

I want to index MEDLINE documents which do not always contain complete dates of 
publication. The year is always known. Now the Solr documentation states that dates 
must have the format 1995-12-31T23:59:59Z, for which month, day and even the 
time of day must be known.
I could, of course, just complete partial dates with default values, 01-01 
for example. But then I won't be able to distinguish between complete and 
incomplete dates afterwards, which is important when displaying the 
documents.

I could just store the known information, e.g. the year, in an integer-typed 
field, but then I won't have date math.

Is there a good solution to my problem? Probably I'm just missing the obvious, 
perhaps you can help me :-)

Best regards,

Erik

Re: Incomplete date expressions

2011-10-29 Thread Erik Fäßler
Hello François,

thank you for your quick reply. I thought about just storing which information 
I am lacking, and this would be a possibility, of course. It just seemed a bit 
quick-and-dirty to me, and I wondered whether Solr really cannot understand 
dates which consist only of the year. Isn't it a common case that a date/time 
expression is not specified down to the hour, for example? But if there is no other 
possibility I will stick with your suggestion, thank you!
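
Something along these lines, for example (just a sketch; the dateComplete and origDate field names are made up and would have to be added to the schema):

  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import java.util.Date;
  import java.util.TimeZone;

  import org.apache.solr.common.SolrInputDocument;

  public class PartialDates {
    // rawDate may be "yyyy", "yyyy-MM" or "yyyy-MM-dd"
    public static SolrInputDocument toDoc(String id, String rawDate) throws ParseException {
      boolean complete = rawDate.length() == 10;
      String padded = rawDate;
      if (rawDate.length() == 4) padded = rawDate + "-01-01";   // year only
      else if (rawDate.length() == 7) padded = rawDate + "-01"; // year and month

      SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
      df.setTimeZone(TimeZone.getTimeZone("UTC"));
      Date date = df.parse(padded);

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", id);
      doc.addField("date", date);             // padded date -> date math still works
      doc.addField("dateComplete", complete); // flag: was the original date complete?
      doc.addField("origDate", rawDate);      // keep the original string as well
      return doc;
    }

    public static void main(String[] args) throws ParseException {
      System.out.println(toDoc("pmid-1", "1996"));       // year only
      System.out.println(toDoc("pmid-2", "1996-06-29")); // complete date
    }
  }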

Best,

Erik

On 29.10.2011 at 15:20, François Schiettecatte wrote:

 Erik
 
 I would complement the date with default values as you suggest and store a 
 boolean flag indicating whether the date was complete or not, or store the 
 original date if it is not complete, which would probably be better because 
 the presence of that data would tell you that the original date was not 
 complete and you would still have it.
 
 Cheers
 
 François
 
 On Oct 29, 2011, at 9:12 AM, Erik Fäßler wrote:
 
 Hi all,
 
 I want to index MEDLINE documents which do not always contain complete dates of 
 publication. The year is always known. Now the Solr documentation states 
 that dates must have the format 1995-12-31T23:59:59Z, for which month, day and 
 even the time of day must be known.
 I could, of course, just complete partial dates with default values, 
 01-01 for example. But then I won't be able to distinguish between complete 
 and incomplete dates afterwards, which is important when displaying the 
 documents.
 
 I could just store the known information, e.g. the year, into an 
 integer-typed field, but then I won't have date math.
 
 Is there a good solution to my problem? Probably I'm just missing the 
 obvious, perhaps you can help me :-)
 
 Best regards,
 
  Erik
 



Facetting: Some questions concerning method:fc

2011-05-19 Thread Erik Fäßler

 Hey all!

I have a few questions concerning the field cache method for faceting.
The wiki says for the enum method: "This was the default (and only) method 
for faceting multi-valued fields prior to Solr 1.4." And for the fc 
method: "This was the default method for single valued fields prior to 
Solr 1.4."
I just ran into a problem using fc for a field which can have 
multiple terms per document. The facet counts were wrong, seemingly 
only counting the first term in the field of each document. I observed 
this in Solr 1.4.1 and in 3.1 with the same index.


Question 1: The quotes above say "prior to Solr 1.4". Has this changed? 
Is there another method for multi-valued faceting since Solr 1.4?
Question 2: Another observation is very weird: when faceting on another 
field, namely the text field holding a large variety of terms and 
especially a lot of different terms in one single field, the fc method 
seems to count everything correctly. In fact, the results between fc and 
enum don't seem to differ. The field in which the fc and enum faceting 
results do differ consists of a lot of terms which all have start and end 
offsets of 0, 0 and a position increment of 1. Could this be a problem?
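
For completeness, roughly how the two methods can be compared from SolrJ (a sketch; the URL and field name are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class FacetMethodCompare {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      for (String method : new String[] { "fc", "enum" }) {
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        query.setFacet(true);
        query.addFacetField("myMultiValuedField");  // placeholder field name
        query.setFacetLimit(20);
        query.set("facet.method", method);          // switch between fc and enum

        QueryResponse rsp = server.query(query);
        System.out.println("facet.method=" + method + ": "
            + rsp.getFacetField("myMultiValuedField").getValues());
      }
    }
  }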


Best regards,

Erik


Re: Facetting: Some questions concerning method:fc

2011-05-19 Thread Erik Fäßler

 On 19.05.2011 at 16:07, Yonik Seeley wrote:

On Thu, May 19, 2011 at 9:56 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:

I have a few questions concerning the field cache method for faceting.
The wiki says for the enum method: "This was the default (and only) method for
faceting multi-valued fields prior to Solr 1.4." And for the fc method: "This
was the default method for single valued fields prior to Solr 1.4."
I just ran into a problem using fc for a field which can have multiple
terms per document. The facet counts were wrong, seemingly only
counting the first term in the field of each document. I observed this in
Solr 1.4.1 and in 3.1 with the same index.

That doesn't sound right... the results should always be identical
between facet.method=fc and facet.method=enum. Are you sure you didn't
index a multi-valued field and then change the fieldType in the schema
to be single valued? Are you sure the field is indexed the way you
think it is?  If so, is there an easy way for someone to reproduce
what you are seeing?

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco
Thanks a lot for your help: changing the field type to multiValued did 
the trick. The point is, I built the index using Lucene directly (I need 
to, for some special manipulation of offsets and position increments). So 
my question is which requirements a Lucene field has to fulfill so that 
Solr's faceting works correctly.
A particular question: in Lucene terms, what exactly is denoted by a 
multiValued field? I thought it would result in multiple Lucene 
Field instances with the same name for a single document. But I think my 
field has only one instance per document (though I could check that).
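
For reference, a minimal Lucene 3.1 sketch of what multiple values usually look like on the Lucene side - several Field instances with the same name added to one Document:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class MultiValuedDemo {
    public static void main(String[] args) throws Exception {
      RAMDirectory dir = new RAMDirectory();
      IndexWriterConfig cfg =
          new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
      IndexWriter writer = new IndexWriter(dir, cfg);

      // One document, several values for the same field name:
      Document doc = new Document();
      doc.add(new Field("pmid", "12345", Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("author", "Smith J", Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("author", "Jones K", Field.Store.YES, Field.Index.NOT_ANALYZED));
      writer.addDocument(doc);

      writer.close();
      dir.close();
    }
  }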


Thanks again for your quick and helpful answer!

Erik


Re: SOLR 1.4.1 : Indexing DateField time zone problem

2010-11-25 Thread Erik Fäßler
Hm - but I observed this, too. And I didn't do anything with SQL at all. I was 
parsing date strings out of XML, creating a string which could be formatted 
using DIH's DateFormatTransformer. But the indexed dates were a few hours 
too early in my case, shifting the dates back to the previous day. I didn't go 
deeply into this; I think I was experiencing a conversion of my date strings 
from my time zone into UTC. My quick solution was to write another version of 
the DateFormatTransformer which takes a timeZone attribute. This way, the date 
strings shown in the indexed documents showed the correct date (which was what 
I wanted).
But I guess doing it this way also wasn't the best solution, because when using 
date range math I ran into other time zone conversion problems, due to my own 
conversions earlier, I think.
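
The core of that conversion is essentially the following (a sketch only, not the actual transformer code; the input pattern and time zone are just examples):

  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import java.util.Date;
  import java.util.TimeZone;

  public class TimeZoneConvert {
    public static void main(String[] args) throws ParseException {
      // Parse the incoming string in its own time zone...
      SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
      in.setTimeZone(TimeZone.getTimeZone("Europe/Berlin"));
      Date date = in.parse("2010-11-25 18:04:00");

      // ...and render it in UTC, the form Solr expects for date fields.
      SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
      out.setTimeZone(TimeZone.getTimeZone("UTC"));
      System.out.println(out.format(date));   // prints 2010-11-25T17:04:00Z
    }
  }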

But until now I haven't gone deeper into this, so I don't know the exact reasons 
(although I'm sure it's not really too challenging a problem) and I haven't 
worked out a solution yet.

Best regards,

Erik


 On 25.11.2010 at 18:04, Erick Erickson erickerick...@gmail.com wrote:

 I don't believe this is a Solr issue at all. I suspect your MySql query is
 doing the
 timezone change. Solr doesn't apply any processing to the date, it doesn't
 need to because times are all Zulu.
 
 There's a little known debug console for DIH, see:
 http://wiki.apache.org/solr/DataImportHandler#interactive
 That might help a lot. I think what you need to do is apply a transformation in your
 SQL statement to get times in UTC, something like CONVERT_TZ or some such,
 see:
 http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_convert-tz
 
 Best
 Erick
 
 On Thu, Nov 25, 2010 at 5:27 AM, Shanmugavel SRD
 srdshanmuga...@gmail.comwrote:
 
 
 I am using SOLR 1.4.1. My SOLR runs on a server which is in the EST zone.
 I am trying to index a date field which is in MySQL as
 '2007-08-08T05:36:50Z', but while indexing it becomes '2007-08-08T09:36:50Z',
 where 4 hours got added. But I want the date as is while indexing;
 that means, after indexing I want the value '2007-08-08T05:36:50Z' in the
 'modified_d' field.
 
 Can anyone help me on this?
 
 <field column="post_modified" name="modified_d"
 dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
 
 I searched in this forum and there are discussions on this same problem but
 on SOLR 1.3, that's why I am posting this query again.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SOLR-1-4-1-Indexing-DateField-time-zone-problem-tp1966118p1966118.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


Re: DIH full-import failure, no real error message

2010-11-19 Thread Erik Fäßler

 Hello Erick,

I guess I'm the one asking for pardon - but surely not you! It seems 
your first guess could already be the correct one. Disc space IS kind 
of short and I believe it could have run out; since Solr performs a 
rollback after the failure, I didn't notice (besides the fact that this 
is one of our server machines, but apparently the wrong mount point...).


I'm not yet absolutely sure of this, but it would explain a lot and it 
really looks like it. So thank you for this maybe not so obvious hint :)


But you also mentioned the merging strategy. I left everything at the 
defaults that come with the Solr download concerning these things.
Could it be that such a large index needs different treatment? Could you 
point me to a wiki page or something where I can get a few tips?


Thanks a lot, I will try building the index on a partition with enough 
space, perhaps that will already do it.


Best regards,

Erik

On 16.11.2010 at 14:19, Erick Erickson wrote:

Several questions. Pardon me if they're obvious, but I've spent far
too much of my life overlooking the obvious...

1> Is it possible you're running out of disk? 40-50G could suck up
a lot of disk, especially when merging. You may need that much again
free when a merge occurs.
2> Speaking of merging, what are your merge settings? How are you
triggering merges? See mergeFactor and associated settings in solrconfig.xml.
3> You might get some insight by removing the Solr indexing part: can
you spin through your parsing from beginning to end? That would
eliminate your questions about whether your XML parsing is the
problem.


40-50G is a large index, but it's certainly within Solr's capability,
so you're not hitting any built-in limits.

My first guess would be that you're running out of disk, at least
that's the first thing I'd check next...

Best
Erick

On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:


  Hey all,

I'm trying to create a Solr index for the 2010 Medline-baseline (
www.pubmed.gov, over 18 million XML documents). My goal is to be able to
retrieve single XML documents by their ID. Each document comes with a unique
ID, the PubMedID. So my schema (important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true"
required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML
document (mostly below 5 kB). I used the DataImporter to do this. I had to
write some classes (DataSource, EntityProcessor, DateFormatter) myself, so
theoretically, the error could lie there.

What happens is that indexing just looks fine at the beginning. Memory
usage is quite below the maximum (max of 20g, usage of below 5g, most of the
time around 3g). It goes several hours in this manner until it suddenly
stops. I tried this a few times with minor tweaks, none of which made any
difference. The last time such a crash occurred, over 16.5 million documents
already had been indexed (argh, so close...). It never stops at the same
document, and trying to index the documents where the error occurred just
runs fine. Index size on disc was between 40g and 50g the last time I had a
look.

This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no
closing brackets. The document is cut in half in this message, and not even
the error message itself is complete: The 'D' of
(D)ataImporter.runCmd(DataImporter.java:389) right after the document text
is missing.

I have one thought concerning this: I get the input documents as an
InputStream which I read buffer-wise (at most 1000 bytes per read() call). I
need to deliver the documents in one large byte array to the XML parser I
use (VTD XML).
But I don't only get the individual small XML documents but always one
larger XML blob with exactly 30,000 of these documents. I use a self-written
EntityProcessor to extract the single documents from the larger blob. These
blobs have a size of about 50 to 150 MB. So what I do is read these large
blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>.
Afterwards, I create the final byte[] and do System.arraycopy from the
ArrayList into the byte[].
I tested this and it looks fine to me. And as I said, indexing the
documents where the error occurred just works fine (that is, indexing the
whole blob containing the single document). I just mention this because it
kind of looks like there is this cut in the document and the missing 'D'
reminds me of char-encoding errors. But I don't know for sure; opening the
error log in vi doesn't show any broken characters (the last time I had such
problems, vi could identify the characters in question, other editors just
wouldn't show them).

Further ideas from my side: Is 

Re: DIH full-import failure, no real error message

2010-11-19 Thread Erik Fäßler
Yes, I noticed just after sending the message.
My apologies!

Best,

Erik

On 20.11.2010 at 00:32, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 : Subject: DIH full-import failure, no real error message
 : References: aanlktinqsw22n0vj7at3nbx4=ocmdesjq=q0y+rbp...@mail.gmail.com
 : In-Reply-To: aanlktinqsw22n0vj7at3nbx4=ocmdesjq=q0y+rbp...@mail.gmail.com
 
 http://people.apache.org/~hossman/#threadhijack
 Thread Hijacking on Mailing Lists
 
 When starting a new discussion on a mailing list, please do not reply to 
 an existing message, instead start a fresh email.  Even if you change the 
 subject line of your email, other mail headers still track which thread 
 you replied to and your question is hidden in that thread and gets less 
 attention.   It makes following discussions in the mailing list archives 
 particularly difficult.
 
 
 -Hoss


Re: DIH full-import failure, no real error message

2010-11-17 Thread Erik Fäßler
 Yes, I knew indexing and storing would pose a heavy load, but I wanted to 
give it a try. The storing is necessary for the goal I'd like to achieve. 
We use a UIMA NLP pipeline to process the Medline documents and we 
already have a Medline XML reader. Everything's fine with all this, 
except that until now we just stored every single XML document on disc and 
saved all the paths of the exact documents we wanted to process on a 
particular run in a database. Then our UIMA CollectionReader would 
retrieve a batch of file paths from the database, read the files and 
process them.
This worked fine and it still will - but importing into the database can 
take quite a long time because we have to traverse the file system tree 
for the correct files. We arranged the files so we can find them more 
easily. But still, extracting all the individual files from the larger 
XML blobs takes too much time and inodes ;)
This is why I'm doing a Solr index (nice benefit here: I could implement 
search) and - as an alternative - store them in a database for 
retrieval; I will experiment with both solutions and check which 
better fulfills my needs. But up to this point it is necessary to 
retrieve the full documents, otherwise I'd have to re-evaluate and 
partly rewrite our UIMA pipelines. Perhaps this will be the way to go, 
but it would be really time-consuming and I'd only do it if there 
are great benefits.


It seems David's solution would be ideal for us; perhaps I will read up 
on the cloud branch, and HBase in particular.


But - as long as Solr can take the effort of storing the whole XML 
documents - I can of course switch off the indexing of the XML. I may 
need the whole XML for retrieval, but I can identify particular parts of 
the XML we'd like to search. These can be extracted easily, so this is a 
good idea, of course.


Thanks for all your great advice and help, I really appreciate it!

Best,

Erik


On 17.11.2010 at 01:55, Erick Erickson wrote:

They're not mutually exclusive. Part of your index size is because you
*store*
the full xml, which means that a verbatim copy of the raw data is placed in
the
index, along with the searchable terms. Including the tags. This only makes
sense if you're going to return the original data to the user AND use the
index
to hold it.

Storing has nothing to do with searching (again, pardon me if this is
obvious),
which can be confusing. I claim you could reduce the size of your index
dramatically without losing any search capability by simply NOT storing
the XML blob, just indexing it. But that may not be what you need to do,
only you know your problem space.

Which brings up the question whether it makes sense to index the
XML tags, but again that will be defined by your problem space. If
you have a well-defined set of input tags, you could consider indexing
each of the tags in a unique field, but the query then gets complicated.

I've seen more than a few situations where trying to use an RDBMS's
search capabilities doesn't work as the database gets larger, and
yours qualifies as larger. In particular, RDBMSs don't have very
sophisticated search capabilities, and the speed gets pretty bad.
That's OK, because Solr doesn't have very good join capabilities,
different tools for different problems.

Best
Erick

On Tue, Nov 16, 2010 at 12:16 PM, Erik Fäßler erik.faess...@uni-jena.de wrote:


  Thank you very much, I will have a read on your links.

The full-text red flag is exactly the reason why I'm testing this with Solr.
As was said before by Dennis, I could also use a database as long as I don't
need sophisticated query capabilities. To be honest, I don't know the
performance gap between a Lucene index and a database in such a case. I
guess I will have to test it.
This is intended as a substitute for holding every single file on disc.
But I need the whole file information because it's not clear which
information will be required in the future. And we don't want to re-index
every time we add a new field (not yet, that is ;)).

Best regards,

Erik

On 16.11.2010 at 16:27, Erick Erickson wrote:


The key is that Solr handles merges by copying, and only after
the copy is complete does it delete the old index. So you'll need
at least 2x your final index size before you start, especially if you
optimize...

Here's a handy matrix of what you need in your index depending
upon what you want to do:

http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase

Leaving out what you don't use will help by shrinking your index.


The thing that jumps out is that you're storing your entire XML document
as well as indexing it. Are you expecting to return the document
to the user? Storing the entire document is a red flag; you
probably don't want to do this. If you need to return the entire
document some time, one strategy is to index whatever you need

Re: DIH full-import failure, no real error message

2010-11-17 Thread Erik Fäßler

 Hi Tommaso,

I'm not sure I saw exactly that, but there was a Solr-UIMA contribution a 
few months ago and I had a look. I didn't go into details, because our 
search engine isn't upgraded to Solr yet (but that is to come). But I will 
keep your link, perhaps it will prove useful to me, thank you!


Best regards,

Erik

On 17.11.2010 at 16:25, Tommaso Teofili wrote:

Hi Erik

2010/11/17 Erik Fäßler erik.faess...@uni-jena.de


... But up to this point it is necessary to retrieve the full documents,
otherwise I'd have to re-evaluate and partly rewrite our UIMA pipelines.


Did you see https://issues.apache.org/jira/browse/SOLR-2129 for enhancing
docs with UIMA pipelines just before they get indexed in Solr?
Cheers,
Tommaso





DIH full-import failure, no real error message

2010-11-16 Thread Erik Fäßler

 Hey all,

I'm trying to create a Solr index for the 2010 Medline-baseline 
(www.pubmed.gov, over 18 million XML documents). My goal is to be able 
to retrieve single XML documents by their ID. Each document comes with a 
unique ID, the PubMedID. So my schema (important portions) looks like this:


<field name="pmid" type="string" indexed="true" stored="true" 
required="true" />

<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML 
document (mostly below 5 kB). I used the DataImporter to do this. I had 
to write some classes (DataSource, EntityProcessor, DateFormatter) 
myself, so theoretically, the error could lie there.


What happens is that indexing just looks fine at the beginning. Memory 
usage is quite below the maximum (max of 20g, usage of below 5g, most of 
the time around 3g). It goes several hours in this manner until it 
suddenly stops. I tried this a few times with minor tweaks, none of 
which made any difference. The last time such a crash occurred, over 
16.5 million documents already had been indexed (argh, so close...). It 
never stops at the same document, and trying to index the documents 
where the error occurred just runs fine. Index size on disc was between 
40g and 50g the last time I had a look.


This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no 
closing brackets. The document is cut in half in this message, and not 
even the error message itself is complete: The 'D' of 
(D)ataImporter.runCmd(DataImporter.java:389) right after the document 
text is missing.


I have one thought concerning this: I get the input documents as an 
InputStream which I read buffer-wise (at most 1000 bytes per read() 
call). I need to deliver the documents in one large byte array to the 
XML parser I use (VTD XML).
But I don't only get the individual small XML documents but always one 
larger XML blob with exactly 30,000 of these documents. I use a 
self-written EntityProcessor to extract the single documents from the 
larger blob. These blobs have a size of about 50 to 150 MB. So what I do 
is read these large blobs in 1000-byte steps and store each byte 
array in an ArrayList<byte[]>. Afterwards, I create the final byte[] 
and do System.arraycopy from the ArrayList into the byte[].
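
As an aside, the same collecting step can also be written with a plain ByteArrayOutputStream; a minimal sketch:

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.InputStream;

  public class StreamToBytes {
    // Reads the whole stream into one byte[] in 1000-byte steps.
    public static byte[] readFully(InputStream in) throws IOException {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[1000];
      int read;
      while ((read = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, read);
      }
      return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
      byte[] data = readFully(new ByteArrayInputStream("demo".getBytes("UTF-8")));
      System.out.println(data.length + " bytes read");
    }
  }
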
I tested this and it looks fine to me. And as I said, indexing the 
documents where the error occurred just works fine (that is, indexing 
the whole blob containing the single document). I just mention this 
because it kind of looks like there is this cut in the document and the 
missing 'D' reminds me of char-encoding errors. But I don't know for 
sure; opening the error log in vi doesn't show any broken characters 
(the last time I had such problems, vi could identify the characters in 
question, other editors just wouldn't show them).


Further ideas from my side: Is the index too big? I think I read 
somewhere that a large index is something around 10 million 
documents, and I aim to approximately double this number. But would this 
cause such an error? In the end: what exactly IS the error?


Sorry for the wall of text; I'm just trying to describe the problem in as 
much detail as possible. Thanks a lot for reading and I appreciate any 
ideas! :)


Best regards,

Erik
15.11.2010 11:08:22 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1289465394071
15.11.2010 18:16:06 org.apache.solr.handler.dataimport.SolrWriter upload
WARNUNG: Error creating document : SolrInputDocument[{pmid=pmid(1.0)={8817856}, 
xml=xml(1.0)={<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID>8817856</PMID>
<DateCreated>
<Year>1996</Year>
<Month>12</Month>
<Day>04</Day>
</DateCreated>
<DateCompleted>
<Year>1996</Year>
<Month>12</Month>
<Day>04</Day>
</DateCompleted>
<DateRevised>
<Year>2004</Year>
<Month>11</Month>
<Day>17</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0042-4900</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>138</Volume>
<Issue>26</Issue>
<PubDate>
<Year>1996</Year>
<Month>Jun</Month>
<Day>29</Day>
</PubDate>
</JournalIssue>
<Title>The Veterinary record</Title>
<ISOAbbreviation>Vet. Rec.</ISOAbbreviation>
</Journal>
<ArticleTitle>Restoring confidence in beef: towards a European 
solution.</ArticleTitle>
<Pagination>
<MedlinePgn>631-2</MedlinePgn>
</Pagination>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType>News</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>ENGLAND</Country>
<MedlineTA>Vet Rec</MedlineTA>
<NlmUniqueID>0031164</NlmUniqueID>
<ISSNLinking>0042-4900</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Animals</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Cattle</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Commerce</DescriptorName>

Re: DIH full-import failure, no real error message

2010-11-16 Thread Erik Fäßler
 Retrieval by ID would only be one possible use case; I'm still at the 
beginning of the project and I imagine adding more fields for more 
complicated queries in the future. I imagine a WHERE ... LIKE query over 
all the XML documents stored in a DBMS wouldn't be too performant ;)

And at a later stage I will process all these documents and add lots of 
metadata - by then at the latest, I will need a Lucene index rather than a 
database. So I'd be interested in solution ideas for my issue all the same.


Regards,

Erik

On 16.11.2010 at 11:35, Dennis Gearon wrote:

Wow, if all you want is to retrieve by ID, a database would be fine, even a
NoSQL database.


  Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Erik Fäßler erik.faess...@uni-jena.de
To: solr-user@lucene.apache.org
Sent: Tue, November 16, 2010 12:33:28 AM
Subject: DIH full-import failure, no real error message

Hey all,

I'm trying to create a Solr index for the 2010 Medline-baseline (www.pubmed.gov,
over 18 million XML documents). My goal is to be able to retrieve single XML
documents by their ID. Each document comes with a unique ID, the PubMedID. So my
schema (important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true" required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML document
(mostly below 5 kB). I used the DataImporter to do this. I had to write some
classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically,
the error could lie there.

What happens is that indexing just looks fine at the beginning. Memory usage is
quite below the maximum (max of 20g, usage of below 5g, most of the time around
3g). It goes several hours in this manner until it suddenly stops. I tried this
a few times with minor tweaks, none of which made any difference. The last time
such a crash occurred, over 16.5 million documents already had been indexed
(argh, so close...). It never stops at the same document, and trying to index the
documents where the error occurred just runs fine. Index size on disc was
between 40g and 50g the last time I had a look.

This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no closing
brackets. The document is cut in half in this message and not even the error
message itself is complete: The 'D' of
(D)ataImporter.runCmd(DataImporter.java:389) right after the document text is
missing.

I have one thought concerning this: I get the input documents as an InputStream
which I read buffer-wise (at most 1000 bytes per read() call). I need to deliver
the documents in one large byte array to the XML parser I use (VTD XML).
But I don't only get the individual small XML documents but always one larger
XML blob with exactly 30,000 of these documents. I use a self-written
EntityProcessor to extract the single documents from the larger blob. These
blobs have a size of about 50 to 150 MB. So what I do is read these large
blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>.
Afterwards, I create the final byte[] and do System.arraycopy from the
ArrayList into the byte[].
I tested this and it looks fine to me. And as I said, indexing the documents
where the error occurred just works fine (that is, indexing the whole blob
containing the single document). I just mention this because it kind of looks
like there is this cut in the document and the missing 'D' reminds me of
char-encoding errors. But I don't know for sure; opening the error log in vi
doesn't show any broken characters (the last time I had such problems, vi could
identify the characters in question, other editors just wouldn't show them).

Further ideas from my side: Is the index too big? I think I read somewhere that
a large index is something around 10 million documents, and I aim to
approximately double this number. But would this cause such an error? In the
end: what exactly IS the error?

Sorry for the wall of text; I'm just trying to describe the problem in as much
detail as possible. Thanks a lot for reading and I appreciate any ideas! :)

Best regards,

 Erik





Re: DIH full-import failure, no real error message

2010-11-16 Thread Erik Fäßler

 Thank you very much, I will have a read on your links.

The full-text red flag is exactly the reason why I'm testing this with 
Solr. As was said before by Dennis, I could also use a database as long 
as I don't need sophisticated query capabilities. To be honest, I don't 
know the performance gap between a Lucene index and a database in such a 
case. I guess I will have to test it.
This is intended as a substitute for holding every single file on disc. 
But I need the whole file information because it's not clear which 
information will be required in the future. And we don't want to 
re-index every time we add a new field (not yet, that is ;)).


Best regards,

Erik

On 16.11.2010 at 16:27, Erick Erickson wrote:

The key is that Solr handles merges by copying, and only after
the copy is complete does it delete the old index. So you'll need
at least 2x your final index size before you start, especially if you
optimize...

Here's a handy matrix of what you need in your index depending
upon what you want to do:
http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase

Leaving out what you don't use will help by shrinking your index.

The thing that jumps out is that you're storing your entire XML document
as well as indexing it. Are you expecting to return the document
to the user? Storing the entire document is a red flag; you
probably don't want to do this. If you need to return the entire
document some time, one strategy is to index whatever you need
to search, and index what you need to fetch the document from
an external store. You can index the values of selected tags as fields in
your documents. That would also give you far more flexibility
when searching.

Best
Erick




On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler erik.faess...@uni-jena.de wrote:


  Hello Erick,

I guess I'm the one asking for pardon - but surely not you! It seems your
first guess could already be the correct one. Disc space IS kind of short
and I believe it could have run out; since Solr performs a rollback
after the failure, I didn't notice (besides the fact that this is one of our
server machines, but apparently the wrong mount point...).

I'm not yet absolutely sure of this, but it would explain a lot and it really
looks like it. So thank you for this maybe not so obvious hint :)

But you also mentioned the merging strategy. I left everything at the
defaults that come with the Solr download concerning these things.
Could it be that such a large index needs different treatment? Could you
point me to a wiki page or something where I can get a few tips?

Thanks a lot, I will try building the index on a partition with enough
space, perhaps that will already do it.

Best regards,

Erik

On 16.11.2010 at 14:19, Erick Erickson wrote:

 Several questions. Pardon me if they're obvious, but I've spent far
too much of my life overlooking the obvious...

1> Is it possible you're running out of disk? 40-50G could suck up
a lot of disk, especially when merging. You may need that much again
free when a merge occurs.
2> Speaking of merging, what are your merge settings? How are you
triggering merges? See mergeFactor and associated settings in solrconfig.xml.
3> You might get some insight by removing the Solr indexing part: can
you spin through your parsing from beginning to end? That would
eliminate your questions about whether your XML parsing is the
problem.


40-50G is a large index, but it's certainly within Solr's capability,
so you're not hitting any built-in limits.

My first guess would be that you're running out of disk, at least
that's the first thing I'd check next...

Best
Erick

On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler erik.faess...@uni-jena.de

wrote:

   Hey all,

I'm trying to create a Solr index for the 2010 Medline-baseline (
www.pubmed.gov, over 18 million XML documents). My goal is to be able to
retrieve single XML documents by their ID. Each document comes with a
unique
ID, the PubMedID. So my schema (important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true"
required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML
document (mostly below 5 kB). I used the DataImporter to do this. I had to
write some classes (DataSource, EntityProcessor, DateFormatter) myself,
so
theoretically, the error could lie there.

What happens is that indexing just looks fine at the beginning. Memory
usage is quite below the maximum (max of 20g, usage of below 5g, most of the
time around 3g). It goes several hours in this manner until it suddenly
stops. I tried this a few times with minor tweaks, none of which made any
difference. The last time such a crash occurred, over