They're not mutually exclusive. Part of your index size is because you
*store* the full XML, which means that a verbatim copy of the raw data is
placed in the index along with the searchable terms, tags included. This
only makes sense if you're going to return the original data to the user
AND use the index to hold it.

Storing has nothing to do with searching (again, pardon me if this is
obvious), which can be confusing. I claim you could reduce the size of your
index dramatically, without losing any search capability, simply by NOT
storing the XML blob and only indexing it. But that may not be what you
need to do; only you know your problem space...
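
For example (just a sketch against the schema you posted), the xml field
stays searchable but no longer keeps a verbatim copy in the index:

<field name="xml" type="text" indexed="true" stored="false"/>

Everything else stays as it is; you just can't get the original blob back
out of Solr any more.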

Which brings up the question of whether it makes sense to index the
XML tags at all, but again that will be defined by your problem space. If
you have a well-defined set of input tags, you could consider indexing
each tag in its own field, but the queries then get more complicated.
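
For instance (the field names here are invented, they'd come from your
document structure), something like:

<field name="title" type="text" indexed="true" stored="false"/>
<field name="abstract" type="text" indexed="true" stored="false"/>
<field name="journal" type="text" indexed="true" stored="false"/>

would let you query title:cancer AND abstract:p53 and so on, at the cost of
a more involved import and more complicated query building.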

I've seen more than a few situations where trying to use an RDBMS's
search capabilities stops working as the database gets larger, and
yours qualifies as "larger". In particular, RDBMSs don't have very
sophisticated search capabilities, and the speed gets pretty bad.
That's OK, because Solr doesn't have very good join capabilities either;
different tools for different problems.

Best
Erick

On Tue, Nov 16, 2010 at 12:16 PM, Erik Fäßler <erik.faess...@uni-jena.de>wrote:

>  Thank you very much, I will have a read of your links.
>
> The full-text red flag is exactly the reason I'm testing this with Solr.
> As Dennis said before, I could also use a database as long as I don't
> need sophisticated query capabilities. To be honest, I don't know the
> performance gap between a Lucene index and a database in such a case. I
> guess I will have to test it.
> This is intended as a substitute for holding every single file on disc.
> But I need the whole file's information because it's not yet clear which
> information will be required in the future. And we don't want to re-index
> every time we add a new field (not yet, that is ;)).
>
> Best regards,
>
>    Erik
>
> Am 16.11.2010 16:27, schrieb Erick Erickson:
>
>> The key is that Solr handles merges by copying, and only after
>> the copy is complete does it delete the old index. So you'll need
>> at least 2x your final index size before you start, especially if you
>> optimize...
>>
>> Here's a handy matrix of what you need in your index depending
>> upon what you want to do:
>>
>> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>>
>> Leaving out what you don't use will help by shrinking your index.
>>
>>
>> The thing that jumps out is that you're storing your entire XML document
>> as well as indexing it. Are you expecting to return the document
>> to the user? Storing the entire document is a red flag; you
>> probably don't want to do this. If you need to return the entire
>> document some time, one strategy is to index whatever you need
>> to search, and index what you need to fetch the document from
>> an external store. You can index the values of selected tags as fields in
>> your documents. That would also give you far more flexibility
>> when searching.
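>>
>> Just as a sketch (the field names are invented, adapt them to your data):
>> keep a stored pointer to wherever the raw XML lives and leave the
>> searchable fields unstored, e.g.
>>
>> <field name="pmid" type="string" indexed="true" stored="true"/>
>> <field name="xml_path" type="string" indexed="false" stored="true"/>
>> <field name="text" type="text" indexed="true" stored="false"/>
>>
>> Then you return pmid/xml_path from Solr and fetch the full document from
>> the external store.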
>>
>> Best
>> Erick
>>
>>
>>
>>
>> On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler<erik.faess...@uni-jena.de
>> >wrote:
>>
>>   Hello Erick,
>>>
>>> I guess I'm the one asking for pardon - but surely not you! It seems your
>>> first guess could already be the correct one. Disc space IS kind of short
>>> and I believe it could have run out; since Solr performs a rollback
>>> after the failure, I didn't notice (besides the fact that this is one of
>>> our server machines, but apparently the wrong mount point...).
>>>
>>> I'm not yet absolutely sure of this, but it would explain a lot and it
>>> really looks like it. So thank you for this maybe not so obvious hint :)
>>>
>>> But you also mentioned the merging strategy. I left everything at the
>>> defaults that come with the Solr download as far as that is concerned.
>>> Could it be that such a large index needs a different treatment? Could you
>>> point me to a wiki page or something where I can get a few tips?
>>>
>>> Thanks a lot, I will try building the index on a partition with enough
>>> space, perhaps that will already do it.
>>>
>>> Best regards,
>>>
>>>    Erik
>>>
>>> Am 16.11.2010 14:19, schrieb Erick Erickson:
>>>
>>>> Several questions. Pardon me if they're obvious, but I've spent faaaar
>>>> too much of my life overlooking the obvious...
>>>>
>>>> 1>   Is it possible you're running out of disk? 40-50G could suck up
>>>> a lot of disk, especially when merging. You may need that much again
>>>> free when a merge occurs.
>>>> 2>   Speaking of merging, what are your merge settings? How are you
>>>> triggering merges? See <mergeFactor> and associated settings in
>>>> solrconfig.xml (a rough sketch of that section is just below these points).
>>>> 3>   You might get some insight by removing the Solr indexing part: can
>>>> you spin through your parsing from beginning to end? That would
>>>> eliminate your questions about whether your XML parsing is the
>>>> problem.
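>>>>
>>>> Just as a sketch, the 1.4-style defaults look roughly like this (the
>>>> values in your copy of solrconfig.xml may differ):
>>>>
>>>> <indexDefaults>
>>>>   <mergeFactor>10</mergeFactor>
>>>>   <ramBufferSizeMB>32</ramBufferSizeMB>
>>>> </indexDefaults>
>>>>
>>>> A higher mergeFactor means merges happen less often but touch more
>>>> segments at once.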
>>>>
>>>>
>>>> 40-50G is a large index, but it's certainly within Solr's capability,
>>>> so you're not hitting any built-in limits.
>>>>
>>>> My first guess would be that you're running out of disk, at least
>>>> that's the first thing I'd check next...
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler<erik.faess...@uni-jena.de
>>>>
>>>>> wrote:
>>>>>
>>>>   Hey all,
>>>>
>>>>> I'm trying to create a Solr index for the 2010 Medline baseline
>>>>> (www.pubmed.gov, over 18 million XML documents). My goal is to be able to
>>>>> retrieve single XML documents by their ID. Each document comes with a
>>>>> unique ID, the PubMedID. So my schema (important portions) looks like this:
>>>>>
>>>>> <field name="pmid" type="string" indexed="true" stored="true"
>>>>> required="true" />
>>>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>>>
>>>>> <uniqueKey>pmid</uniqueKey>
>>>>> <defaultSearchField>pmid</defaultSearchField>
>>>>>
>>>>> pmid holds the ID, date holds the creation date; xml holds the whole XML
>>>>> document (mostly below 5kb). I used the DataImporter to do this. I had to
>>>>> write some classes (DataSource, EntityProcessor, DateFormatter) myself, so
>>>>> theoretically the error could lie there.
>>>>>
>>>>> What happens is that indexing looks just fine at the beginning. Memory
>>>>> usage stays well below the maximum (max of 20g, usage below 5g, most of
>>>>> the time around 3g). It goes on in this manner for several hours until it
>>>>> suddenly stops. I tried this a few times with minor tweaks, none of which
>>>>> made any difference. The last time such a crash occurred, over 16.5 million
>>>>> documents had already been indexed (argh, so close...). It never stops at
>>>>> the same document, and indexing the documents where the error occurred
>>>>> again afterwards runs just fine. Index size on disc was between 40g and
>>>>> 50g the last time I had a look.
>>>>>
>>>>> This is the log from beginning to end:
>>>>>
>>>>> (I decided to just attach the log for the sake of readability ;) ).
>>>>>
>>>>> As you can see, Solr's error message is not quite complete. There are no
>>>>> closing brackets. The document is cut in half in this message, and not
>>>>> even the error message itself is complete: the 'D' of
>>>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document text
>>>>> is missing.
>>>>>
>>>>> I have one thought concerning this: I get the input documents as an
>>>>> InputStream which I read buffer-wise (at most 1000 bytes per read() call).
>>>>> I need to deliver the documents in one large byte array to the XML parser
>>>>> I use (VTD XML).
>>>>> But I don't get the individual small XML documents on their own; I always
>>>>> get one larger XML blob with exactly 30,000 of these documents. I use a
>>>>> self-written EntityProcessor to extract the single documents from the
>>>>> larger blob. These blobs have a size of about 50 to 150mb. So what I do is
>>>>> read these large blobs in 1000-byte steps and store each byte array in an
>>>>> ArrayList<byte[]>. Afterwards, I create the final byte[] and do
>>>>> System.arraycopy from the ArrayList into that byte[].
>>>>> I tested this and it looks fine to me. And as I said, indexing the
>>>>> documents where the error occurred just works fine (that is, indexing the
>>>>> whole blob containing the single document). I only mention this because it
>>>>> kind of looks like there is this cut in the document, and the missing 'D'
>>>>> reminds me of char-encoding errors. But I don't know for sure; opening the
>>>>> error log in vi doesn't show any broken characters (the last time I had
>>>>> such problems, vi could identify the characters in question while other
>>>>> editors just wouldn't show them).
>>>>>
>>>>> Further ideas from my side: Is the index too big? I think I read somewhere
>>>>> that a large index is something around 10 million documents, and I aim at
>>>>> approximately double that number. But would this cause such an error? And
>>>>> in the end: what exactly IS the error?
>>>>>
>>>>> Sorry for all the text, I'm just trying to describe the problem in as much
>>>>> detail as possible. Thanks a lot for reading, and I appreciate any
>>>>> ideas! :)
>>>>>
>>>>> Best regards,
>>>>>
>>>>>    Erik
>>>>>
>>>>>
>>>>>
>
