Plain-text .txt files and MS Office .doc files are very different beasts.
You can handle simple .txt files by adding a few more lines to your
DataImportHandler script.
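
If you would rather split the text yourself before it reaches Solr, here is a minimal Python sketch of the idea. It is only an illustration: the rule "first non-empty line is the title, the rest is the body" and the field names id, title and body are assumptions made up for this example, and they would have to match your own files and schema.xml.

```python
from xml.sax.saxutils import escape, quoteattr


def split_text_into_fields(doc_id, text):
    """Split plain text into fields using a made-up convention:
    the first non-empty line is the title, everything after it is the body."""
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [i for i, line in enumerate(lines) if line]
    if not non_empty:
        return {"id": doc_id, "title": "", "body": ""}
    first = non_empty[0]
    title = lines[first]
    body = "\n".join(lines[first + 1:]).strip()
    return {"id": doc_id, "title": title, "body": body}


def to_solr_add_xml(fields):
    """Render the fields as a Solr XML <add> message (hypothetical field names)."""
    field_xml = "".join(
        "<field name=%s>%s</field>" % (quoteattr(name), escape(value))
        for name, value in fields.items()
    )
    return "<add><doc>%s</doc></add>" % field_xml


if __name__ == "__main__":
    sample = "My Title\n\nFirst line & more\nSecond"
    print(to_solr_add_xml(split_text_into_fields("doc1", sample)))
```

The resulting XML can be posted to the normal /update handler. Any other splitting rule (fixed line positions, "Key: value" headers, a regex) would slot into split_text_into_fields the same way.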
For .doc files, the easiest route is the extracting request handler
("/update/extract" in the example solrconfig.xml). It is documented on the wiki.
If you want to do this inside the DataImportHandler, you need to use
3.x or the trunk, and that code still has bugs.
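
For the extracting request handler, a rough sketch of how the request URL could be put together in Python. The base URL, the document id, the author value and the target field name "text" are all assumptions for illustration. The literal.* and fmap.* parameters are the ones the handler uses to set field values directly and to map extracted content onto your own field names.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, extra_literals=None, content_field="text"):
    """Build a URL for the extracting request handler.

    solr_base      -- e.g. "http://localhost:8983/solr" (assumed location)
    doc_id         -- value for the unique key field, assumed to be "id"
    extra_literals -- optional dict of field name -> value to set as-is
    content_field  -- schema field that receives the extracted body text (assumed name)
    """
    params = [
        ("literal.id", doc_id),
        ("fmap.content", content_field),
        ("commit", "true"),
    ]
    # Sorted so the generated URL is stable from run to run.
    for name, value in sorted((extra_literals or {}).items()):
        params.append(("literal." + name, value))
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("http://localhost:8983/solr", "doc1", {"author": "Jane Doe"}))
```

The .doc file itself then goes in the body of a POST to that URL (for example as a multipart file upload), and Tika does the parsing on the Solr side.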

On Wed, Sep 29, 2010 at 3:55 PM, Savannah Beckett
<savannah_becket...@yahoo.com> wrote:
> No, I am using xpath for html, this is not the question.  I am indexing pure
> text in addition to html that I was indexing.  Pure text like TXT file or
> Microsoft Word doc.  So, no xpath for TXT, how do I index TXT file into
> different fields in my index like the way I use xpath to index html into
> different fields in my index?
>
> My question is referring to pure TXT like .txt file and microsoft word, not
> html.  I am completely fine with html.
> Thanks.
>
>
>
>
> ________________________________
> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wed, September 29, 2010 2:59:26 PM
> Subject: Re: How to Index Pure Text into Seperate Fields?
>
> Can you provide a few more details? You mention xpath, which leads me
> to believe that you are using DIH, is that true? How are you getting
> your documents to index? Parts of a filesystem?
>
> Because it's possible to do many things. If you're using DIH against a
> filesystem, you could use two fileDataSources, one that works only on files
> with a particular extension (xml, say) and another that processes .txt files.
>
> But that said, if you're trying to index "just the text" of a Word document,
> you have to parse it quite differently than a plain text file. Take a look at
> Tika.
>
> All of which may not help you at all, because I'm guessing...
>
> So I think a more complete problem statement would help us help you.
>
> Best
> Erick
>
> On Wed, Sep 29, 2010 at 3:56 PM, Savannah Beckett <
> savannah_becket...@yahoo.com> wrote:
>
>> Hi,
>>  I am using xpath to index different parts of the html pages into different
>> fields.  Now, I have some pure text documents that have no html, so I can't
>> use xpath.  How do I index this pure text into different fields of the index?
>> How do I make nutch/solr understand that these different parts belong to
>> different fields?  Maybe I can use existing content in the fields in my index?
>> Thanks.
>>
>>
>>
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com
