Also, I believe that the output format matters. If your output is TEXTFILE I
think that all of the reducers can append to the same file concurrently.
However for block-based output formats, that isn’t possible.
From: Furcy Pin [mailto:pin.fu...@gmail.com]
Sent: Wednesday, August 08, 2018 9:58
are standard in
case of all files. Any idea how the schema would look if I use the Stingray
reader? I am guessing it would be more like
string, string, string, array(strings)?
-Nishanth
On Fri, Jun 2, 2017 at 10:51 AM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
I wrote some custom python parsing scripts using StingRay Reader (
http://stingrayreader.sourceforge.net/cobol.html ) that read in the copybooks
and use the results to automatically generate hive table schema based on the
source copybook. The EBCDIC data is then extracted to TAB separated ASCII
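As a rough sketch of the schema-generation step (this deliberately does not use the Stingray API itself; the helper, field list, and table name below are all hypothetical), the idea is just to turn parsed (field name, Hive type) pairs into a CREATE TABLE statement for the TAB separated extract:
# hypothetical sketch: build Hive DDL from fields parsed out of a copybook
def copybook_to_hive_ddl(table_name, fields):
    cols = ",\n  ".join("`%s` %s" % (name, hive_type) for name, hive_type in fields)
    return ("CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n  %s)\n"
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
            "STORED AS TEXTFILE;") % (table_name, cols)

print(copybook_to_hive_ddl("db.my_copybook_table",
                           [("item_id", "string"), ("amount", "decimal(10,2)")]))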
XISTS db.mytable (
`item_id` string,
`timestamp` string,
`item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;
and supposing that we have the data "in hand" (in memory or as CSV files) how
does one go about the 'proper split and partition' so it
ally
2) do anything specific in the input files / with the input files in order to
make partitioning work, or does Hive just take the data and take full care of
partitioning it?
On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
For A) I’d recommend mapping an EXTERNAL table to the raw/original source
files…then you can just run a SELECT query from the EXTERNAL source and INSERT
into your destination.
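A minimal sketch of that flow against the table above (the external table name, columns, and location are made up), using dynamic partitioning so Hive creates the `date`/`content_type` partitions from the data:
-- hypothetical external table mapped over the raw files
CREATE EXTERNAL TABLE raw_src (
  `item_id` string,
  `timestamp` string,
  `item_comments` string,
  `date` string,
  `content_type` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw_src';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- partition columns go last in the SELECT for dynamic partitioning
INSERT OVERWRITE TABLE db.mytable PARTITION (`date`, `content_type`)
SELECT `item_id`, `timestamp`, `item_comments`, `date`, `content_type`
FROM raw_src;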
LOAD DATA can be very useful when you are trying to move data between two
tables that share the same schema but 1 table
FWIW, the wiki states that the function returns a string
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
From: Long, Andrew [loand...@amazon.com]
Sent: Thursday, June 30, 2016 5:31 PM
To: user@hive.apache.org
Su
This is really outside of the scope of Hive and would probably be better
addressed by the Spark community, however I can say that this very much depends
on your use case
Take a look at this discussion if you haven't already:
https://groups.google.com/forum/embed/#!topic/spark-users/GQoxJHAAt
reading this:
"but when I add 2000 new titles with 300 rows each"
I'm thinking that you are over-partitioning your data
I'm not sure exactly how that relates to the OOM error you are getting (it may
not). I'd test things out partitioning by date-only, maybe date +
title_type, but adding
if you are doing group by, you could have potential duplicates on your
concat_ws. Take a look at using collect_set or collect_list. If you do
select col_a,
collect_set(concat_ws(', ',col_b,col_c))
from t
you will have an array of unique collection pairs...collect_list will give you
all pairs.
pawn mr jobs.
Shirish
On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
My $0.02
If you are running multiple concurrent queries on the data, you are probably
doing it wrong (or at least inefficiently), although this somewhat depends on
what type of files are backing your hive warehouse...
Let's assume that your data is NOT backed by ORC/parquet files, and tha
if your only problem with #2 is the issue of creating the external table, you
should be able to throw together a script running as a more privileged user
that could handle the task of creating the external table. Once the table is
created, the user should be able to access the read-only data.
In my opinion, this ultimately becomes a resource balance issue that you'll
need to test.
You have a fixed amount of memory (although you haven't said what it is). As
you increase the number of tasks, the available memory per task will decrease.
If the tasks run out of memory, they will either
collect_list(col) will give you an array with all of the data from that column
However, the scalability of this approach will have limits.
-Original Message-
From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Monday, March 28, 2016 5:47 PM
To: user@hive.apache.org
Subject:
the query that you are using would have to be analyzed to know how much it
could be optimized.
The small tables should be able to be handled with a map-join, depending on
hive version, that may be happening automatically.
Hive will be doing the joins in stages.
You could manually implement the st
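If the automatic conversion isn't happening on your version, a sketch of forcing a map-join, either by setting or by hint (big_fact/small_dim are placeholder names):
SET hive.auto.convert.join=true;
-- or hint the small table explicitly
SELECT /*+ MAPJOIN(small_dim) */ f.id, d.label
FROM big_fact f
JOIN small_dim d ON (f.dim_id = d.id);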
ORC files = optimized RC files
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
Parquet is similar to ORC, but a bit different.
http://parquet.apache.org/documentation/latest/
Parquet is a bit more of a "standard" file format outside of Hive, while ORC
files are primarily us
I'm very aware of the "textbook" approach to creating a partitioned table.
I'm searching for an easy/repeatable solution for the following workflow
requirements
1) An initial complex source query, with multiple joins from different source
tables, field substring extracts, type conversions, etc
If your original source is text, why don't you make your ORC-based table a Hive
managed table instead of an external table?
Then you can load/partition your text data into the external table, query from
that and insert into your ORC-backed Hive managed table.
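A rough sketch of that load path with made-up table names (a text-backed external staging table feeding an ORC-backed, managed, partitioned table):
-- managed destination, stored as ORC
CREATE TABLE events_orc (
  `item_id` string,
  `item_comments` string)
PARTITIONED BY (`date` string)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE events_orc PARTITION (`date`)
SELECT `item_id`, `item_comments`, `date`
FROM events_text_ext;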
Theoretically, if you had your data
using Hive
Ryan,
Can you perhaps point me to example(s) of how this is done in Hive?
Thanks,
J. B. Rawlings
Senior Consultant
C: 425.233.1315
www.societyconsulting.com
From: Ryan Harris [mailto:ryan.har...@zionsbancorp.com]
Sent: Monday, February 1, 2016
https://github.com/myui/hivemall
as long as you are comfortable with java UDFs, the sky is really the
limit...it's not for everyone and spark does have many advantages, but they are
two tools that can complement each other in numerous ways.
I don't know that there is necessarily a universal "be
it can be done in hive...whether or not it is the "best choice" depends on
whether or not you have any other reason for your data to be in hive.
If you are wondering whether Hive is the best tool for accomplishing this one
task...it would probably be easier to do in pig.
From: JB Rawlings [mail
Mich, if you have a toolpath that you can use to pipeline the required edits to
the source file, you can use a chain similar to this:
hadoop fs -text ${hdfs_path}/${orig_filename} | iconv -f EBCDIC-US -t ASCII |
sed 's/\(.\{133\}\)/\1\n/g' | gzip -c | /usr/bin/hadoop fs -put -
/etl/${table_name
either use a multi table insert to write the results of the source table into
another file/table:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT
or use windowing and analytics functions to run a count over the entire table
as a separate results co
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableLike
-Original Message-
From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, December 10, 2015 11:09 AM
To: user@hive.apache.org
Subject: Create hive table with same sc
Each record is being returned.
For each record, the last_seen_dt is calculated for the window.
It sounds like you are looking for the last record, which would be the record
where hit_time = last_seen_dt
try adding that as a where clause.
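Because the windowed last_seen_dt is computed for every row, the filter usually goes in an outer query; a sketch with hypothetical table/column names (hits, visitor_id):
SELECT *
FROM (
  SELECT h.*,
         MAX(hit_time) OVER (PARTITION BY visitor_id) AS last_seen_dt
  FROM hits h
) t
WHERE hit_time = last_seen_dt;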
From: Justin Workman [mailto:justinjwork...@gmail.com]
Sent
t a.X, a.Y, b.Z
insert OVERWRITE TABLE count_A select count(a.X)
insert OVERWRITE TABLE count_B select count(b.X)
;
From: Ryan Harris [mailto:ryan.har...@zionsbancorp.com]
Sent: Wednesday, December 02, 2015 4:20 PM
To: user@hive.apache.org
Subject: RE: how to get counts as a byproduct of a
Personally, I'd do it this way...
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Select suba.X, suba.Y, suba.countA, subb.Z, subb.countB
FROM
(SELECT x, y, count(1) OVER (PARTITION BY X) as countA) suba
JOIN
(SELECT x, z, count(1) OVER (PARTITION BY X) as co
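Filled out (A and B are placeholder source tables, and the join key is assumed to be x), that sketch reads roughly as:
SELECT suba.x, suba.y, suba.countA, subb.z, subb.countB
FROM
  (SELECT x, y, count(1) OVER (PARTITION BY x) AS countA FROM A) suba
JOIN
  (SELECT x, z, count(1) OVER (PARTITION BY x) AS countB FROM B) subb
ON (suba.x = subb.x);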
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT
From: Frank Luo [mailto:j...@merkleinc.com]
Sent: Wednesday, December 02, 2015 1:26 PM
To: user@hive.apache.org
Subject: RE: how to get counts as a byproduct of a query
Didn’t get any response, so tryi
r
opening brace.
So, could this be a bug?
Thanks,
Joel
On Tue, Oct 27, 2015 at 5:22 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
hmmm...I'm not sure what the return value type of json_tuple is...
I'd probably try creating a temporary table from your working q
SL2MQWwAAP4Qo.jpg","display_url":"pic.twitter.com/i3004WyF4g","type":"photo","url":"http://t.co/i3004WyF4g","id":654301608994586624,"media_url_https":"https://pbs.twimg.com/
tched: 1 row(s)
hive>
However, if I try to get any data from the json array it's failing.
Thanks,
Joel
On Tue, Oct 27, 2015 at 4:21 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
looking at your sample data, you shouldn't need to use lateral view explode
u
here tr2.id='654395184428515332'
LIMIT 1;
FAILED: UDFArgumentException explode() takes an array or a map as a parameter
Thanks,
Joel
On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
Do you have an example of the query that you tried (which failed).
In short, you probably want to use the get_json_object() UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
if you need the JSON array broken into individual records, you migh
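For example, get_json_object can index into an array directly; a sketch against a hypothetical tweets table whose json_col holds the raw JSON string (paths modeled on the Twitter sample quoted earlier in this thread):
SELECT get_json_object(json_col, '$.id') AS tweet_id,
       get_json_object(json_col, '$.entities.media[0].media_url') AS first_media_url
FROM tweets;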
If you want to use python...
The python script should expect tab-separated input on stdin and it should
return tab-separated delimited columns for the output...
add file mypython.py;
SELECT TRANSFORM (tbl.id, tbl.name, tbl.city)
USING 'python mypython.py'
AS (id, name, city, state)
FROM my_db.my_
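A minimal sketch of what mypython.py itself could look like (the state lookup is a stand-in; the real logic is whatever your transform needs):
#!/usr/bin/env python
# read tab-separated rows from stdin, emit tab-separated rows on stdout
import sys

for line in sys.stdin:
    id_, name, city = line.rstrip('\n').split('\t')
    state = 'UNKNOWN'  # placeholder transformation
    print('\t'.join([id_, name, city, state]))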
depending on how you are submitting the statement to hive, you'll probably need
to escape the backslash...
try replacing every \ with \\
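For example, if the pattern lives in a RegexSerDe table definition, each literal backslash in the regex ends up doubled inside the quoted property (table, columns, and pattern here are invented):
CREATE TABLE apache_log (
  host string,
  request string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+)\\s+(.*)"
);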
From: IT CTO [mailto:goi@gmail.com]
Sent: Thursday, October 01, 2015 6:25 AM
To: user@hive.apache.org
Subject: Re: Hive SerDe regex error
Your Regex gives
user@hive.apache.org; user@hive.apache.org
Subject: RE: Hive Generic UDF invoking Hbase
I believe it's not because of the classpath; for a single task / for streaming
it's working fine, right?
Sent from Outlook
On Wed, Sep 30, 2015 at 1:58 PM -0700, "Rya
t 1:38 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
Also...
mapreduce.input.fileinputformat.split.maxsize
and, what is the size of your input files?
From: Ryan Harris
Sent: Wednesday, September 30, 2015 2:37 PM
To: 'user@hive.apache.org'
@hive.apache.org
Date: Wed, 30 Sep 2015 17:19:18 +
Take a look at hive.fetch.task.conversion in
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties, try
setting to "none" or "minimal"
From: Ryan Harris
Sent: Wednesday
Also...
mapreduce.input.fileinputformat.split.maxsize
and, what is the size of your input files?
From: Ryan Harris
Sent: Wednesday, September 30, 2015 2:37 PM
To: 'user@hive.apache.org'
Subject: RE: CombineHiveInputFormat not working
what are your values for:
mapred.min.
what are your values for:
mapred.min.split.size
mapred.max.split.size
hive.hadoop.supports.splittable.combineinputformat
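For reference, those can be set per session before running the query; the values below are examples only, not recommendations:
SET mapred.min.split.size=268435456;
SET mapred.max.split.size=268435456;
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;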
From: Pradeep Gollakota [mailto:pradeep...@gmail.com]
Sent: Wednesday, September 30, 2015 2:20 PM
To: user@hive.apache.org
Subject: CombineHiveInputFormat not working
Hi all,
This may be a bit of a 'hack', but I've found that basic select-only operations
will often cause Hive to stream data without running the job through an actual
MR phase. That would typically be a logical approach for a "give me
everything" query if it were not for the UDF...
try adding a basic wh
the fact that you have other data in the column (like letters) implies that you
have the column stored as a string, so use a regex.
SELECT CAST(mycol AS BIGINT) WHERE mycol RLIKE '^-?[0-9.]+$'
From: Mohit Durgapal [mailto:durgapalmo...@gmail.com]
Sent: Wednesday, September 02, 2015 5:09 AM
To
d create bigger files
if necessary.
On Tue, Aug 25, 2015 at 11:57 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
A few things..
1) If you are using spark streaming, I don't see any reason why the output of
your spark streaming can't match the necessary destination
You need to be a bit more clear with your environment and objective here
What is your back-end execution engine? MapReduce, Spark, or Tez?
What are you using for resource management? YARN or MapReduce?
The running time of one query in the presence of other queries will entirely
depend on the
A few things..
1) If you are using spark streaming, I don't see any reason why the output of
your spark streaming can't match the necessary destination format...you
shouldn't need a second job to read the output from Spark Streaming and convert
to parquet. Do a search for spark streaming and la
remember that transform scripts in hive should receive data from STDIN and
return results to STDOUT. So, to properly test your transform script try this:
hive -e "select id from test limit 10" > testout.txt
cat testout.txt | python transform_value.py
if your transform script is working correctl
most are parquet settings
from
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
* # The block size is the size of a row group being buffered in memory
* # this limits the memory usage when writing
* # Larger v
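Those ParquetOutputFormat properties can also be overridden from the Hive session; an illustrative (not prescriptive) example:
SET parquet.block.size=134217728;
SET parquet.page.size=1048576;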
join. No other reason
On Aug 3, 2015 10:47 AM, "Ryan Harris"
<ryan.har...@zionsbancorp.com> wrote:
Unless you are using bucketing and sampling, there is no benefit (that I can
think of) to informing hive that the data *is* in fact sorted...
If there is something specific you are trying to accomplish by specifying the
sort order of that column, perhaps you can elaborate on that. Otherwise, le
You probably want to be using the UDF get_json_object(), I added to this
stackoverflow post
[http://stackoverflow.com/questions/24447428/parse-json-arrays-using-hive]
a few months ago...the problem was specific to top-level JSON arrays, and is
related to JIRA HIVE-1575 [https://issues.apache.or
this should get you on the right path:
https://issues.apache.org/jira/browse/HIVE-7121
From: Connell Donaghy [mailto:cdona...@pinterest.com]
Sent: Monday, July 13, 2015 2:50 PM
To: user@hive.apache.org
Subject: DISTRIBUTE BY question
Hey! I'm trying to write a tool which uses a storagehandler to
In hive 0.12, the Abstract Syntax Tree output format when using "EXPLAIN
EXTENDED" matched what is in the wiki:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
As an example, consider the following EXPLAIN query:
EXPLAIN
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.
you *should* be able to do:
CREATE TABLE my_table_2 LIKE my_table;
dfs -cp /user/hive/warehouse/my_table/* /user/hive/warehouse/my_table_2/
MSCK repair table my_table_2;
From: Devopam Mittra [mailto:devo...@gmail.com]
Sent: Thursday, June 18, 2015 10:12 PM
To: user@hive.apache.org
Subject: Re: Updati
It looks like the OVER clause currently supports the aggregate functions
(count, sum, min, max, avg, ntile).
Is there any plan to include support for other built-in aggregate functions
like collect_set() ?